R语言笔记四
来源:互联网 发布:甲醛无色无味 知乎 编辑:程序博客网 时间:2024/05/17 23:17
str function
str: Compacktly display the internal structure of an R object
A diagnostic function and an alternative to "summary"
It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists.
Roughly one line per basic object
> str(str)function (object, ...) > str(lm)function (formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...) > str(ls)function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, pattern, sorted = TRUE) > x <- rnorm(100,2,4)> summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max. -5.8020 -0.8004 2.2660 2.0920 4.3450 13.2600 > str(x) num [1:100] 5.6 2.71 3 5.35 -5.61 ...> f<- gl(40,10)> str(f) Factor w/ 40 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...> summary(f) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 27 28 29 30 31 32 33 34 35 36 37 38 39 40 10 10 10 10 10 10 10 10 10 10 10 10 10 10 > library(datasets)> head(airquality) Ozone Solar.R Wind Temp Month Day1 41 190 7.4 67 5 12 36 118 8.0 72 5 23 12 149 12.6 74 5 34 18 313 11.5 62 5 45 NA NA 14.3 56 5 56 28 NA 14.9 66 5 6> str(airquality)'data.frame': 153 obs. of 6 variables: $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ... $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ... $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ... $ Temp : int 67 72 74 62 56 66 65 59 61 69 ... $ Month : int 5 5 5 5 5 5 5 5 5 5 ... $ Day : int 1 2 3 4 5 6 7 8 9 10 ...> m<-matrix(rnorm(100),10,10)> str(m) num [1:10, 1:10] -0.706 1.507 0.171 1.577 -1.295 ...> m[,1] [1] -0.7060449 1.5071743 0.1711907 1.5765175 -1.2952681 -0.3800775 [7] -0.2051980 1.7714065 -1.8690810 1.1779065> s<-split(airquality, airquality$Month)> str(s)List of 5 $ 5:'data.frame': 31 obs. of 6 variables: ..$ Ozone : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ... ..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ... ..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ... ..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ... ..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ... ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ... $ 6:'data.frame': 30 obs. of 6 variables: ..$ Ozone : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ... ..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ... ..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ... ..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ... ..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ... ..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ... $ 7:'data.frame': 31 obs. of 6 variables: ..$ Ozone : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ... ..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ... ..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ... ..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ... ..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ... ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ... $ 8:'data.frame': 31 obs. of 6 variables: ..$ Ozone : int [1:31] 39 9 16 78 35 66 122 89 110 NA ... ..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ... ..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ... ..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ... ..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ... ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ... $ 9:'data.frame': 30 obs. of 6 variables: ..$ Ozone : int [1:30] 96 78 73 91 47 32 20 23 21 24 ... ..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ... ..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ... ..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ... ..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ... ..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
Simulation
Generating Random Numbers
Functions for probability distributions in R
rnorm: generate random Normal variates with a given mean and standard deviation
dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points)
pnorm: evaluate the cumulative distribution function for a Normal distribution
rpois: generate random Poisson variates with a given rate
Probability distribution functions usually have four functions associated with them. The functions are prefixed with a
d for density
r for random number generation
p for cumulative distribution
q for quantile function
Working with the Normal distributions requires using these four functions
dnorm(x, mean = 0, sd = 1, log = FALSE)pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)rnorm(n, mean = 0, sd = 1)> x<-rnorm(10)> x [1] -0.7035158 2.0931550 -1.5634806 -0.1795591 -0.6867991 -0.2751694 [7] -1.2051665 2.2951353 -0.5438907 0.4239533> x<-rnorm(10, 20, 2)> x [1] 17.81015 19.16094 19.54943 20.14397 22.13014 23.95761 23.51140 17.84877 [9] 19.09331 21.67001> summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max. 17.81 19.11 19.85 20.49 22.02 23.96
Setting the random number seed with set.seed ensures reproducibility
> set.seed(1)> rnorm(5)[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078> rnorm(5)[1] -0.8204684 0.4874291 0.7383247 0.5757814 -0.3053884> set.seed(1)> rnorm(5)[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
Always set the random number seed when conducting a simulation!
Generating Poisson data
> rpois(10, 1) [1] 3 1 0 1 0 0 1 0 1 1> rpois(10, 2) [1] 6 2 2 1 3 2 2 1 1 2> rpois(10, 20) [1] 20 11 21 20 20 21 17 15 24 20> ppois(2, 2) ## Cumulative distribution[1] 0.6766764 ## Pr(x <= 2)> ppois(4, 2)[1] 0.947347 ## Pr(x <= 4)> ppois(6, 2)[1] 0.9954662 ## Pr(x <= 6)
Generating Random Numbers From a Linear Generating Random Numbers From a Linear Model
Suppose we want to simulate from the following linear model
> set.seed(20)> x<-rnorm(100)> e<-rnorm(100,0,2)> y <- 0.5 + 2 * x + e> summary(y) Min. 1st Qu. Median Mean 3rd Qu. Max. -6.4080 -1.5400 0.6789 0.6893 2.9300 6.5050 > plot(x,y)
What if x is binary?
> set.seed(10)> x<-rbinom(100,1,0.5)> e<-rnorm(100,0,2)> y<-0.5 + 2 * x + e> summary(y) Min. 1st Qu. Median Mean 3rd Qu. Max. -3.4940 -0.1409 1.5770 1.4320 2.8400 6.9410 > plot(x,y)
Suppose we want to simulate from a Poisson model where
Y ~ Poisson(μ)
log μ =β0+ β1x
and β0 = 0.5 and β1 = 0.3. We need to use the rpois function for this
> set.seed(1)> x<-rnorm(100)> log.mu<-0.5 + 0.3 * x> y <- rpois(100, exp(log.mu))> summary(y) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.00 1.00 1.00 1.55 2.00 6.00 > plot(x,y)
Random Sampling
The sample function draws randomly from a specified set of (scalar) objects allowing you to sample from arbitrary distributions.
> set.seed(1)> sample(1:10, 4)[1] 3 4 5 7> sample(1:10, 4)[1] 3 9 8 5> sample(letters, 5)[1] "q" "b" "e" "x" "p"> sample(1:10) ## permutation [1] 4 710 6 9 2 8 3 1 5> sample(1:10) [1] 2 3 4 1 9 5 10 8 6 7> sample(1:10, replace = TRUE) ## Sample w/replacement [1] 2 9 7 8 2 8 5 9 7 8
Summary
Drawing samples from specific probability distributions can be done with r* functions
Standard distributions are built in: Normal, Poisson, Binomial, Exponential, Gamma, etc.
The sample function can be used to draw random samples from arbitrary vectors
Setting the random number generator seed via set.seed is critical for reproducibility
R Profiler
Why is My Code So Slow? Why is My Code So Slow?
Profiling is a systematic way to examine how much time is spend in different parts of a program
Useful when trying to optimize your code
Often code runs fine once, but what if you have to put it in a loop for 1,000 iterations? Is it still fast enough?
Profiling is better than guessing
On Optimizing Your Code On Optimizing Your Code
Getting biggest impact on speeding up code depends on knowing where the code spends most of its time
This cannot be done without performance analysis or profiling
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil –Donald Knuth
General Principles of Optimization General Principles of Optimiz
Design first, then optimize
Remember: Premature optimization is the root of all evil
Measure (collect data), don’t guess
If you're going to be scientist, you need to apply the same principles here!
Using system.time()
Takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the amount of time taken to evaluate the expression
Computes the time (in seconds) needed to execute an expression Returns an object of class proc_time
- If there’s an error, gives time until the error occurred
Returns an object of class proc_time
user time: time charged to the CPU(s) for this expression
elapsed time: “wall clock” timeUsually, the user time and elapsed time are relatively close, for straight computing tasks
Elapsed time may be greater than user time if the CPU spends a lot of time waiting around
Elapsted time may be smaller than the user time if your machine has multiple cores/processors (and is capable of using them)
- Multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL)
- Parallel processing via the parallel package
## Elapsed time > user timesystem.time(readLines("http://www.jhsph.edu")) user system elapsed 0.004 0.002 0.431## Elapsed time < user timehilbert <- function(n) { i <- 1:n 1 / outer(i - 1, i, "+”)}x <- hilbert(1000)system.time(svd(x)) user system elapsed 1.605 0.094 0.742
Timing Longer Expressions Timing Longer Expressions
system.time({ n <- 1000 r <- numeric(n) for (i in 1:n) { x <- rnorm(n) r[i] <- mean(x) }})## user system elapsed## 0.097 0.002 0.099
Beyond system.time()
Using system.time() allows you to test certain functions or code blocks to see if they are taking excessive amounts of time
Assumes you already know where the problem is and can call system.time() on it
What if you don’t know where to start?
The R Profiler
The Rprof() function starts the profiler in R
R must be compiled with profiler support (but this is usually the case)
The summaryRprof() function summarizes the output from Rprof() (otherwise it’s not readable)
DO NOT use system.time() and Rprof() together or you will be sad
R Profiler Raw Output R Profiler Raw Output
## lm(y ~ x)sample.interval=10000"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm""list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm""list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm""list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm""na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm""na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm""na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm""na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm""na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm""na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm""na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm""lm.fit" "lm""lm.fit" "lm""lm.fit" "lm"
Using summaryRprof()
The summaryRprof() function tabulates the R profiler output and calculates how much time is spend in which function
There are two methods for normalizing the data
"by.total" divides the time spend in each function by the total run time
"by.self" does the same but first subtracts out time spent in functions above in the call stack
By Total
By Self
summaryRprof() Output
$sample.interval[1] 0.02$sampling.time[1] 7.41
Summary
Rprof() runs the profiler for performance of analysis of R code
summaryRprof() summarizes the output of Rprof() and gives percent of time spent in each function (with two types of normalization)
Good to break your code into functions so that the profiler can give useful information about where time is being spent
C or Fortran code is not profiled
- R语言笔记四
- R语言学习笔记(四)
- R语言学习笔记(四)
- R语言学习笔记(四)
- R语言学习(四)
- R语言学习四
- R语言学习四
- 《R语言经典示例》学习笔记(四)
- 《R语言入门与实践》学习笔记四
- R语言学习笔记
- R语言笔记1
- R语言笔记
- R语言学习笔记
- R语言个人笔记
- R语言学习笔记
- R语言入门笔记
- R语言笔记
- R语言初学笔记
- SWI 软中断详解
- 同步、异步、阻塞和非阻塞的概念
- SQL Server建立基本数据表
- ora-19624 ora-19502 ora-27040
- 经典的线程池
- R语言笔记四
- 我的Java学习之路
- Grid View 网格视图——翻译自developer.android.com api guides
- libev 学习笔记之主体事件循环
- 2016-06-15-课程设计日志
- kali 私人笔记
- [Python]Merge Intervals
- WebView基本使用方法
- CentOS7 安装 MariaDB