R Programming, Week 4

Source: Internet  Editor: 程序博客网  Time: 2024/05/17 01:35

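4.1 The str function

str: compactly display the internal structure of an R object

A diagnostic function and an alternative to "summary"

Especially well suited to compactly displaying the contents of lists

Roughly one line per basic object

For example, str(lm)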
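
The points above can be tried out with a minimal sketch like the following. The objects x and cfg are made-up examples for illustration, not part of the original notes.

```r
# str() on a function prints its argument list
str(lm)

# str() on a numeric vector: type, length, and the first few values
x <- rnorm(100, mean = 2, sd = 4)   # hypothetical example data
str(x)

# str() on a nested list: roughly one line per basic component
cfg <- list(id = 1:3, label = "text", inner = list(flag = TRUE))
out <- capture.output(str(cfg))     # capture the printed lines as text
print(out)
```

Capturing the output with capture.output() is only done here to make the "one line per object" behaviour easy to inspect; in interactive use a plain str(cfg) is enough.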

4.2 Simulation

Generating Random Numbers

Functions for probability distributions in R

  • rnorm: generate random Normal variates with a given mean and standard deviation
  • dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points)
  • pnorm: evaluate the cumulative distribution function for a Normal distribution
  • qnorm: evaluate the quantile function of a Normal distribution (the inverse of pnorm)
  • The Poisson distribution has the same family of functions:
    dpois(x, lambda, log = FALSE)
    ppois(q, lambda, lower.tail = TRUE, log.p = FALSE)
    qpois(p, lambda, lower.tail = TRUE, log.p = FALSE)
    rpois(n, lambda)
  • So does the Binomial distribution:
    dbinom(x, size, prob, log = FALSE)
    pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
    qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)
    rbinom(n, size, prob)


Probability distribution functions usually have four functions associated with them. The functions are prefixed with a

  • d for density
  • r for random number generation
  • p for cumulative distribution
  • q for quantile function

Working with the Normal distributions requires using these four functions

dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
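
As a small sketch of how the four Normal functions relate to each other (the specific inputs 0, 1.96, and the mean/sd values are arbitrary choices for illustration):

```r
# d: density of the standard Normal at 0, which equals 1 / sqrt(2 * pi)
d0 <- dnorm(0)
d0

# p: cumulative probability Pr(X <= 1.96) for a standard Normal
p <- pnorm(1.96)
p                       # about 0.975

# q: the quantile function undoes pnorm
q <- qnorm(p)
q                       # recovers 1.96 up to floating-point error

# r: draw 5 random values with mean 10 and standard deviation 2
draws <- rnorm(5, mean = 10, sd = 2)
draws
```

The same d/p/q/r pattern applies to the Poisson and Binomial families listed earlier, only with different parameters.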

Setting the random number seed with set.seed ensures reproducibility

> set.seed(1)
> rnorm(5)
[1] -0.6264538  0.1836433 -0.8356286  1.5952808
[5]  0.3295078
> rnorm(5)
[1] -0.8204684  0.4874291  0.7383247  0.5757814
[5] -0.3053884
> set.seed(1)
> rnorm(5)
[1] -0.6264538  0.1836433 -0.8356286  1.5952808
[5]  0.3295078

Always set the random number seed when conducting a simulation!
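
One way to check reproducibility programmatically, rather than by eye, is sketched below. The seed value 2024 is an arbitrary choice for this example.

```r
set.seed(2024)              # arbitrary seed chosen for this sketch
first <- rnorm(5)

set.seed(2024)              # resetting the seed restarts the same stream
second <- rnorm(5)

identical(first, second)    # TRUE: same seed, same draws

third <- rnorm(5)           # no reset: the stream simply continues
identical(first, third)     # FALSE: these are the next five values
```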

Generating Poisson data

> rpois(10, 1)
 [1] 3 1 0 1 0 0 1 0 1 1
> rpois(10, 2)
 [1] 6 2 2 1 3 2 2 1 1 2
> rpois(10, 20)
 [1] 20 11 21 20 20 21 17 15 24 20
> ppois(2, 2)  ## Cumulative distribution
[1] 0.6766764  ## Pr(x <= 2)
> ppois(4, 2)
[1] 0.947347   ## Pr(x <= 4)
> ppois(6, 2)
[1] 0.9954662  ## Pr(x <= 6)
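
The cumulative values returned by ppois can be cross-checked against the density function dpois. This is a minimal sketch using the same rate of 2 as above; the variable names are made up for the example.

```r
lambda <- 2

# Pr(X <= 2) computed two ways: directly, and by summing the point masses
cdf_direct <- ppois(2, lambda)
cdf_summed <- sum(dpois(0:2, lambda))
all.equal(cdf_direct, cdf_summed)    # TRUE

# Upper tail Pr(X > 2) via lower.tail = FALSE
upper <- ppois(2, lambda, lower.tail = FALSE)
all.equal(upper, 1 - cdf_direct)     # TRUE

# The mean of many Poisson draws should be close to lambda
set.seed(1)
sim_mean <- mean(rpois(10000, lambda))
sim_mean
```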

Random Sampling

The sample function draws randomly from a specified set of (scalar) objects allowing you to sample from arbitrary distributions.

> set.seed(1)
> sample(1:10, 4)
[1] 3 4 5 7
> sample(1:10, 4)
[1] 3 9 8 5
> sample(letters, 5)
[1] "q" "b" "e" "x" "p"
> sample(1:10)  ## permutation
 [1]  4  7 10  6  9  2  8  3  1  5
> sample(1:10)
 [1]  2  3  4  1  9  5 10  8  6  7
> sample(1:10, replace = TRUE)  ## Sample w/replacement
 [1] 2 9 7 8 2 8 5 9 7 8
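
Sampling from an arbitrary (non-uniform) distribution can be sketched with the prob argument of sample. The categories and weights below are hypothetical values chosen for illustration.

```r
set.seed(1)
outcomes <- c("low", "mid", "high")   # hypothetical categories
weights  <- c(0.7, 0.2, 0.1)          # assumed probabilities, summing to 1

# Draw 1000 values with replacement, using the given probabilities
draws <- sample(outcomes, size = 1000, replace = TRUE, prob = weights)

# Empirical frequencies should be roughly in line with the weights
freq <- table(draws) / length(draws)
freq
```

Without replace = TRUE, sample cannot return more items than the vector contains, so weighted draws of this kind are normally done with replacement.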

Simulation

Summary

  • Drawing samples from specific probability distributions can be done with r* functions
  • Standard distributions are built in: Normal, Poisson, Binomial, Exponential, Gamma, etc.
  • The sample function can be used to draw random samples from arbitrary vectors
  • Setting the random number generator seed via set.seed is critical for reproducibility
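
As a brief sketch of the r* functions for a few of the built-in distributions named above (all parameter values are arbitrary choices for this example):

```r
set.seed(10)                                   # seed first, for reproducibility

exp_draws   <- rexp(3, rate = 2)               # Exponential with rate 2
gamma_draws <- rgamma(3, shape = 2, rate = 1)  # Gamma with shape 2, rate 1
binom_draws <- rbinom(3, size = 10, prob = 0.5) # Binomial: successes in 10 trials

exp_draws
gamma_draws
binom_draws
```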

4.3 The R Profiler
Some data-processing tasks take a long time to run; the R profiler shows where that time is actually being spent.

Why is My Code So Slow?

  • Profiling is a systematic way to examine how much time is spent in different parts of a program

  • Useful when trying to optimize your code

  • Often code runs fine once, but what if you have to put it in a loop for 1,000 iterations? Is it still fast enough?

  • Profiling is better than guessing

On Optimizing Your Code

  • Getting the biggest impact on speeding up code depends on knowing where the code spends most of its time

  • This cannot be done without performance analysis or profiling

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil

--Donald Knuth

General Principles of Optimization

  • Design first, then optimize

  • Remember: Premature optimization is the root of all evil

  • Measure (collect data), don’t guess.

  • If you're going to be a scientist, you need to apply the same principles here!

Using system.time()

  • Takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the amount of time taken to evaluate the expression

  • Computes the time (in seconds) needed to execute an expression

    • If there’s an error, gives time until the error occurred
  • Returns an object of class proc_time

    • user time: time charged to the CPU(s) for this expression
    • elapsed time: "wall clock" time, i.e. the time we actually experience
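
A minimal sketch of inspecting the object returned by system.time() follows. The workload (sorting a million random numbers) is an arbitrary example.

```r
# Time an arbitrary expression wrapped in curly braces
timing <- system.time({
    x <- rnorm(1e6)
    s <- sort(x)
})

class(timing)            # "proc_time"
names(timing)            # includes "user.self", "sys.self", "elapsed"

# Individual components can be extracted by name (values are in seconds)
timing[["user.self"]]
timing[["elapsed"]]
```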

  • Usually, the user time and elapsed time are relatively close, for straight computing tasks

  • Elapsed time may be greater than user time if the CPU spends a lot of time waiting around

  • Elapsed time may be smaller than the user time if your machine has multiple cores/processors (and is capable of using them)

    • Multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL)
    • Parallel processing via the parallel package

## Elapsed time > user time
## (most of the time goes to connecting to the server and waiting for the data to come back)
system.time(readLines("http://www.jhsph.edu"))
   user  system elapsed
  0.004   0.002   0.431

## Elapsed time < user time
## (the underlying linear algebra library split the computation across the two cores)
hilbert <- function(n) {
        i <- 1:n
        1 / outer(i - 1, i, "+")
}
x <- hilbert(1000)
system.time(svd(x))
   user  system elapsed
  1.605   0.094   0.742

Timing Longer Expressions

system.time({
    n <- 1000
    r <- numeric(n)
    for (i in 1:n) {
        x <- rnorm(n)
        r[i] <- mean(x)
    }
})

Beyond system.time()

  • Using system.time() allows you to test certain functions or code blocks to see if they are taking excessive amounts of time

  • Assumes you already know where the problem is and can call system.time() on it

  • What if you don’t know where to start?

The R Profiler

  • The Rprof() function starts the profiler in R

    • R must be compiled with profiler support (but this is usually the case)
  • The summaryRprof() function summarizes the output from Rprof() (otherwise it’s not readable)

  • DO NOT use system.time() and Rprof() together or you will be sad

  • Rprof() keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent in each function

  • Default sampling interval is 0.02 seconds

  • NOTE: If your code runs very quickly, the profiler is not useful, but then you probably don't need it in that case
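
Starting and stopping the profiler can be sketched as follows. The function slow_fit and the file name prof_file are hypothetical names introduced for this example, and the workload size is an arbitrary choice meant to run long enough to be sampled.

```r
# Hypothetical workload: fit a linear model on simulated data
slow_fit <- function(n = 2e5) {
    x <- rnorm(n)
    y <- 2 * x + rnorm(n)
    lm(y ~ x)
}

prof_file <- tempfile(fileext = ".out")   # where the raw samples are written
Rprof(prof_file)                          # start profiling (default interval 0.02 s)
for (i in 1:10) fit <- slow_fit()
Rprof(NULL)                               # stop profiling

# The raw file starts with a header line, followed by sampled call stacks
raw <- readLines(prof_file)
head(raw, 3)
```

The raw lines are hard to read directly, which is why summaryRprof() is normally used on the file afterwards.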

R Profiler Raw Output

## lm(y ~ x)

sample.interval=10000
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"lm.fit" "lm"
"lm.fit" "lm"
"lm.fit" "lm"

Using summaryRprof()

  • The summaryRprof() function tabulates the R profiler output and calculates how much time is spent in which function

  • There are two methods for normalizing the data

  • "by.total" divides the time spent in each function by the total run time

  • "by.self" does the same but first subtracts out time spent in the functions it calls (those above it in the call stack)

  • Top-level functions often just call other functions and do little work themselves, so the time attributed to them is usually not very informative. What matters is the time spent in the functions that do the real work, hence by.self
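
The two normalizations can be compared on a small profiling run, sketched below. The function slow_fit and the file name prof_file are hypothetical, and the workload is an arbitrary one chosen to run long enough for the profiler to collect samples.

```r
# Hypothetical workload: fit a linear model on simulated data
slow_fit <- function(n = 2e5) {
    x <- rnorm(n)
    y <- 2 * x + rnorm(n)
    lm(y ~ x)
}

prof_file <- tempfile()
Rprof(prof_file)                  # start profiling
for (i in 1:10) fit <- slow_fit()
Rprof(NULL)                       # stop profiling

prof <- summaryRprof(prof_file)   # a list with several components
names(prof)

head(prof$by.total)   # time including everything each function calls
head(prof$by.self)    # time spent in each function's own code only
prof$sampling.time    # total profiled time in seconds
```

In runs like this, wrapper functions tend to rank high in by.total while the functions doing the numerical work rank high in by.self; the exact rows depend on the machine and R version.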

By Total

$by.total
## pct is the percentage of the total run time; lm is the top-level call here, so it has the largest total time
                      total.time total.pct self.time self.pct
"lm"                        7.41    100.00      0.30     4.05
"lm.fit"                    3.50     47.23      2.99    40.35
"model.frame.default"       2.24     30.23      0.12     1.62
"eval"                      2.24     30.23      0.00     0.00
"model.frame"               2.24     30.23      0.00     0.00
"na.omit"                   1.54     20.78      0.24     3.24
"na.omit.data.frame"        1.30     17.54      0.49     6.61
"lapply"                    1.04     14.04      0.00     0.00
"[.data.frame"              1.03     13.90      0.79    10.66
"["                         1.03     13.90      0.00     0.00
"as.list.data.frame"        0.82     11.07      0.82    11.07
"as.list"                   0.82     11.07      0.00     0.00

By Self

$by.self
                        self.time self.pct total.time total.pct
"lm.fit"                     2.99    40.35       3.50     47.23
"as.list.data.frame"         0.82    11.07       0.82     11.07
"[.data.frame"               0.79    10.66       1.03     13.90
"structure"                  0.73     9.85       0.73      9.85
"na.omit.data.frame"         0.49     6.61       1.30     17.54
"list"                       0.46     6.21       0.46      6.21
"lm"                         0.30     4.05       7.41    100.00
"model.matrix.default"       0.27     3.64       0.79     10.66
"na.omit"                    0.24     3.24       1.54     20.78
"as.character"               0.18     2.43       0.18      2.43
"model.frame.default"        0.12     1.62       2.24     30.23
"anyDuplicated.default"      0.02     0.27       0.02      0.27

summaryRprof() Output

$sample.interval
[1] 0.02

$sampling.time
[1] 7.41

Summary

  • Rprof() runs the profiler for performance analysis of R code

  • summaryRprof() summarizes the output of Rprof() and gives percent of time spent in each function (with two types of normalization)

  • Good to break your code into functions so that the profiler can give useful information about where time is being spent

  • C or Fortran code is not profiled

