R语言笔记四

来源:互联网 发布:甲醛无色无味 知乎 编辑:程序博客网 时间:2024/05/17 23:17

str function
str: Compacktly display the internal structure of an R object

  • A diagnostic function and an alternative to "summary"
  • It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists.
  • Roughly one line per basic object
> str(str)function (object, ...)  > str(lm)function (formula, data, subset, weights, na.action, method = "qr", model = TRUE,     x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL,     offset, ...)  > str(ls)function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,     pattern, sorted = TRUE)  > x <- rnorm(100,2,4)> summary(x)   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. -5.8020 -0.8004  2.2660  2.0920  4.3450 13.2600 > str(x) num [1:100] 5.6 2.71 3 5.35 -5.61 ...> f<- gl(40,10)> str(f) Factor w/ 40 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...> summary(f) 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 27 28 29 30 31 32 33 34 35 36 37 38 39 40 10 10 10 10 10 10 10 10 10 10 10 10 10 10 > library(datasets)> head(airquality)  Ozone Solar.R Wind Temp Month Day1    41     190  7.4   67     5   12    36     118  8.0   72     5   23    12     149 12.6   74     5   34    18     313 11.5   62     5   45    NA      NA 14.3   56     5   56    28      NA 14.9   66     5   6> str(airquality)'data.frame':   153 obs. of  6 variables: $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ... $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ... $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ... $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ... $ Month  : int  5 5 5 5 5 5 5 5 5 5 ... $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...> m<-matrix(rnorm(100),10,10)> str(m) num [1:10, 1:10] -0.706 1.507 0.171 1.577 -1.295 ...> m[,1] [1] -0.7060449  1.5071743  0.1711907  1.5765175 -1.2952681 -0.3800775 [7] -0.2051980  1.7714065 -1.8690810  1.1779065> s<-split(airquality, airquality$Month)> str(s)List of 5 $ 5:'data.frame': 31 obs. of  6 variables:  ..$ Ozone  : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ...  ..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...  ..$ Wind   : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...  ..$ Temp   : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...  ..$ Month  : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ... $ 6:'data.frame': 30 obs. of  6 variables:  ..$ Ozone  : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ...  ..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...  ..$ Wind   : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...  ..$ Temp   : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...  ..$ Month  : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...  ..$ Day    : int [1:30] 1 2 3 4 5 6 7 8 9 10 ... $ 7:'data.frame': 31 obs. of  6 variables:  ..$ Ozone  : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ...  ..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...  ..$ Wind   : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...  ..$ Temp   : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...  ..$ Month  : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ... $ 8:'data.frame': 31 obs. of  6 variables:  ..$ Ozone  : int [1:31] 39 9 16 78 35 66 122 89 110 NA ...  ..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...  ..$ Wind   : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...  ..$ Temp   : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...  ..$ Month  : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ... $ 9:'data.frame': 30 obs. of  6 variables:  ..$ Ozone  : int [1:30] 96 78 73 91 47 32 20 23 21 24 ...  ..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...  ..$ Wind   : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...  ..$ Temp   : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...  ..$ Month  : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...  ..$ Day    : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...

Simulation

Generating Random Numbers

Functions for probability distributions in R

  • rnorm: generate random Normal variates with a given mean and standard deviation
  • dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points)
  • pnorm: evaluate the cumulative distribution function for a Normal distribution
  • rpois: generate random Poisson variates with a given rate

Probability distribution functions usually have four functions associated with them. The functions are prefixed with a

  • d for density
  • r for random number generation
  • p for cumulative distribution
  • q for quantile function

Working with the Normal distributions requires using these four functions

dnorm(x, mean = 0, sd = 1, log = FALSE)pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)rnorm(n, mean = 0, sd = 1)> x<-rnorm(10)> x [1] -0.7035158  2.0931550 -1.5634806 -0.1795591 -0.6867991 -0.2751694 [7] -1.2051665  2.2951353 -0.5438907  0.4239533> x<-rnorm(10, 20, 2)> x [1] 17.81015 19.16094 19.54943 20.14397 22.13014 23.95761 23.51140 17.84877 [9] 19.09331 21.67001> summary(x)   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.   17.81   19.11   19.85   20.49   22.02   23.96 

Setting the random number seed with set.seed ensures reproducibility

> set.seed(1)> rnorm(5)[1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078> rnorm(5)[1] -0.8204684  0.4874291  0.7383247  0.5757814 -0.3053884> set.seed(1)> rnorm(5)[1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

Always set the random number seed when conducting a simulation!

Generating Poisson data

> rpois(10, 1) [1] 3 1 0 1 0 0 1 0 1 1> rpois(10, 2) [1] 6 2 2 1 3 2 2 1 1 2> rpois(10, 20) [1] 20 11 21 20 20 21 17 15 24 20> ppois(2, 2) ## Cumulative distribution[1] 0.6766764 ## Pr(x <= 2)> ppois(4, 2)[1] 0.947347 ## Pr(x <= 4)> ppois(6, 2)[1] 0.9954662 ## Pr(x <= 6)

Generating Random Numbers From a Linear Generating Random Numbers From a Linear Model

Suppose we want to simulate from the following linear model

> set.seed(20)> x<-rnorm(100)> e<-rnorm(100,0,2)> y <- 0.5 + 2 * x + e> summary(y)   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. -6.4080 -1.5400  0.6789  0.6893  2.9300  6.5050 > plot(x,y)

What if x is binary?

> set.seed(10)> x<-rbinom(100,1,0.5)> e<-rnorm(100,0,2)> y<-0.5 + 2 * x + e> summary(y)   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. -3.4940 -0.1409  1.5770  1.4320  2.8400  6.9410 > plot(x,y)

Suppose we want to simulate from a Poisson model where

Y ~ Poisson(μ)
log μ =β0+ β1x
and β0 = 0.5 and β1 = 0.3. We need to use the rpois function for this

> set.seed(1)> x<-rnorm(100)> log.mu<-0.5 + 0.3 * x> y <- rpois(100, exp(log.mu))> summary(y)   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    0.00    1.00    1.00    1.55    2.00    6.00 > plot(x,y)

Random Sampling
The sample function draws randomly from a specified set of (scalar) objects allowing you to sample from arbitrary distributions.

> set.seed(1)> sample(1:10, 4)[1] 3 4 5 7> sample(1:10, 4)[1] 3 9 8 5> sample(letters, 5)[1] "q" "b" "e" "x" "p"> sample(1:10) ## permutation [1] 4 710 6 9 2 8 3 1 5> sample(1:10) [1] 2 3 4 1 9 5 10 8 6 7> sample(1:10, replace = TRUE) ## Sample w/replacement [1] 2 9 7 8 2 8 5 9 7 8

Summary

  • Drawing samples from specific probability distributions can be done with r* functions
  • Standard distributions are built in: Normal, Poisson, Binomial, Exponential, Gamma, etc.
  • The sample function can be used to draw random samples from arbitrary vectors
  • Setting the random number generator seed via set.seed is critical for reproducibility

R Profiler

Why is My Code So Slow? Why is My Code So Slow?

  • Profiling is a systematic way to examine how much time is spend in different parts of a program
  • Useful when trying to optimize your code
  • Often code runs fine once, but what if you have to put it in a loop for 1,000 iterations? Is it still fast enough?
  • Profiling is better than guessing

On Optimizing Your Code On Optimizing Your Code

  • Getting biggest impact on speeding up code depends on knowing where the code spends most of its time
  • This cannot be done without performance analysis or profiling

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil –Donald Knuth

General Principles of Optimization General Principles of Optimiz

  • Design first, then optimize
  • Remember: Premature optimization is the root of all evil
  • Measure (collect data), don’t guess
  • If you're going to be scientist, you need to apply the same principles here!

Using system.time()

  • Takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the amount of time taken to evaluate the expression
  • Computes the time (in seconds) needed to execute an expression Returns an object of class proc_time
    • If there’s an error, gives time until the error occurred
  • Returns an object of class proc_time
    user time: time charged to the CPU(s) for this expression
    elapsed time: “wall clock” time

  • Usually, the user time and elapsed time are relatively close, for straight computing tasks

  • Elapsed time may be greater than user time if the CPU spends a lot of time waiting around
  • Elapsted time may be smaller than the user time if your machine has multiple cores/processors (and is capable of using them)
    • Multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL)
    • Parallel processing via the parallel package
## Elapsed time > user timesystem.time(readLines("http://www.jhsph.edu"))    user system elapsed    0.004 0.002 0.431## Elapsed time < user timehilbert <- function(n) {      i <- 1:n      1 / outer(i - 1, i, "+”)}x <- hilbert(1000)system.time(svd(x))     user system elapsed    1.605 0.094 0.742 

Timing Longer Expressions Timing Longer Expressions

system.time({ n <- 1000 r <- numeric(n) for (i in 1:n) { x <- rnorm(n) r[i] <- mean(x) }})## user system elapsed## 0.097 0.002 0.099

Beyond system.time()

  • Using system.time() allows you to test certain functions or code blocks to see if they are taking excessive amounts of time
  • Assumes you already know where the problem is and can call system.time() on it
  • What if you don’t know where to start?

The R Profiler

  • The Rprof() function starts the profiler in R
  • R must be compiled with profiler support (but this is usually the case)
  • The summaryRprof() function summarizes the output from Rprof() (otherwise it’s not readable)
  • DO NOT use system.time() and Rprof() together or you will be sad

R Profiler Raw Output R Profiler Raw Output

## lm(y ~ x)sample.interval=10000"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm""list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm""list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm""list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm""na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm""na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm""na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm""na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm""na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm""na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm""na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm""lm.fit" "lm""lm.fit" "lm""lm.fit" "lm"

Using summaryRprof()

  • The summaryRprof() function tabulates the R profiler output and calculates how much time is spend in which function
  • There are two methods for normalizing the data
  • "by.total" divides the time spend in each function by the total run time
  • "by.self" does the same but first subtracts out time spent in functions above in the call stack

By Total

By Self

summaryRprof() Output

$sample.interval[1] 0.02$sampling.time[1] 7.41

Summary

  • Rprof() runs the profiler for performance of analysis of R code
  • summaryRprof() summarizes the output of Rprof() and gives percent of time spent in each function (with two types of normalization)
  • Good to break your code into functions so that the profiler can give useful information about where time is being spent
  • C or Fortran code is not profiled
0 0
原创粉丝点击