Statistical Inference, Week 2

Source: Internet | Editor: 程序博客网 | Time: 2024/04/30 17:51

2.1 Conditional Probability

Conditional probability, motivation

  • The probability of getting a one when rolling a (standard) die is usually assumed to be one sixth
  • Suppose you were given the extra information that the die roll was an odd number (hence 1, 3, or 5)
  • Conditional on this new information, the probability of a one is now one third
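The update above is just the definition P(A | B) = P(A and B) / P(B); a minimal Python check using exact fractions:

```python
from fractions import Fraction

# P(roll = 1) for a fair die, and P(roll is odd)
p_one = Fraction(1, 6)
p_odd = Fraction(3, 6)

# {1} is a subset of the odd outcomes, so P(1 and odd) = P(1)
p_one_given_odd = p_one / p_odd
print(p_one_given_odd)  # 1/3
```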


Diagnostic tests

  • Let + and - be the events that the result of a diagnostic test is positive or negative respectively
  • Let D and D^c be the event that the subject of the test has or does not have the disease respectively
  • The sensitivity is the probability that the test is positive given that the subject actually has the disease, P(+ | D)
  • The specificity is the probability that the test is negative given that the subject does not have the disease, P(- |D^c)

More definitions

  • The positive predictive value is the probability that the subject has the disease given that the test is positive, P(D |+)
  • The negative predictive value is the probability that the subject does not have the disease given that the test is negative, P(D^c | -)
  • The prevalence of the disease is the marginal probability of disease, P(D)

More definitions

  • The diagnostic likelihood ratio of a positive test, labeled DLR_+, is P(+ | D) / P(+ | D^c), which is sensitivity / (1 - specificity)
  • The diagnostic likelihood ratio of a negative test, labeled DLR_-, is P(- | D) / P(- | D^c), which is (1 - sensitivity) / specificity

Example

  • A study comparing the efficacy of HIV tests reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%
  • Suppose that a subject, from a population with a .1% prevalence of HIV, receives a positive test result. What is the probability that this subject has HIV?
  • Mathematically, we want P(D | +) given the sensitivity P(+ | D) = .997, the specificity P(- | D^c) = .985, and the prevalence P(D) = .001
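Applying Bayes' rule answers the question (a Python sketch using the slide's numbers; the low value it produces is what the discussion below refers to):

```python
# Bayes' rule: P(D | +) = P(+|D) P(D) / [ P(+|D) P(D) + P(+|D^c) P(D^c) ]
sens = 0.997   # sensitivity, P(+ | D)
spec = 0.985   # specificity, P(- | D^c)
prev = 0.001   # prevalence,  P(D)

ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
print(round(ppv, 3))  # 0.062 -- only about a 6% chance of disease
```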

More on this example

  • The low positive predictive value is due to low prevalence of disease and the somewhat modest specificity
  • Suppose it was known that the subject was an intravenous drug user and routinely had intercourse with an HIV-infected partner
  • Notice that the evidence implied by a positive test result does not change because of the prevalence of disease in the subject's population, only our interpretation of that evidence changes
In other words, a drug user's chance of having HIV is relatively high, i.e. his P(D) is large, whereas P(D) is small for someone with low-risk habits; the P(D) used above is the marginal value over the whole population.




HIV example revisited

  • Suppose a subject has a positive HIV test
  • DLR_+ = .997 / (1 - .985) ≈ 66
  • The result of the positive test is that the odds of disease is now 66 times the pretest odds
  • Or, equivalently, the hypothesis of disease is 66 times more supported by the data than the hypothesis of no disease
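In odds terms (a Python sketch with the slide's numbers): pretest odds of .001/.999, multiplied by DLR_+, give post-test odds of about .066, which converts back to roughly a 6% probability of disease.

```python
prev = 0.001
dlr_pos = 0.997 / (1 - 0.985)               # ~66.5
pretest_odds = prev / (1 - prev)
posttest_odds = dlr_pos * pretest_odds      # odds scaled up 66-fold
prob = posttest_odds / (1 + posttest_odds)  # odds back to a probability
print(round(posttest_odds, 3), round(prob, 3))  # 0.067 0.062
```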

HIV example revisited

  • Suppose that a subject has a negative test result
  • DLR_- = (1 - .997) / .985 ≈ .003
  • Therefore, the post-test odds of disease are now .3% of the pretest odds given the negative test
  • Or, the hypothesis of disease is supported .003 times that of the hypothesis of absence of disease given the negative test result

2.2 Common Distributions


n <- 5
pvals <- seq(0, 1, length = 1000)
plot(c(0, 1), c(0, 1.2), type = "n", frame = FALSE, xlab = "p", ylab = "likelihood")
text((0 : n) / n, 1.1, as.character(0 : n))
sapply(0 : n, function(x) {
  phat <- x / n
  if (x == 0) lines(pvals, ((1 - pvals) / (1 - phat))^(n - x), lwd = 3)
  else if (x == n) lines(pvals, (pvals / phat)^x, lwd = 3)
  else lines(pvals, (pvals / phat)^x * ((1 - pvals) / (1 - phat))^(n - x), lwd = 3)
})
title(paste("Likelihoods for n = ", n))

(type = "n" sets up the plot without drawing any points.)









Other properties

  • The normal distribution is symmetric and peaked about its mean (therefore the mean, median and mode are all equal)
  • A constant times a normally distributed random variable is also normally distributed (what is the mean and variance?)
  • Sums of normally distributed random variables are again normally distributed even if the variables are dependent (what is the mean and variance?)
  • Sample means of normally distributed random variables are again normally distributed (with what mean and variance?)
  • The square of a standard normal random variable follows what is called the chi-squared distribution
  • The exponential of a normally distributed random variable follows what is called the log-normal distribution
  • As we will see later, many random variables, properly normalized, limit to a normal distribution
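The parenthetical questions have standard answers; assuming $X \sim N(\mu, \sigma^2)$ and, for the sum, independent $X_i \sim N(\mu_i, \sigma_i^2)$:

```latex
aX \sim N\left(a\mu,\; a^2\sigma^2\right), \qquad
\sum_{i=1}^n X_i \sim N\left(\sum_{i=1}^n \mu_i,\; \sum_{i=1}^n \sigma_i^2\right), \qquad
\bar{X} \sim N\left(\mu,\; \sigma^2 / n\right)
```

For dependent normals the sum is still normal, but its variance picks up covariance terms: $\mathrm{Var}\left(\sum_i X_i\right) = \sum_i \sigma_i^2 + 2\sum_{i<j} \mathrm{Cov}(X_i, X_j)$.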



Some uses for the Poisson distribution

  • Modeling event/time data
  • Modeling radioactive decay
  • Modeling survival data
  • Modeling unbounded count data
  • Modeling contingency tables
  • Approximating binomials when $n$ is large and $p$ is small


Poisson derivation

  • $\lambda$ is the mean number of events per unit time
  • Let $h$ be very small
  • Suppose we assume that
    • Prob. of an event in an interval of length $h$ is $\lambda h$, while the prob. of more than one event is negligible
    • Whether or not an event occurs in one small interval does not impact whether or not an event occurs in another small interval
  • Then the number of events per unit time is Poisson with mean $\lambda$

Example

The number of people that show up at a bus stop is Poisson with a mean of $2.5$ per hour.

If watching the bus stop for 4 hours, what is the probability that $3$ or fewer people show up for the whole time?

ppois(3, lambda = 2.5 * 4)
[1] 0.01034
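The ppois call sums the Poisson pmf up to 3 with mean $2.5 \times 4 = 10$; the same number falls out of a direct sum (a Python sketch, where poisson_cdf is a hypothetical helper, not a library call):

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    # P(X <= k) for X ~ Poisson(lam): sum of e^-lam * lam^i / i!
    return sum(exp(-lam) * lam**i / factorial(i) for i in range(k + 1))

# 2.5 people per hour, watched for 4 hours, gives mean 10
print(round(poisson_cdf(3, 2.5 * 4), 5))  # 0.01034
```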

Example, Poisson approximation to the binomial

We flip a coin with success probability $0.01$ five hundred times.

What's the probability of 2 or fewer successes?

pbinom(2, size = 500, prob = .01)
[1] 0.1234
ppois(2, lambda=500 * .01)
[1] 0.1247
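The agreement between the two R calls can be reproduced directly from the pmfs (a Python sketch):

```python
from math import comb, exp, factorial

n, p = 500, 0.01
lam = n * p  # Poisson mean matched to the binomial

# P(X <= 2) under each model
binom = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3))
pois = sum(exp(-lam) * lam**k / factorial(k) for k in range(3))
print(round(binom, 4), round(pois, 4))  # 0.1234 0.1247
```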

2.3 Asymptotics

Asymptotics

  • Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) limits to infinity (or some other relevant number)
  • (Asymptopia is my name for the land of asymptotics, where everything works out well and there's no messes. The land of infinite data is nice that way.)
  • Asymptotics are incredibly useful for simple statistical inference and approximations
  • (Not covered in this class) Asymptotics often lead to nice understanding of procedures
  • Asymptotics generally give no assurances about finite sample performance
    • The kinds of asymptotics that do are orders of magnitude more difficult to work with
  • Asymptotics form the basis for frequency interpretation of probabilities (the long run proportion of times an event occurs)
  • To understand asymptotics, we need a very basic understanding of limits.

Numerical limits

  • Imagine a sequence

    • a1 = .9,
    • a2 = .99,
    • a3 = .999, ...
  • Clearly this sequence converges to 1

  • Definition of a limit: For any fixed distance we can find a point in the sequence so that the sequence is closer to the limit than that distance from that point on
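The definition can be checked mechanically for the sequence above (a Python sketch; the fixed distance $10^{-6}$ is an arbitrary choice):

```python
# a_n = 1 - 10^-n, i.e. .9, .99, .999, ...
a = [1 - 10**-n for n in range(1, 10)]

# For the fixed distance 1e-6, every term from n = 7 onward
# is closer to the limit 1 than that distance
print(all(abs(1 - x) < 1e-6 for x in a[6:]))  # True
```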

Discussion

  • An estimator is consistent if it converges to what you want to estimate
    • Consistency is neither necessary nor sufficient for one estimator to be better than another
    • Typically, good estimators are consistent; it's not too much to ask that if we go to the trouble of collecting an infinite amount of data that we get the right answer
  • The LLN basically states that the sample mean is consistent
  • The sample variance and the sample standard deviation are consistent as well
  • Recall also that the sample mean and the sample variance are unbiased
  • (The sample standard deviation is biased, by the way)
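A quick simulation of the LLN for coin flips (a Python sketch; the seed and sample size are arbitrary choices):

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility
flips = [random.randint(0, 1) for _ in range(100_000)]

# The proportion of heads should settle near the true mean 0.5
print(abs(sum(flips) / len(flips) - 0.5) < 0.01)  # True
```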


par(mfrow = c(1, 3))
for (n in c(1, 2, 6)){
  temp <- matrix(sample(1 : 6, n * 10000, replace = TRUE), ncol = n)
  temp <- apply(temp, 1, mean)
  temp <- (temp - 3.5) / (1.71 / sqrt(n))
  dty <- density(temp)
  plot(dty$x, dty$y, xlab = "", ylab = "density", type = "n", xlim = c(-3, 3), ylim = c(0, .5))
  title(paste("sample mean of", n, "obs"))
  lines(seq(-3, 3, length = 100), dnorm(seq(-3, 3, length = 100)), col = grey(.8), lwd = 3)
  lines(dty$x, dty$y, lwd = 2)
}



par(mfrow = c(2, 3))
for (n in c(1, 10, 20)){
  temp <- matrix(sample(0 : 1, n * 10000, replace = TRUE), ncol = n)
  temp <- apply(temp, 1, mean)
  temp <- (temp - .5) * 2 * sqrt(n)
  dty <- density(temp)
  plot(dty$x, dty$y, xlab = "", ylab = "density", type = "n", xlim = c(-3, 3), ylim = c(0, .5))
  title(paste("sample mean of", n, "obs"))
  lines(seq(-3, 3, length = 100), dnorm(seq(-3, 3, length = 100)), col = grey(.8), lwd = 3)
  lines(dty$x, dty$y, lwd = 2)
}
for (n in c(1, 10, 20)){
  temp <- matrix(sample(0 : 1, n * 10000, replace = TRUE, prob = c(.9, .1)), ncol = n)
  temp <- apply(temp, 1, mean)
  temp <- (temp - .1) / sqrt(.1 * .9 / n)
  dty <- density(temp)
  plot(dty$x, dty$y, xlab = "", ylab = "density", type = "n", xlim = c(-3, 3), ylim = c(0, .5))
  title(paste("sample mean of", n, "obs"))
  lines(seq(-3, 3, length = 100), dnorm(seq(-3, 3, length = 100)), col = grey(.8), lwd = 3)
  lines(dty$x, dty$y, lwd = 2)
}



Give a confidence interval for the average height of sons in Galton's data

library(UsingR)
data(father.son)
x <- father.son$sheight
(mean(x) + c(-1, 1) * qnorm(.975) * sd(x) / sqrt(length(x))) / 12
[1] 5.710 5.738

Example

  • Your campaign advisor told you that in a random sample of 100 likely voters, 56 intend to vote for you.
    • Can you relax? Do you have this race in the bag?
    • Without access to a computer or calculator, how precise is this estimate?
  • 1/sqrt(100) = .1, so a back-of-the-envelope calculation gives an approximate 95% interval of (0.46, 0.66)
    • Not enough for you to relax, better go do more campaigning!
  • Rough guidelines, 100 for 1 decimal place, 10,000 for 2, 1,000,000 for 3.
round(1 / sqrt(10 ^ (1 : 6)), 3)
[1] 0.316 0.100 0.032 0.010 0.003 0.001
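The 1/sqrt(n) rule works because $p(1-p) \le 1/4$ for any $p$, so two standard errors are at most $2 \cdot \frac{1}{2}/\sqrt{n} = 1/\sqrt{n}$. A Python sketch of the back-of-the-envelope interval:

```python
from math import sqrt

p_hat, n = 0.56, 100

# p(1 - p) <= 1/4, so 2 * sqrt(p(1 - p)/n) <= 1/sqrt(n)
lo = round(p_hat - 1 / sqrt(n), 2)
hi = round(p_hat + 1 / sqrt(n), 2)
print(lo, hi)  # 0.46 0.66
```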

R code

x <- 5
t <- 94.32
lambda <- x / t
round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda / t), 3)
[1] 0.007 0.099
poisson.test(x, T = 94.32)$conf
[1] 0.01721 0.12371
attr(,"conf.level")
[1] 0.95
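The first R interval is the normal-approximation (Wald) interval $\hat\lambda \pm z_{0.975}\sqrt{\hat\lambda / t}$; a Python sketch, with 1.96 standing in for qnorm(.975):

```python
from math import sqrt

x, t = 5, 94.32          # event count and person-time at risk
lam = x / t              # estimated rate
z = 1.96                 # approximate 97.5th percentile of the standard normal

lo = lam - z * sqrt(lam / t)
hi = lam + z * sqrt(lam / t)
print(round(lo, 3), round(hi, 3))  # 0.007 0.099
```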

In the regression class

exp(confint(glm(x ~ 1 + offset(log(t)), family = poisson(link = log))))
  2.5 %  97.5 %
0.01901 0.11393

2.4 The Chi-Squared Distribution

Confidence intervals

  • In the previous, we discussed creating a confidence interval using the CLT
  • In this lecture, we discuss some methods for small samples, notably Gosset's $t$ distribution
  • To discuss the $t$ distribution we must discuss the Chi-squared distribution
  • Throughout we use the following general procedure for creating CIs

    a. Create a pivot: a statistic whose distribution does not depend on the parameter of interest

    b. Solve the probability that the pivot lies between bounds for the parameter
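For the variance of iid normal data, step (a) uses the chi-squared pivot, whose distribution is free of $\sigma^2$:

```latex
\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}
\quad\Longrightarrow\quad
P\left( \chi^2_{n-1,\,\alpha/2} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{n-1,\,1-\alpha/2} \right) = 1 - \alpha
```

Solving the inequalities for $\sigma^2$ (step (b)) gives the interval $\left[ (n-1)S^2 / \chi^2_{n-1,\,1-\alpha/2},\; (n-1)S^2 / \chi^2_{n-1,\,\alpha/2} \right]$; taking square roots yields the interval for the standard deviation used in the example below.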





Example

Confidence interval for the standard deviation of sons' heights from Galton's data

library(UsingR)
data(father.son)
x <- father.son$sheight
s <- sd(x)
n <- length(x)
round(sqrt((n - 1) * s^2 / qchisq(c(0.975, 0.025), n - 1)), 3)
## [1] 2.701 2.939



Notes about the $t$ interval

  • The $t$ interval technically assumes that the data are iid normal, though it is robust to this assumption
  • It works well whenever the distribution of the data is roughly symmetric and mound shaped
  • Paired observations are often analyzed using the $t$ interval by taking differences
  • For large degrees of freedom, $t$ quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval as the CLT yielded
  • For skewed distributions, the spirit of the $t$ interval assumptions is violated
  • Also, for skewed distributions, it doesn't make a lot of sense to center the interval at the mean
  • In this case, consider taking logs or using a different summary like the median
  • For highly discrete data, like binary, other intervals are available

Sleep data

In R typing data(sleep) brings up the sleep data originally analyzed in Gosset's Biometrika paper, which shows the increase in hours for 10 patients on two soporific drugs. R treats the data as two groups rather than paired.


The data

data(sleep)
head(sleep)
##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6

g1 <- sleep$extra[1:10]
g2 <- sleep$extra[11:20]
difference <- g2 - g1
mn <- mean(difference)
s <- sd(difference)
n <- 10
mn + c(-1, 1) * qt(0.975, n - 1) * s / sqrt(n)
## [1] 0.7001 2.4599
t.test(difference)$conf.int
## [1] 0.7001 2.4599
## attr(,"conf.level")
## [1] 0.95
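The same interval can be reproduced outside R (a Python sketch; the data values are copied from R's sleep dataset and the quantile qt(0.975, 9) ≈ 2.262157 is hardcoded rather than computed):

```python
from math import sqrt
from statistics import mean, stdev

# extra sleep (hours) for the 10 patients under each drug, from R's sleep data
g1 = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
g2 = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]
diff = [b - a for a, b in zip(g1, g2)]  # paired differences

n = len(diff)
t_quantile = 2.262157  # qt(0.975, df = 9), hardcoded
half_width = t_quantile * stdev(diff) / sqrt(n)
print(round(mean(diff) - half_width, 4), round(mean(diff) + half_width, 4))
# 0.7001 2.4599
```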

