Statistical Inference, Week 2

Source: Internet | Editor: 程序博客网 | Time: 2024/04/30 17:51

2.1 Conditional Probability

Conditional probability, motivation

  • The probability of getting a one when rolling a (standard) die is usually assumed to be one sixth
  • Suppose you were given the extra information that the die roll was an odd number (hence 1, 3, or 5)
  • Conditional on this new information, the probability of a one is now one third
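The update above is just the definition P(A | B) = P(A and B) / P(B); a minimal Python check using exact fractions:

```python
from fractions import Fraction

# P(roll = 1) for a fair die, and P(roll is odd)
p_one = Fraction(1, 6)
p_odd = Fraction(3, 6)

# {1} is a subset of the odd outcomes, so P(1 and odd) = P(1)
p_one_given_odd = p_one / p_odd
print(p_one_given_odd)  # 1/3
```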


Diagnostic tests

  • Let + and - be the events that the result of a diagnostic test is positive or negative respectively
  • Let D and D^c be the event that the subject of the test has or does not have the disease respectively
  • The sensitivity is the probability that the test is positive given that the subject actually has the disease, P(+ | D)
  • The specificity is the probability that the test is negative given that the subject does not have the disease, P(- |D^c)

More definitions

  • The positive predictive value is the probability that the subject has the disease given that the test is positive, P(D |+)
  • The negative predictive value is the probability that the subject does not have the disease given that the test is negative, P(D^c | -)
  • The prevalence of the disease is the marginal probability of disease, P(D)

More definitions

  • The diagnostic likelihood ratio of a positive test, labeled DLR_+, is P(+ | D) / P(+ | D^c), which is sensitivity / (1 - specificity)
  • The diagnostic likelihood ratio of a negative test, labeled DLR_-, is P(- | D) / P(- | D^c), which is (1 - sensitivity) / specificity

Example

  • A study comparing the efficacy of HIV tests reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%
  • Suppose that a subject, from a population with a .1% prevalence of HIV, receives a positive test result. What is the probability that this subject has HIV?
  • Mathematically, we want P(D | +) given the sensitivity P(+ | D) = .997, the specificity P(- | D^c) = .985, and the prevalence P(D) = .001
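Applying Bayes' rule answers the question (a Python sketch using the slide's numbers; the low value it produces is what the discussion below refers to):

```python
# Bayes' rule: P(D | +) = P(+|D) P(D) / [ P(+|D) P(D) + P(+|D^c) P(D^c) ]
sens = 0.997   # sensitivity, P(+ | D)
spec = 0.985   # specificity, P(- | D^c)
prev = 0.001   # prevalence,  P(D)

ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
print(round(ppv, 3))  # 0.062 -- only about a 6% chance of disease
```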

More on this example

  • The low positive predictive value is due to low prevalence of disease and the somewhat modest specificity
  • Suppose it was known that the subject was an intravenous drug user and routinely had intercourse with an HIV-infected partner
  • Notice that the evidence implied by a positive test result does not change because of the prevalence of disease in the subject's population, only our interpretation of that evidence changes
In other words, a drug user's chance of having HIV is relatively high, i.e. his P(D) is large, whereas P(D) is small for someone with low-risk habits; the P(D) used above is the marginal value over the whole population.




HIV example revisited

  • Suppose a subject has a positive HIV test
  • DLR_+ = .997 / (1 - .985) ≈ 66
  • The result of the positive test is that the odds of disease is now 66 times the pretest odds
  • Or, equivalently, the hypothesis of disease is 66 times more supported by the data than the hypothesis of no disease
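In odds terms (a Python sketch with the slide's numbers): pretest odds of .001/.999, multiplied by DLR_+, give post-test odds of about .066, which converts back to roughly a 6% probability of disease.

```python
prev = 0.001
dlr_pos = 0.997 / (1 - 0.985)               # ~66.5
pretest_odds = prev / (1 - prev)
posttest_odds = dlr_pos * pretest_odds      # odds scaled up 66-fold
prob = posttest_odds / (1 + posttest_odds)  # odds back to a probability
print(round(posttest_odds, 3), round(prob, 3))  # 0.067 0.062
```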

HIV example revisited

  • Suppose that a subject has a negative test result
  • DLR_- = (1 - .997) / .985 ≈ .003
  • Therefore, the post-test odds of disease are now .3% of the pretest odds given the negative test
  • Or, the hypothesis of disease is supported .003 times that of the hypothesis of absence of disease given the negative test result

2.2 Common Distributions


n <- 5
pvals <- seq(0, 1, length = 1000)
plot(c(0, 1), c(0, 1.2), type = "n", frame = FALSE, xlab = "p", ylab = "likelihood")
text((0 : n) / n, 1.1, as.character(0 : n))
sapply(0 : n, function(x) {
  phat <- x / n
  if (x == 0) lines(pvals, ((1 - pvals) / (1 - phat))^(n - x), lwd = 3)
  else if (x == n) lines(pvals, (pvals / phat)^x, lwd = 3)
  else lines(pvals, (pvals / phat)^x * ((1 - pvals) / (1 - phat))^(n - x), lwd = 3)
})
title(paste("Likelihoods for n = ", n))

(type = "n" sets up the plot without drawing any points.)









Other properties

  • The normal distribution is symmetric and peaked about its mean (therefore the mean, median and mode are all equal)
  • A constant times a normally distributed random variable is also normally distributed (what is the mean and variance?)
  • Sums of normally distributed random variables are again normally distributed even if the variables are dependent (what is the mean and variance?)
  • Sample means of normally distributed random variables are again normally distributed (with what mean and variance?)
  • The square of a standard normal random variable follows what is called the chi-squared distribution
  • The exponential of a normally distributed random variable follows what is called the log-normal distribution
  • As we will see later, many random variables, properly normalized, limit to a normal distribution
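The parenthetical questions have standard answers; assuming $X \sim N(\mu, \sigma^2)$ and, for the sum, independent $X_i \sim N(\mu_i, \sigma_i^2)$:

```latex
aX \sim N\left(a\mu,\; a^2\sigma^2\right), \qquad
\sum_{i=1}^n X_i \sim N\left(\sum_{i=1}^n \mu_i,\; \sum_{i=1}^n \sigma_i^2\right), \qquad
\bar{X} \sim N\left(\mu,\; \sigma^2 / n\right)
```

For dependent normals the sum is still normal, but its variance picks up covariance terms: $\mathrm{Var}\left(\sum_i X_i\right) = \sum_i \sigma_i^2 + 2\sum_{i<j} \mathrm{Cov}(X_i, X_j)$.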



Some uses for the Poisson distribution

  • Modeling event/time data
  • Modeling radioactive decay
  • Modeling survival data
  • Modeling unbounded count data
  • Modeling contingency tables
  • Approximating binomials when $n$ is large and $p$ is small


Poisson derivation

  • $\lambda$ is the mean number of events per unit time
  • Let $h$ be very small
  • Suppose we assume that
    • Prob. of an event in an interval of length $h$ is $\lambda h$, while the prob. of more than one event is negligible
    • Whether or not an event occurs in one small interval does not impact whether or not an event occurs in another small interval
  • Then the number of events per unit time is Poisson with mean $\lambda$

Example

The number of people that show up at a bus stop is Poisson with a mean of $2.5$ per hour.

If watching the bus stop for 4 hours, what is the probability that $3$ or fewer people show up for the whole time?

ppois(3, lambda = 2.5 * 4)
[1] 0.01034
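The ppois call sums the Poisson pmf up to 3 with mean $2.5 \times 4 = 10$; the same number falls out of a direct sum (a Python sketch, where poisson_cdf is a hypothetical helper, not a library call):

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    # P(X <= k) for X ~ Poisson(lam): sum of e^-lam * lam^i / i!
    return sum(exp(-lam) * lam**i / factorial(i) for i in range(k + 1))

# 2.5 people per hour, watched for 4 hours, gives mean 10
print(round(poisson_cdf(3, 2.5 * 4), 5))  # 0.01034
```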

Example, Poisson approximation to the binomial

We flip a coin with success probability $0.01$ five hundred times.

What's the probability of 2 or fewer successes?

pbinom(2, size = 500, prob = .01)
[1] 0.1234
ppois(2, lambda=500 * .01)
[1] 0.1247
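The agreement between the two R calls can be reproduced directly from the pmfs (a Python sketch):

```python
from math import comb, exp, factorial

n, p = 500, 0.01
lam = n * p  # Poisson mean matched to the binomial

# P(X <= 2) under each model
binom = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3))
pois = sum(exp(-lam) * lam**k / factorial(k) for k in range(3))
print(round(binom, 4), round(pois, 4))  # 0.1234 0.1247
```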

2.3 Asymptotics

Asymptotics

  • Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) limits to infinity (or some other relevant number)
  • (Asymptopia is my name for the land of asymptotics, where everything works out well and there's no messes. The land of infinite data is nice that way.)
  • Asymptotics are incredibly useful for simple statistical inference and approximations
  • (Not covered in this class) Asymptotics often lead to nice understanding of procedures
  • Asymptotics generally give no assurances about finite sample performance
    • The kinds of asymptotics that do are orders of magnitude more difficult to work with
  • Asymptotics form the basis for frequency interpretation of probabilities (the long run proportion of times an event occurs)
  • To understand asymptotics, we need a very basic understanding of limits.

Numerical limits

  • Imagine a sequence

    • a1 = .9,
    • a2 = .99,
    • a3 = .999, ...
  • Clearly this sequence converges to 1

  • Definition of a limit: For any fixed distance we can find a point in the sequence so that the sequence is closer to the limit than that distance from that point on
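The definition can be checked mechanically for the sequence above (a Python sketch; the fixed distance $10^{-6}$ is an arbitrary choice):

```python
# a_n = 1 - 10^-n, i.e. .9, .99, .999, ...
a = [1 - 10**-n for n in range(1, 10)]

# For the fixed distance 1e-6, every term from n = 7 onward
# is closer to the limit 1 than that distance
print(all(abs(1 - x) < 1e-6 for x in a[6:]))  # True
```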

Discussion

  • An estimator is consistent if it converges to what you want to estimate
    • Consistency is neither necessary nor sufficient for one estimator to be better than another
    • Typically, good estimators are consistent; it's not too much to ask that if we go to the trouble of collecting an infinite amount of data that we get the right answer
  • The LLN basically states that the sample mean is consistent
  • The sample variance and the sample standard deviation are consistent as well
  • Recall also that the sample mean and the sample variance are unbiased
  • (The sample standard deviation is biased, by the way)
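A quick simulation of the LLN for coin flips (a Python sketch; the seed and sample size are arbitrary choices):

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility
flips = [random.randint(0, 1) for _ in range(100_000)]

# The proportion of heads should settle near the true mean 0.5
print(abs(sum(flips) / len(flips) - 0.5) < 0.01)  # True
```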


par(mfrow = c(1, 3))
for (n in c(1, 2, 6)){
  temp <- matrix(sample(1 : 6, n * 10000, replace = TRUE), ncol = n)
  temp <- apply(temp, 1, mean)
  temp <- (temp - 3.5) / (1.71 / sqrt(n))
  dty <- density(temp)
  plot(dty$x, dty$y, xlab = "", ylab = "density", type = "n", xlim = c(-3, 3), ylim = c(0, .5))
  title(paste("sample mean of", n, "obs"))
  lines(seq(-3, 3, length = 100), dnorm(seq(-3, 3, length = 100)), col = grey(.8), lwd = 3)
  lines(dty$x, dty$y, lwd = 2)
}



par(mfrow = c(2, 3))
for (n in c(1, 10, 20)){
  temp <- matrix(sample(0 : 1, n * 10000, replace = TRUE), ncol = n)
  temp <- apply(temp, 1, mean)
  temp <- (temp - .5) * 2 * sqrt(n)
  dty <- density(temp)
  plot(dty$x, dty$y, xlab = "", ylab = "density", type = "n", xlim = c(-3, 3), ylim = c(0, .5))
  title(paste("sample mean of", n, "obs"))
  lines(seq(-3, 3, length = 100), dnorm(seq(-3, 3, length = 100)), col = grey(.8), lwd = 3)
  lines(dty$x, dty$y, lwd = 2)
}
for (n in c(1, 10, 20)){
  temp <- matrix(sample(0 : 1, n * 10000, replace = TRUE, prob = c(.9, .1)), ncol = n)
  temp <- apply(temp, 1, mean)
  temp <- (temp - .1) / sqrt(.1 * .9 / n)
  dty <- density(temp)
  plot(dty$x, dty$y, xlab = "", ylab = "density", type = "n", xlim = c(-3, 3), ylim = c(0, .5))
  title(paste("sample mean of", n, "obs"))
  lines(seq(-3, 3, length = 100), dnorm(seq(-3, 3, length = 100)), col = grey(.8), lwd = 3)
  lines(dty$x, dty$y, lwd = 2)
}



Give a confidence interval for the average height of sons in Galton's data

library(UsingR)
data(father.son)
x <- father.son$sheight
(mean(x) + c(-1, 1) * qnorm(.975) * sd(x) / sqrt(length(x))) / 12
[1] 5.710 5.738

Example

  • Your campaign advisor told you that in a random sample of 100 likely voters, 56 intend to vote for you.
    • Can you relax? Do you have this race in the bag?
    • Without access to a computer or calculator, how precise is this estimate?
  • 1/sqrt(100) = .1, so a back-of-the-envelope calculation gives an approximate 95% interval of (0.46, 0.66)
    • Not enough for you to relax, better go do more campaigning!
  • Rough guidelines, 100 for 1 decimal place, 10,000 for 2, 1,000,000 for 3.
round(1 / sqrt(10 ^ (1 : 6)), 3)
[1] 0.316 0.100 0.032 0.010 0.003 0.001
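The 1/sqrt(n) rule works because $p(1-p) \le 1/4$ for any $p$, so two standard errors are at most $2 \cdot \frac{1}{2}/\sqrt{n} = 1/\sqrt{n}$. A Python sketch of the back-of-the-envelope interval:

```python
from math import sqrt

p_hat, n = 0.56, 100

# p(1 - p) <= 1/4, so 2 * sqrt(p(1 - p)/n) <= 1/sqrt(n)
lo = round(p_hat - 1 / sqrt(n), 2)
hi = round(p_hat + 1 / sqrt(n), 2)
print(lo, hi)  # 0.46 0.66
```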

R code

x <- 5
t <- 94.32
lambda <- x / t
round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda / t), 3)
[1] 0.007 0.099
poisson.test(x, T = 94.32)$conf
[1] 0.01721 0.12371
attr(,"conf.level")
[1] 0.95
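The first R interval is the normal-approximation (Wald) interval $\hat\lambda \pm z_{0.975}\sqrt{\hat\lambda / t}$; a Python sketch, with 1.96 standing in for qnorm(.975):

```python
from math import sqrt

x, t = 5, 94.32          # event count and person-time at risk
lam = x / t              # estimated rate
z = 1.96                 # approximate 97.5th percentile of the standard normal

lo = lam - z * sqrt(lam / t)
hi = lam + z * sqrt(lam / t)
print(round(lo, 3), round(hi, 3))  # 0.007 0.099
```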

In the regression class

exp(confint(glm(x ~ 1 + offset(log(t)), family = poisson(link = log))))
  2.5 %  97.5 %
0.01901 0.11393

2.4 The Chi-Squared Distribution

Confidence intervals

  • In the previous, we discussed creating a confidence interval using the CLT
  • In this lecture, we discuss some methods for small samples, notably Gosset's $t$ distribution
  • To discuss the $t$ distribution we must discuss the Chi-squared distribution
  • Throughout we use the following general procedure for creating CIs

    a. Create a pivot: a statistic whose distribution does not depend on the parameter of interest

    b. Solve the probability that the pivot lies between bounds for the parameter
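For the variance of iid normal data, step (a) uses the chi-squared pivot, whose distribution is free of $\sigma^2$:

```latex
\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}
\quad\Longrightarrow\quad
P\left( \chi^2_{n-1,\,\alpha/2} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{n-1,\,1-\alpha/2} \right) = 1 - \alpha
```

Solving the inequalities for $\sigma^2$ (step (b)) gives the interval $\left[ (n-1)S^2 / \chi^2_{n-1,\,1-\alpha/2},\; (n-1)S^2 / \chi^2_{n-1,\,\alpha/2} \right]$; taking square roots yields the interval for the standard deviation used in the example below.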





Example

Confidence interval for the standard deviation of sons' heights from Galton's data

library(UsingR)
data(father.son)
x <- father.son$sheight
s <- sd(x)
n <- length(x)
round(sqrt((n - 1) * s^2 / qchisq(c(0.975, 0.025), n - 1)), 3)
## [1] 2.701 2.939



Notes about the $t$ interval

  • The $t$ interval technically assumes that the data are iid normal, though it is robust to this assumption
  • It works well whenever the distribution of the data is roughly symmetric and mound shaped
  • Paired observations are often analyzed using the $t$ interval by taking differences
  • For large degrees of freedom, $t$ quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval as the CLT yielded
  • For skewed distributions, the spirit of the $t$ interval assumptions is violated
  • Also, for skewed distributions, it doesn't make a lot of sense to center the interval at the mean
  • In this case, consider taking logs or using a different summary like the median
  • For highly discrete data, like binary, other intervals are available

Sleep data

In R typing data(sleep) brings up the sleep data originally analyzed in Gosset's Biometrika paper, which shows the increase in hours for 10 patients on two soporific drugs. R treats the data as two groups rather than paired.


The data

data(sleep)
head(sleep)
##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6

g1 <- sleep$extra[1:10]
g2 <- sleep$extra[11:20]
difference <- g2 - g1
mn <- mean(difference)
s <- sd(difference)
n <- 10
mn + c(-1, 1) * qt(0.975, n - 1) * s / sqrt(n)
## [1] 0.7001 2.4599
t.test(difference)$conf.int
## [1] 0.7001 2.4599
## attr(,"conf.level")
## [1] 0.95
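The same interval can be reproduced outside R (a Python sketch; the data values are copied from R's sleep dataset and the quantile qt(0.975, 9) ≈ 2.262157 is hardcoded rather than computed):

```python
from math import sqrt
from statistics import mean, stdev

# extra sleep (hours) for the 10 patients under each drug, from R's sleep data
g1 = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
g2 = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]
diff = [b - a for a, b in zip(g1, g2)]  # paired differences

n = len(diff)
t_quantile = 2.262157  # qt(0.975, df = 9), hardcoded
half_width = t_quantile * stdev(diff) / sqrt(n)
print(round(mean(diff) - half_width, 4), round(mean(diff) + half_width, 4))
# 0.7001 2.4599
```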

