Statistical Inference: Week 3


3.1 Likelihood

Likelihood

  • A common and fruitful approach to statistics is to assume that the data arise from a family of distributions indexed by a parameter that represents a useful summary of the distribution. (For example, if x1, ..., xn all follow a N(mu, sigma^2) distribution, the indexing parameters are mu and sigma.)
  • The likelihood of a collection of data is the joint density evaluated as a function of the parameters with the data fixed
  • Likelihood analysis of data uses the likelihood to perform inference regarding the unknown parameter

Likelihood

Given a statistical probability mass function or density, say f(x, theta), where theta is an unknown parameter, the likelihood is f viewed as a function of theta for a fixed, observed value of x.

Interpretations of likelihoods

The likelihood has the following properties:

  1. Ratios of likelihood values measure the relative evidence of one value of the unknown parameter to another. (See the example below.)
  2. Given a statistical model and observed data, all of the relevant information contained in the data regarding the unknown parameter is contained in the likelihood.
  3. If {X} are independent random variables, then their likelihoods multiply. That is, the likelihood of the parameters given all of the Xi is simply the product of the individual likelihoods.
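As a minimal sketch of property 3, the joint likelihood of independent observations equals the product of the individual likelihoods; the data values and theta below are made up purely for illustration:

```r
# Hypothetical iid N(theta, 1) observations; values chosen only for illustration.
x <- c(1.2, 0.8, 1.5)
theta <- 1

# Joint likelihood of all observations at theta ...
joint <- prod(dnorm(x, mean = theta, sd = 1))
# ... equals the product of the individual likelihoods, by independence.
individual <- dnorm(x[1], theta, 1) * dnorm(x[2], theta, 1) * dnorm(x[3], theta, 1)
all.equal(joint, individual)
```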

Plotting likelihoods

  • Generally, we want to consider all the values of theta between 0 and 1
  • A likelihood plot displays theta on the horizontal axis against L(theta, x) on the vertical axis
  • Because the likelihood measures relative evidence, dividing the curve by its maximum value (or any other value for that matter) does not change its interpretation

The key point is the division by dbinom(3, 4, 3/4): scaling the likelihood by its maximum, attained at the MLE p = 3/4 (3 successes in 4 trials), makes the curve peak at 1 without changing the relative evidence.
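A minimal sketch of such a normalized likelihood plot, using the course's coin-flip example of 3 successes in 4 trials:

```r
# Normalized binomial likelihood for p after observing 3 successes in 4 trials.
# Dividing by dbinom(3, 4, 3/4), the likelihood at the MLE p = 3/4, makes the
# curve peak at (essentially) 1.
pvals <- seq(0.01, 0.99, length = 1000)
lik <- dbinom(3, 4, pvals) / dbinom(3, 4, 3/4)
plot(pvals, lik, type = "l", lwd = 3, frame = FALSE,
     xlab = "p", ylab = "likelihood / max likelihood")
```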



Example

  • You saw 5 failure events per 94 days of monitoring a nuclear pump.
  • Assuming Poisson, plot the likelihood
```r
lambda <- seq(0, .2, length = 1000)
likelihood <- dpois(5, 94 * lambda) / dpois(5, 5)
plot(lambda, likelihood, frame = FALSE, lwd = 3, type = "l", xlab = expression(lambda))
lines(rep(5/94, 2), 0 : 1, col = "red", lwd = 3)
lines(range(lambda[likelihood > 1/16]), rep(1/16, 2), lwd = 2)
lines(range(lambda[likelihood > 1/8]), rep(1/8, 2), lwd = 2)
```

3.2 Bayesian analysis



## Exploring the beta density

```r
library(manipulate)
pvals <- seq(0.01, 0.99, length = 1000)
manipulate(
    plot(pvals, dbeta(pvals, alpha, beta), type = "l", lwd = 3, frame = FALSE),
    alpha = slider(0.01, 10, initial = 1, step = .5),
    beta = slider(0.01, 10, initial = 1, step = .5)
    )
```
The beta distribution here serves as the conjugate prior for the binomial success probability p.
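A minimal sketch of the beta-binomial conjugate update; the prior parameters below are assumed purely for illustration, while x and n anticipate the hypertension example:

```r
# Beta-binomial conjugacy: a Beta(alpha, beta) prior combined with x successes
# in n trials yields a Beta(alpha + x, beta + n - x) posterior.
alpha <- 2; beta <- 2      # assumed prior parameters, for illustration only
x <- 13; n <- 20           # data from the hypertension example
post_mean <- (alpha + x) / (alpha + beta + n)
post_mean                  # shrinks the MLE x/n = 0.65 toward the prior mean 0.5
```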





Hypertension example
```r
pvals <- seq(0.01, 0.99, length = 1000)
x <- 13; n <- 20
myPlot <- function(alpha, beta){
    plot(0 : 1, 0 : 1, type = "n", xlab = "p", ylab = "", frame = FALSE)
    lines(pvals, dbeta(pvals, alpha, beta) / max(dbeta(pvals, alpha, beta)),
            lwd = 3, col = "darkred")
    lines(pvals, dbinom(x, n, pvals) / dbinom(x, n, x/n), lwd = 3, col = "darkblue")
    lines(pvals, dbeta(pvals, alpha + x, beta + (n - x)) / max(dbeta(pvals, alpha + x, beta + (n - x))),
        lwd = 3, col = "darkgreen")
    title("red = prior, green = posterior, blue = likelihood")
}
manipulate(
    myPlot(alpha, beta),
    alpha = slider(0.01, 10, initial = 1, step = .5),
    beta = slider(0.01, 10, initial = 1, step = .5)
    )
```

Getting HPD intervals for this example

  • Install the binom package; then the command

```r
library(binom)
binom.bayes(13, 20, type = "highest")
```

```
  method  x  n shape1 shape2   mean  lower  upper  sig
1  bayes 13 20   13.5    7.5 0.6429 0.4423 0.8361 0.05
```

gives the HPD interval.

  • The default credible level is 95% and the default prior is the Jeffreys prior.
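For comparison, a sketch of the central (equal-tail) 95% credible interval from the same Jeffreys-prior posterior; the HPD interval above is the shortest interval with the stated coverage, so the central interval can only be as wide or wider:

```r
# Central 95% credible interval from the posterior Beta(13.5, 7.5),
# i.e. the Jeffreys Beta(0.5, 0.5) prior updated with 13 successes in 20 trials.
central <- qbeta(c(0.025, 0.975), 13.5, 7.5)
central
```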

```r
pvals <- seq(0.01, 0.99, length = 1000)
x <- 13; n <- 20
myPlot2 <- function(alpha, beta, cl){
    plot(pvals, dbeta(pvals, alpha + x, beta + (n - x)), type = "l", lwd = 3,
        xlab = "p", ylab = "", frame = FALSE)
    out <- binom.bayes(x, n, type = "highest",
        prior.shape1 = alpha,
        prior.shape2 = beta,
        conf.level = cl)
    p1 <- out$lower; p2 <- out$upper
    lines(c(p1, p1, p2, p2), c(0, dbeta(c(p1, p2), alpha + x, beta + (n - x)), 0),
        type = "l", lwd = 3, col = "darkred")
}
manipulate(
    myPlot2(alpha, beta, cl),
    alpha = slider(0.01, 10, initial = 1, step = .5),
    beta = slider(0.01, 10, initial = 1, step = .5),
    cl = slider(0.01, 0.99, initial = 0.95, step = .01)
    )
```

3.3 Two-group intervals

Independent group t confidence intervals

  • Suppose that we want to compare the mean blood pressure between two groups in a randomized trial: those who received the treatment versus those who received a placebo
  • We cannot use the paired t test because the groups are independent and may have different sample sizes
  • We now present methods for comparing independent groups






```r
data(sleep)
x1 <- sleep$extra[sleep$group == 1]
x2 <- sleep$extra[sleep$group == 2]
n1 <- length(x1)
n2 <- length(x2)
sp <- sqrt( ((n1 - 1) * sd(x1)^2 + (n2 - 1) * sd(x2)^2) / (n1 + n2 - 2))
md <- mean(x1) - mean(x2)
semd <- sp * sqrt(1 / n1 + 1 / n2)
md + c(-1, 1) * qt(.975, n1 + n2 - 2) * semd
```

```
[1] -3.3639  0.2039
```

```r
t.test(x1, x2, paired = FALSE, var.equal = TRUE)$conf
```

```
[1] -3.3639  0.2039
attr(,"conf.level")
[1] 0.95
```

```r
t.test(x1, x2, paired = TRUE)$conf
```

```
[1] -2.4599 -0.7001
attr(,"conf.level")
[1] 0.95
```
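When the equal-variance assumption is in doubt, a sketch of the alternative is Welch's unequal-variance interval, which is in fact t.test's default:

```r
# Welch's unequal-variance t interval for the same sleep data
# (var.equal = FALSE, the default in t.test).
data(sleep)
x1 <- sleep$extra[sleep$group == 1]
x2 <- sleep$extra[sleep$group == 2]
t.test(x1, x2, paired = FALSE, var.equal = FALSE)$conf
```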



Comparing other kinds of data

  • For binomial data, there's lots of ways to compare two groups
    • Relative risk, risk difference, odds ratio.
    • Chi-squared tests, normal approximations, exact tests.
  • For count data, there's also Chi-squared tests and exact tests.
  • We'll leave the discussion of comparing groups of binary and count data until GLMs are covered in the regression class.
  • In addition, Mathematical Biostatistics Boot Camp 2 covers many special cases relevant to biostatistics.

3.4 Hypothesis testing

Hypothesis testing

  • Hypothesis testing is concerned with making decisions using data
  • A null hypothesis is specified that represents the status quo, usually labeled H0
  • The null hypothesis is assumed true and statistical evidence is required to reject it in favor of a research or alternative hypothesis

Discussion

  • Consider a court of law; the null hypothesis is that the defendant is innocent
  • We require evidence to reject the null hypothesis (convict)
  • If we require little evidence, then we would increase the percentage of innocent people convicted (type I errors); however we would also increase the percentage of guilty people convicted (correctly rejecting the null)
  • If we require a lot of evidence, then we increase the percentage of innocent people let free (correctly accepting the null) while we would also increase the percentage of guilty people let free (type II errors)







T test in R

```r
library(UsingR); data(father.son)
t.test(father.son$sheight - father.son$fheight)
```

```
        One Sample t-test

data:  father.son$sheight - father.son$fheight
t = 11.79, df = 1077, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.831 1.163
sample estimates:
mean of x 
    0.997 
```

Exact binomial test

  • Recall this problem: suppose a friend has 8 children, 7 of which are girls and none are twins
  • Perform the relevant hypothesis test: H0 : p = 0.5 versus Ha : p > 0.5
    • What is the relevant rejection region so that the probability of rejecting is (less than) 5%?
| Rejection region | Type I error rate |
| --- | --- |
| [0 : 8] | 1 |
| [1 : 8] | 0.9961 |
| [2 : 8] | 0.9648 |
| [3 : 8] | 0.8555 |
| [4 : 8] | 0.6367 |
| [5 : 8] | 0.3633 |
| [6 : 8] | 0.1445 |
| [7 : 8] | 0.0352 |
| [8 : 8] | 0.0039 |

Notes

  • It's impossible to get an exact 5% level test for this case due to the discreteness of the binomial.
    • The closest is the rejection region [7 : 8]
    • Any alpha level lower than 0.0039 is not attainable.
  • For larger sample sizes, we could do a normal approximation, but you already knew this.
  • A two-sided test isn't obvious.
    • Given a way to do two-sided tests, we could take the set of values of p0 for which we fail to reject to obtain an exact binomial confidence interval (the Clopper-Pearson interval, by the way)
  • For these problems, people always create a P-value (next lecture) rather than computing the rejection region.
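As a sketch of that P-value approach for this example, the exact one-sided P-value is the binomial tail probability itself:

```r
# Exact one-sided P-value for 7 girls out of 8 births under H0: p = 0.5:
# P(X >= 7) = (choose(8, 7) + choose(8, 8)) / 2^8 = 9/256.
binom.test(7, 8, p = 0.5, alternative = "greater")$p.value
```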
