Statistical Inference, Week 1


1.1 Introduction to statistical inference

Statistical inference is the process of drawing formal conclusions from data.

In our class, we will define formal statistical inference as settings where one wants to infer facts about a population using noisy statistical data where uncertainty must be accounted for.

The motivating examples below include an election and hormone replacement therapy (HRT) for menopausal symptoms.

Motivating example: who's going to win the election?

In every major election, pollsters would like to know, ahead of the actual election, who's going to win. Here, the target of estimation (the estimand) is clear: the percentage of people in a particular group (city, state, county, country or other electoral grouping) who will vote for each candidate.

We cannot poll everyone. Even if we could, some of those polled may change their vote by the time the election occurs. How do we collect a reasonable subset of data and quantify the uncertainty in the process to produce a good guess at who will win?

Motivating example: is hormone replacement therapy effective?

A large clinical trial (the Women’s Health Initiative) published results in 2002 that contradicted prior evidence on the efficacy of hormone replacement therapy (HRT) for postmenopausal women and suggested a negative impact of HRT on several key health outcomes. Following a statistically based protocol, the study was stopped early due to an excess number of negative events.

Here there are two inferential problems.

  1. Is HRT effective?
  2. How long should we continue the trial in the presence of contrary evidence?

See the WHI writing group paper, JAMA 2002, Vol 288:321-333, for the paper, and Steinkellner et al., Menopause 2012, Vol 19:616-621, for a discussion of the long-term impacts.

Motivating example: ECMO

In 1985 a group at a major neonatal intensive care center published the results of a trial comparing a standard treatment and a promising new extracorporeal membrane oxygenation treatment (ECMO) for newborn infants with severe respiratory failure. ECMO is often regarded as a marker of the critical-care capability of a hospital, or even of a region or a country. Ethical considerations led to a statistical randomization scheme whereby one infant received the control therapy, thereby opening the study to sample-size based criticisms.

For a review and statistical discussion, see Royall Statistical Science 1991, Vol 6, No. 1, 52-88

Summary

  • These examples illustrate many of the difficulties of trying to use data to create general conclusions about a population.
  • Paramount among our concerns are:
    • Is the sample representative of the population that we'd like to draw inferences about?
    • Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions?
    • Is there systematic bias created by missing data or the design or conduct of the study?
    • What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization or random sampling, or implicit as the aggregation of many complex unknown processes.
    • Are we trying to estimate an underlying mechanistic model of the phenomena under study?
  • Statistical inference requires navigating the set of assumptions and tools and subsequently thinking about how to draw conclusions from data.
  • Example goals of inference

    1. Estimate and quantify the uncertainty of an estimate of a population quantity (the proportion of people who will vote for a candidate).
    2. Determine whether a population quantity is a benchmark value ("is the treatment effective?").
    3. Infer a mechanistic relationship when quantities are measured with noise ("What is the slope for Hooke's law?")
    4. Determine the impact of a policy ("If we reduce pollution levels, will asthma rates decline?").
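
The first goal above (estimating a population proportion and quantifying its uncertainty) can be sketched as follows. This is a minimal illustration, not a polling methodology: the 52% support level, the sample size of 1000, and the helper name `poll_estimate` are all assumptions made for the example, and the interval is the simple normal-approximation (Wald) interval.

```python
import math
import random

def poll_estimate(votes, z=1.96):
    """Return the sample proportion and an approximate 95% Wald interval.

    `votes` is a list of 0/1 values (1 = supports the candidate).
    """
    n = len(votes)
    p_hat = sum(votes) / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    return p_hat, (p_hat - z * se, p_hat + z * se)

# Hypothetical population in which 52% support the candidate (assumed value).
rng = random.Random(1)
true_p = 0.52
sample = [1 if rng.random() < true_p else 0 for _ in range(1000)]

p_hat, (lo, hi) = poll_estimate(sample)
print(f"estimate = {p_hat:.3f}, interval = ({lo:.3f}, {hi:.3f})")
```

With 1000 respondents the interval is roughly six percentage points wide, which shows why a close race is hard to call from a single poll of this size.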

Example tools of the trade

  1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest
  2. Random sampling: concerned with obtaining data that is representative of the population of interest
  3. Sampling models: concerned with creating a model for the sampling process; the most common is the so-called "iid" model.
  4. Hypothesis testing: concerned with decision making in the presence of uncertainty
  5. Confidence intervals: concerned with quantifying uncertainty in estimation
  6. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are approximated.
  7. Study design: the process of designing an experiment to minimize biases and variability.
  8. Nonparametric bootstrapping: the process of using the data to, with minimal probability model assumptions, create inferences.
  9. Permutation, randomization and exchangeability testing: the process of using data permutations to perform inferences
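
As one illustration of the tools listed above, nonparametric bootstrapping (item 8) can be sketched in a few lines. The data values, the choice of the median as the statistic, and the helper name `bootstrap_ci` are assumptions for this example; the interval shown is the basic percentile interval, which is only one of several bootstrap intervals.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.median, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for `stat`, using resampling with replacement."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the data with replacement and recompute the statistic each time.
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return reps[lo_idx], reps[hi_idx]

# Made-up measurements, for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 3.0]
low, high = bootstrap_ci(data)
print(f"sample median = {statistics.median(data)}, interval = ({low}, {high})")
```

No distributional form is assumed for the data; the only modeling assumption is that the observed sample stands in for the population.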


Term to note: causal inference.

1.2 Probability basics





Example

The National Sleep Foundation (www.sleepfoundation.org) reports that around 3% of the American population has sleep apnea. They also report that around 10% of the North American and European population has restless leg syndrome. Does this imply that 13% of people will have at least one sleep problem of these sorts?
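
The answer is no: the two conditions are not mutually exclusive, so simply adding the probabilities counts people who have both conditions twice (the two figures also refer to different populations). The general rule is P(A or B) = P(A) + P(B) - P(A and B). The sketch below applies it; the overlap probability is not given in the text, so the value used here comes from an independence assumption made purely for illustration.

```python
def prob_union(p_a, p_b, p_both):
    """P(A or B) by inclusion-exclusion."""
    return p_a + p_b - p_both

p_apnea = 0.03  # from the example above
p_rls = 0.10    # from the example above

# Assumption for illustration only: the two conditions are independent,
# so P(both) = P(apnea) * P(RLS). The real overlap is not stated.
p_both = p_apnea * p_rls

p_either = prob_union(p_apnea, p_rls, p_both)
print(round(p_either, 4))
```

Whatever the true overlap is, the result cannot exceed 13%, and it equals 13% only if nobody has both conditions.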



Examples of variables that can be thought of as random variables

  • The $(0-1)$ outcome of the flip of a coin
  • The outcome from the roll of a die
  • The BMI of a subject four years after a baseline measurement
  • The hypertension status of a subject randomly drawn from a population


CDF and survival function

  • The cumulative distribution function (CDF) of a random variable X is defined as the function  F(x) = P(X <= x) 
  • This definition applies regardless of whether X is discrete or continuous.
  • The survival function of a random variable X is defined as S(x) = P(X > x) 
  • Notice that S(x) = 1 - F(x)
  • For continuous random variables, the PDF is the derivative of the CDF
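
These relationships can be checked numerically. The sketch below uses a standard normal distribution as an assumed example (the notes do not name a distribution here) and Python's standard-library `statistics.NormalDist`; the derivative of the CDF is approximated with a central difference.

```python
from statistics import NormalDist

# Assumed example distribution: standard normal.
X = NormalDist(mu=0.0, sigma=1.0)

def cdf(x):
    """F(x) = P(X <= x)."""
    return X.cdf(x)

def survival(x):
    """S(x) = P(X > x) = 1 - F(x)."""
    return 1.0 - X.cdf(x)

x = 1.0
h = 1e-5
# Central-difference approximation to the derivative of the CDF at x.
numeric_pdf = (cdf(x + h) - cdf(x - h)) / (2 * h)

print(f"F({x}) = {cdf(x):.4f}, S({x}) = {survival(x):.4f}")
print(f"dF/dx ~ {numeric_pdf:.6f}, pdf = {X.pdf(x):.6f}")
```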

Quantiles

  • The $\alpha^{th}$ quantile of a distribution with distribution function $F$ is the point $x_\alpha$ so that $F(x_\alpha) = \alpha$
  • A percentile is simply a quantile with $\alpha$ expressed as a percent
  • The median is the $50^{th}$ percentile
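
A quantile is obtained by inverting the CDF. The sketch below again assumes a standard normal distribution as the example and uses `NormalDist.inv_cdf` from the standard library; the helper name `quantile` is introduced only for this illustration.

```python
from statistics import NormalDist

# Assumed example distribution: standard normal.
X = NormalDist(mu=0.0, sigma=1.0)

def quantile(alpha):
    """Return the point x_alpha with F(x_alpha) = alpha."""
    return X.inv_cdf(alpha)

median_x = quantile(0.50)   # the 50th percentile
x_975 = quantile(0.975)     # the 97.5th percentile

print(f"median = {median_x:.4f}, 97.5th percentile = {x_975:.4f}")
# Plugging a quantile back into the CDF recovers alpha.
print(f"F(x_0.975) = {X.cdf(x_975):.4f}")
```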


1.3 Expected values

Remark

  • Therefore, the expected value of the sample mean is the population mean that it's trying to estimate
  • When the expected value of an estimator is what it's trying to estimate, we say that the estimator is unbiased
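Side note: let $\hat{\theta}(X_1, X_2, \ldots, X_n)$ be an estimator of $\theta$. If $E(\hat{\theta}) = \theta$ for every $\theta \in \Theta$, then $\hat{\theta}$ is called an unbiased estimator of $\theta$; otherwise it is called a biased estimator of $\theta$.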
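
The unbiasedness of the sample mean can be illustrated by simulation. The sketch below assumes a fair six-sided die as the population (population mean 3.5), a sample size of 10, and 20000 repetitions; these choices are arbitrary and only serve the illustration.

```python
import random
import statistics

rng = random.Random(42)
population_mean = 3.5  # mean of a fair six-sided die

def sample_mean(n):
    """Mean of n simulated die rolls."""
    return statistics.fmean(rng.randint(1, 6) for _ in range(n))

# Each individual sample mean is noisy, but their average approximates E[sample mean].
means = [sample_mean(10) for _ in range(20000)]
avg_of_means = statistics.fmean(means)

print(f"average of sample means = {avg_of_means:.3f} (population mean = {population_mean})")
```

Individual sample means vary widely around 3.5, yet their long-run average settles near the population mean, which is what unbiasedness claims.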


In any case, the value is probably somewhat overestimated; the actual value may not be that high.

1.4 Independence




IID random variables

  • Random variables are said to be iid if they are independent and identically distributed
  • iid random variables are the default model for random samples
  • Many of the important theories of statistics are founded on assuming that variables are iid







Figure: why dividing by n-1, rather than n, gives an unbiased estimator of the variance.
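
The same point can be checked by simulation. The sketch below assumes iid standard normal data (true variance 1), a small sample size of 5, and 40000 repetitions; `statistics.pvariance` divides by n and `statistics.variance` divides by n-1.

```python
import random
import statistics

rng = random.Random(7)
n, n_sims = 5, 40000
true_var = 1.0  # variance of the standard normal used below

biased, unbiased = [], []
for _ in range(n_sims):
    x = [rng.gauss(0, 1) for _ in range(n)]    # one iid sample of size n
    biased.append(statistics.pvariance(x))     # sum of squares divided by n
    unbiased.append(statistics.variance(x))    # sum of squares divided by n-1

mean_biased = statistics.fmean(biased)
mean_unbiased = statistics.fmean(unbiased)

print(f"divide by n:   {mean_biased:.3f}")
print(f"divide by n-1: {mean_unbiased:.3f} (true variance = {true_var})")
```

Dividing by n underestimates the variance on average by a factor of (n-1)/n, here 0.8, because the deviations are measured from the sample mean rather than from the unknown population mean.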






