2 PROBABILITY DISTRIBUTIONS

Source: Internet | Published by: 印刷设计图像软件 | Editor: 程序博客网 | Time: 2024/06/06 03:16

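1 Given a finite set of training samples, building a model to estimate their probability distribution p(x) is called density estimation (an unsupervised learning problem).

We assume the observations {x1, x2, ..., xN} are i.i.d. Choosing an appropriate distribution for the model at hand is a core problem in PR (pattern recognition).

The issue of choosing an appropriate distribution relates to the problem of model selection, which is a central issue in PR.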

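Conjugate prior: if the posterior distribution p(θ|x) has the same functional form as the prior p(θ), the two are called conjugate distributions, and the prior p(θ) is called the conjugate prior of the likelihood function p(x|θ). The purpose is to make later integrals (e.g. for the predictive distribution) easy to compute. For example, the conjugate prior for the mean of a Gaussian distribution (with known variance) is again a Gaussian distribution.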
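To make the idea of conjugacy concrete, here is a minimal sketch using a Beta prior with a Bernoulli likelihood rather than the Gaussian case mentioned above, because the update then reduces to adding counts. The function names and the prior Beta(2, 2) are illustrative assumptions, not part of the original notes.

```python
from fractions import Fraction

def beta_bernoulli_update(a, b, observations):
    """Return the posterior Beta parameters after Bernoulli observations.

    Prior: Beta(a, b) over u. Likelihood: each x in {0, 1} is Bernoulli(u).
    Because the Beta prior is conjugate, the posterior is again a Beta,
    with the counts of ones and zeros added to a and b.
    """
    ones = sum(observations)
    zeros = len(observations) - ones
    return a + ones, b + zeros

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution, a / (a + b)."""
    return Fraction(a, a + b)

# Hypothetical prior Beta(2, 2) and three observations that are all 1.
a_post, b_post = beta_bernoulli_update(2, 2, [1, 1, 1])
print(a_post, b_post)               # posterior Beta parameters
print(beta_mean(a_post, b_post))    # posterior mean of u
```

Because the posterior stays in the Beta family, it can serve as the prior for the next batch of data without any numerical integration.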

Next, several typical parametric distributions are covered:

•Bernoulli distribution

•Binomial distribution

•Beta distribution

•Multinomial distribution

•Dirichlet distribution

•Gaussian distribution

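For discrete random variables: the binomial and multinomial distributions;

for continuous random variables: the Gaussian distribution. Each of these distributions is governed by a small number of fixed parameters.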

How do we determine suitable values for their parameters?

In a frequentist treatment (频率论的观点), we choose specific parameter values by optimizing some criterion, such as the likelihood function.

In a Bayesian treatment (贝叶斯观点), we introduce a prior distribution over the parameters and then use Bayes' theorem to compute the corresponding posterior distribution given the observed data.

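The question to keep in mind is: how do we optimize the model, or in other words, how do we determine the model coefficients w?

From the frequentist viewpoint, we optimize a loss, for example by maximum likelihood estimation (MLE): maximizing the likelihood function yields the model parameters. (Under a Gaussian noise assumption this is equivalent to least squares, which is prone to over-fitting.)

From the Bayesian viewpoint, we can use for example the maximum a posteriori (MAP) estimate, which introduces a prior distribution. (With a Gaussian prior on w this is equivalent to regularized least squares, which can suppress the over-fitting of plain least squares.)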
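The contrast between MLE (least squares) and MAP (regularized least squares) can be sketched for the simplest possible model, y = w * x with a single coefficient. This is a sketch under stated assumptions: Gaussian noise, a zero-mean Gaussian prior on w, and made-up data; `lam` stands for the ratio of noise variance to prior variance, and all names here are hypothetical.

```python
def fit_least_squares(xs, ys):
    """ML estimate of w in y = w * x under Gaussian noise (plain least squares)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / sxx

def fit_regularized(xs, ys, lam):
    """MAP estimate of w with a zero-mean Gaussian prior (regularized least squares).

    lam >= 0 is the regularization strength; lam = 0 recovers least squares.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Made-up data, roughly y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
w_ml = fit_least_squares(xs, ys)
w_map = fit_regularized(xs, ys, lam=1.0)
print(w_ml, w_map)  # the MAP estimate is pulled toward the prior mean 0
```

The extra `lam` in the denominator is the only difference between the two estimators; it shrinks w toward zero, and its influence fades as the amount of data (and hence the sum of x squared) grows.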

2.1 Bernoulli distribution

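The binary Bernoulli distribution describes an event x that occurs with probability p(x=1|u) = u and does not occur with probability p(x=0|u) = 1-u.

The Bernoulli distribution can be written as Bern(x|u) = u^x (1-u)^(1-x); its expectation is u and its variance is u(1-u).

Suppose we have an input set D = {x1, x2, ..., xN}, i.i.d.;

then we can construct the likelihood function for the set {x1, x2, ..., xN}:

p(D|u) = ∏_{n=1}^{N} p(xn|u) = ∏_{n=1}^{N} u^(xn) (1-u)^(1-xn)

The likelihood function above depends on the observations only through the sum of the xn, which is therefore a sufficient statistic.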
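The sufficient-statistic property can be checked numerically. The sketch below computes the Bernoulli log-likelihood once from the raw data and once from the count of ones alone; the two data sets are made up, and the function names are illustrative.

```python
import math

def bernoulli_log_likelihood(u, data):
    """log p(D|u) = sum_n [x_n log u + (1 - x_n) log(1 - u)], for 0 < u < 1."""
    return sum(x * math.log(u) + (1 - x) * math.log(1 - u) for x in data)

def log_likelihood_from_counts(u, m, n):
    """The same quantity computed only from m = sum of x_n and n = len(data)."""
    return m * math.log(u) + (n - m) * math.log(1 - u)

# Two different data sets with the same number of ones (m = 3, N = 5).
d1 = [1, 1, 0, 0, 1]
d2 = [0, 1, 1, 1, 0]
u = 0.4
ll1 = bernoulli_log_likelihood(u, d1)
ll2 = bernoulli_log_likelihood(u, d2)
ll_counts = log_likelihood_from_counts(u, 3, 5)
print(ll1, ll2, ll_counts)  # all three agree up to floating-point rounding
```

The order of the observations does not matter; only how many of them equal 1.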

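In a frequentist treatment, we estimate the parameter by maximizing the likelihood function (MLE):

u_ML = (1/N) Σ_{n=1}^{N} xn = m/N,

where m is the number of observations with x = 1; that is, u_ML is the fraction of observations in which the event occurred.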
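As a small illustration, the maximum likelihood estimate for a Bernoulli parameter is just the sample mean. The sketch below also confirms this with a coarse grid search over the log-likelihood; the grid and the data are assumptions made for the example.

```python
import math

def bernoulli_mle(data):
    """Maximum likelihood estimate of u: the fraction of observations equal to 1."""
    if not data:
        raise ValueError("need at least one observation")
    return sum(data) / len(data)

def log_likelihood(u, data):
    """Bernoulli log-likelihood for 0 < u < 1."""
    m = sum(data)
    return m * math.log(u) + (len(data) - m) * math.log(1 - u)

data = [1, 0, 1, 1]
u_ml = bernoulli_mle(data)

# Grid search over u = 0.01, 0.02, ..., 0.99 as an independent check.
grid = [i / 100 for i in range(1, 100)]
grid_best = max(grid, key=lambda u: log_likelihood(u, data))
print(u_ml, grid_best)
```

For data that are all ones, such as [1, 1, 1], `bernoulli_mle` returns 1.0, which is the boundary case where the log-likelihood has no interior maximum.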

Over-fitting

Suppose we flip a coin 3 times and happen to observe 3 heads. Then N = m = 3, and maximizing the likelihood function gives u_ML = 1. That means we would predict that every future observation will be heads. Common sense tells us that this is unreasonable; it is an over-fitting phenomenon associated with maximum likelihood.

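Over-fitting typically arises when (1) there are too few training samples, or (2) the model is too complex.

This is a typical example of too few training samples. If we increase the number of flips, then once it is large enough, common sense tells us that for a fair coin the estimated probability of heads should be close to 0.5.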
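The effect of sample size on the maximum likelihood estimate can be seen in a quick simulation. This is only a sketch: it assumes a fair coin (`u_true = 0.5`) and a fixed seed for repeatability, and the exact values printed depend on the random number generator.

```python
import random

def simulate_mle(num_flips, u_true=0.5, seed=0):
    """Flip a simulated coin num_flips times and return the ML estimate m / N."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(num_flips) if rng.random() < u_true)
    return heads / num_flips

# With very few flips the estimate can be far from 0.5 (even 0.0 or 1.0);
# with many flips it settles close to the true value.
for n in (3, 30, 3000, 300000):
    print(n, simulate_mle(n))
```

With only 3 flips the estimate can take just four values (0, 1/3, 2/3, 1), so an extreme estimate is quite likely; the spread of the estimate shrinks roughly like 1/sqrt(N) as N grows.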

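Binomial distribution: for example, if a coin is flipped N times, the probability of observing m heads is:

Bin(m|N,u) = C(N,m) u^m (1-u)^(N-m), where C(N,m) = N! / (m! (N-m)!),

with expectation Nu and variance Nu(1-u).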
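The binomial probabilities, together with the stated mean Nu and variance Nu(1-u), can be verified numerically. The sketch below uses `math.comb` from the Python standard library (Python 3.8 or later); the values N = 10 and u = 0.3 are arbitrary choices for illustration.

```python
import math

def binomial_pmf(m, n, u):
    """Probability of m heads in n flips when each flip is heads with probability u."""
    return math.comb(n, m) * u**m * (1 - u) ** (n - m)

n, u = 10, 0.3
pmf = [binomial_pmf(m, n, u) for m in range(n + 1)]

total = sum(pmf)                                       # should be 1
mean = sum(m * p for m, p in enumerate(pmf))           # should equal n * u
var = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))  # should equal n * u * (1 - u)
print(total, mean, var)
```

Setting n = 1 recovers the Bernoulli distribution, since `binomial_pmf(1, 1, u)` is u and `binomial_pmf(0, 1, u)` is 1 - u.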