CS229 Lecture Notes(3): Generalized Linear Models


The exponential family

  • A class of distributions is in the exponential family if it can be written in the form

    $p(y;\eta)=b(y)\exp\left(\eta^T T(y)-a(\eta)\right)$

    where:

    • η: the natural parameter (also called the canonical parameter)
    • T(y): the sufficient statistic (it is often the case that T(y)=y)
    • a(η): the log partition function (e^{−a(η)} plays the role of a normalization constant)

    Within the exponential family, once the functional forms of T, a, and b are fixed, we obtain a family of distributions parameterized by η.

  • Bernoulli distribution family:

    $p(y;\phi)=\phi^y(1-\phi)^{1-y}=\exp\left(y\log\phi+(1-y)\log(1-\phi)\right)=\exp\left(\left(\log\frac{\phi}{1-\phi}\right)y+\log(1-\phi)\right)$

    thus we have:

    • η = log(ϕ/(1−ϕ))
    • ϕ = 1/(1+e^{−η}) (the sigmoid function!)
    • T(y)=y
    • a(η) = −log(1−ϕ) = log(1+e^{η})
    • b(y)=1

    The Bernoulli distribution is one example of the exponential family. Notably, when we write the Bernoulli distribution in exponential-family form and express the probability ϕ of y=1 in terms of the parameter η, the logistic function arises naturally: ϕ = 1/(1+e^{−η}). We will return to this observation when studying GLMs below.
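To make the rewriting concrete, here is a minimal numerical check (helper names are my own) that the standard Bernoulli pmf and its exponential-family form b(y)·exp(ηT(y)−a(η)) agree:

```python
import math

def bernoulli_pmf(y, phi):
    """Standard Bernoulli pmf: phi^y * (1 - phi)^(1 - y)."""
    return phi ** y * (1 - phi) ** (1 - y)

def bernoulli_exp_family(y, phi):
    """Same pmf in exponential-family form b(y) * exp(eta*T(y) - a(eta))."""
    eta = math.log(phi / (1 - phi))    # natural parameter
    a = math.log(1 + math.exp(eta))    # log partition function: -log(1-phi)
    b = 1.0                            # b(y) = 1
    return b * math.exp(eta * y - a)   # T(y) = y

# The two forms agree for any phi in (0, 1):
for phi in (0.2, 0.5, 0.9):
    for y in (0, 1):
        assert abs(bernoulli_pmf(y, phi) - bernoulli_exp_family(y, phi)) < 1e-12
```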

  • Gaussian distribution family (for simplicity we set σ2=1):

    $p(y;\mu)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(y-\mu)^2\right)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^2\right)\cdot\exp\left(\mu y-\frac{1}{2}\mu^2\right)$

    thus we have:

    • η=μ
    • T(y)=y
    • a(η) = μ²/2 = η²/2
    • b(y) = (1/√(2π))·exp(−y²/2)

    The Gaussian distribution is another example of the exponential family. For the Gaussian, however, the mean μ (which is also the quantity y we want to predict) is exactly the natural parameter η of the corresponding exponential-family form. We will see below why it is useful to write these distributions as exponential-family distributions parameterized by η.
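The same kind of check works for the Gaussian factorization above (helper names are my own; σ²=1 as in the text):

```python
import math

def gaussian_pdf(y, mu):
    """N(mu, 1) density in its standard form."""
    return (1 / math.sqrt(2 * math.pi)) * math.exp(-0.5 * (y - mu) ** 2)

def gaussian_exp_family(y, mu):
    """Same density as b(y) * exp(eta*y - a(eta)) with eta = mu."""
    eta = mu                                              # natural parameter
    a = eta ** 2 / 2                                      # log partition function
    b = (1 / math.sqrt(2 * math.pi)) * math.exp(-y ** 2 / 2)
    return b * math.exp(eta * y - a)

# Completing the square shows the two forms are identical:
for mu in (-1.0, 0.0, 2.5):
    for y in (-0.3, 0.0, 1.7):
        assert abs(gaussian_pdf(y, mu) - gaussian_exp_family(y, mu)) < 1e-12
```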

Constructing GLMs

  • Motivation: Given the distribution family of response variable (such as Bernoulli distribution or Gaussian distribution), how can we construct a regression/classification hypothesis?

  • Three assumptions for constructing a Generalized Linear Model:

    • p(y|x;θ) ∼ ExponentialFamily(η)
    • h(x) = E[T(y)|x] (in most cases T(y)=y, which gives h(x)=E[y|x])
    • η = θᵀx (a design choice)

    The model h(x) obtained from these three assumptions is called a Generalized Linear Model. As we will see, GLMs constructed this way have many elegant properties that make learning simple and efficient.
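The three assumptions combine mechanically: assumption 3 gives η = θᵀx, and a response function g maps η to the prediction E[T(y)|x]. A minimal sketch (the function names and parameter values are hypothetical):

```python
import math

def glm_hypothesis(theta, x, response_fn):
    """h(x) = g(theta^T x): assumption 3 (eta = theta^T x) composed
    with the response function g(eta) = E[T(y) | x; eta]."""
    eta = sum(t * xj for t, xj in zip(theta, x))  # natural parameter
    return response_fn(eta)

identity = lambda eta: eta                        # Gaussian -> linear regression
sigmoid = lambda eta: 1 / (1 + math.exp(-eta))    # Bernoulli -> logistic regression

theta, x = [0.5, -1.0], [2.0, 1.0]   # illustrative values; eta = 0.0 here
h_linear = glm_hypothesis(theta, x, identity)     # 0.0
h_logistic = glm_hypothesis(theta, x, sigmoid)    # 0.5
```

The derivations below instantiate exactly this pattern, each with a different choice of distribution and hence of g.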

  • Derivation of Ordinary Least Squares (OLS):

    • probabilistic assumption: p(y|x) ∼ N(μ, σ²) ∈ ExponentialFamily(η)
    • canonical response function: g(η) = E[T(y)|x; η] = μ = η
    • hypothesis: h_θ(x) = g(θᵀx) = θᵀx
  • Derivation of Logistic Regression:

    • probabilistic assumption: p(y|x) ∼ Bernoulli(ϕ) ∈ ExponentialFamily(η)
    • canonical response function: g(η) = E[T(y)|x; η] = ϕ = 1/(1+e^{−η})
    • hypothesis: h_θ(x) = g(θᵀx) = 1/(1+e^{−θᵀx})

Both linear regression and logistic regression are special cases of the generalized linear model. This also implies a deep commonality between the two in their learning algorithms.
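That commonality is visible in the stochastic gradient update: with the canonical response function, the rule θ_j := θ_j + α(y − h_θ(x))x_j has the identical form for linear and logistic regression; only the hypothesis h differs. A minimal sketch (function names and values are illustrative):

```python
import math

def sgd_step(theta, x, y, h, alpha=0.1):
    """One stochastic gradient step: theta_j += alpha * (y - h(x)) * x_j.
    The same rule serves both models below."""
    pred = h(theta, x)
    return [t + alpha * (y - pred) * xj for t, xj in zip(theta, x)]

def linear_h(theta, x):        # identity response -> linear regression
    return sum(t * xj for t, xj in zip(theta, x))

def logistic_h(theta, x):      # sigmoid response -> logistic regression
    return 1 / (1 + math.exp(-linear_h(theta, x)))

theta = [0.0, 0.0]
theta_lin = sgd_step(theta, [1.0, 2.0], 1.0, linear_h)    # [0.1, 0.2]
theta_log = sgd_step(theta, [1.0, 2.0], 1.0, logistic_h)  # [0.05, 0.1]
```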

  • Derivation of Softmax Regression:
    • multi-class classification problem
    • probabilistic assumption: p(y|x) ∼ Multinomial(ϕ₁, ..., ϕ_{k−1}) ∈ ExponentialFamily(η) with:
      • T(y) ∈ ℝ^{k−1} and
        $T(y)_i = 1\{y=i\} = \begin{cases}1 & y=i \\ 0 & y\neq i\end{cases}$
      • $a(\eta) = -\log\phi_k = -\log\left(1-\sum_{i=1}^{k-1}\phi_i\right)$
      • b(y) = 1
      • η ∈ ℝ^{k−1} and
        $\eta_i = \log\frac{\phi_i}{\phi_k}$
    • canonical response function:
      $g(\eta)_i = E[T(y)_i\,|\,x;\eta] = \phi_i = \frac{e^{\eta_i}}{1+\sum_{j=1}^{k-1}e^{\eta_j}}$

      which is called the softmax function
    • hypothesis:
      $[h_\theta(x)]_i = g(\eta)_i = \frac{e^{\theta_i^T x}}{1+\sum_{j=1}^{k-1}e^{\theta_j^T x}}$

      which is called softmax regression
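The softmax hypothesis can be evaluated directly from its definition; the k-th class probability is 1 minus the sum of the first k−1, i.e. 1/(1+Σⱼ e^{ηⱼ}). A minimal sketch with hypothetical parameters:

```python
import math

def softmax_hypothesis(thetas, x):
    """[h(x)]_i = exp(theta_i^T x) / (1 + sum_j exp(theta_j^T x)) for
    i = 1..k-1; the k-th class probability is the remaining mass."""
    etas = [sum(t * xj for t, xj in zip(theta_i, x)) for theta_i in thetas]
    denom = 1 + sum(math.exp(e) for e in etas)
    probs = [math.exp(e) / denom for e in etas]
    return probs + [1 / denom]   # append phi_k = 1 - sum of the others

# k = 3 classes, zero parameters for the first k-1 classes (illustrative),
# so all three classes come out equally likely:
probs = softmax_hypothesis([[0.0, 0.0], [0.0, 0.0]], [1.0, 1.0])
assert abs(sum(probs) - 1) < 1e-12   # a valid probability distribution
```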