PyMC3

来源：互联网发布：逻辑学悖论破解知乎编辑：程序博客网时间：2024/06/05 11:26

GLM：线性回归

GLM即Generalized linear model，广义线性模型。
贝叶斯统计的一些软件工具包JAGS, BUGS, Stan 和 PyMC，使用这些工具包需要对将要简历的模型有充分的了解。

线性回归的传统形式

通常，频率学派将线性回归表述为：

Y = X β + ϵ

其中，

Y 是我们期望预测的输出（因变量）；

X 是预测因子（自变量）；

β 是待估计的模型系数（参数）；

ϵ 是误差项，假设为正态分布。

可以使用普通最小二乘或者最大似然法找到 β 的最优解。

线性回归的概率形式

贝叶斯学派采样概率分布形式描述线性回归模型，可表达为如下形式：

Y \sim  (X β, σ 2)

其中，

Y 是一个随机变量，其取值（每个数据点）服从正态分布。这个正态分布的均值有线性变量

Xβ 决定，方差由

σ2 决定。

这个描述有如下两个优点：
* 先验：可以将我们已有的任何先验赋予参数。
* 量化不确定性：不仅得到单个 β 的估计，而是得到 β 的整个后验，表征了 β 对不同取值的可能性。

PyMC3中的贝叶斯GLMs

使用PyMC3中的GLM模型可以简单的构建复杂模型。

%matplotlib inlineimport pymc3 as pmimport numpy as npimport matplotlib.pyplot as plt

生成数据

以截距和斜率产生一条直线，并以直线位均值在其附近产生正态分布的数据点。

# data sizesize = 200# true line parametertrue_intercept = 1true_slope = 2# y = a + b*xx = np.linspace(0, 1, size)true_regression_line = true_intercept + true_slope * x# add noise (sampled data)y = true_regression_line + np.random.normal(scale=.5, size=size)# simulate data data = dict(x=x, y=y)# plot the datafig = plt.figure(figsize=(7, 7));ax = fig.add_subplot(111, xlabel='x', ylabel='y', title='Generated data and underlying model');ax.plot(x, y, 'x', label='sampled data');ax.plot(x, true_regression_line, label='true regression line', lw=2.0);plt.legend(loc=0);

这里写图片描述

直接建模估计

使用贝叶斯线性回归进行建模并拟合这些采样数据。

with pm.Model() as naive_model: # model specifications in PyMC3 are wrapped in a with-statement    # Define priors    intercept = pm.Normal('Intercept', 0, sd=20)    slope = pm.Normal('Slope', 0, sd=20)    sigma = pm.HalfCauchy('sigma', beta=10, testval=1.)    # Define likelihood    likelihood = pm.Normal('y', mu=intercept + slope * x, sd=sigma, observed=y)

使用默认采样器进行采样，PyMC3不需要特别指定采样初始值。NUTS采样器会使用ADVI进行初始化。

with naive_model:    naive_trace = pm.sample(2000)

Auto-assigning NUTS sampler...Initializing NUTS using advi...Average ELBO = -157.32: 100%|██████████| 200000/200000 [00:18<00:00, 10888.26it/s]Finished [100%]: Average ELBO = -157.3100%|██████████| 2000/2000 [00:04<00:00, 432.16it/s]

绘制拟合的结果可以看出，截距(Intercept)的估计值约1.027(真实1)；斜率的估计值约1.893（真实2）。噪声的方差估计值0.528（真实0.5）。进行了准确的估计。

#pm.traceplot(naive_trace);pm.plot_posterior(naive_trace);

这里写图片描述

GLM建模估计

使用GLM模块进行建模估计。

with pm.Model() as glm_model:    pm.glm.glm('y~x',data,\               {'Intercept': pm.Normal.dist(mu=0, sd=20),\                'sd':pm.HalfCauchy.dist(beta=10, testval=1.),\                'x':pm.Normal.dist(mu=0, sd=20)})

??pm.glm.glm

with glm_model:    glm_trace = pm.sample(2000)

Auto-assigning NUTS sampler...Initializing NUTS using advi...Average ELBO = -157.31: 100%|██████████| 200000/200000 [00:21<00:00, 9193.40it/s] Finished [100%]: Average ELBO = -157.32100%|██████████| 2000/2000 [00:05<00:00, 391.03it/s]

#pm.traceplot(glm_trace);pm.plot_posterior(glm_trace);

这里写图片描述

分析模型

贝叶斯推断不仅给出了最优的拟合直线，而且给出了待估计参数的完整后验信息。

左侧给出了三个随机变量的边缘后验：x轴表示可能的值，y轴表示取值对应的可能性。右侧给出了三个随机变量的采样值，可以看出三个随机变量都良好的收敛并稳定（没有出现跳变和奇怪的模式）。

其次，随机变量的最大后验估计很接近与真实值（参数模拟数据是给出的值）。

采用均值后验对参数进行估计。

est_intercept=naive_trace['Intercept'].mean();est_slope=naive_trace['Slope'].mean();est_regression_line = est_intercept + est_slope * x;

??pm.glm.plot_posterior_predictive

plt.figure(figsize=(7, 7))plt.plot(x, y, 'x', label='data');pm.glm.plot_posterior_predictive(glm_trace,samples=200,c='m');plt.plot(x, true_regression_line, label='true regression line', lw=3.0, c='y');plt.plot(x,est_regression_line,label = 'mean estimate regression line',c='k');plt.title('Posterior predictive regression lines');plt.legend(loc=0);plt.xlabel('x');plt.ylabel('y');

这里写图片描述

0 0