Bayesian Inference01
来源:互联网 发布:python os.exit 编辑:程序博客网 时间:2024/05/17 04:58
Bayesian Inference with PyMC3 - Part 1
Bayesian inference bridges the gap between white-box model introspection and black-box predictive performance. This technical series describes some methods using PyMC3, an inferential framework in Python.
Preamble
I find it useful to think of practical data science as spanning a continuum between traditional statistics and machine learning.
To illustrate:
We might use tools from traditional statistics when creating a time-to-event model because it's useful to know exactly how a particular attribute affects survival. e.g. "How long can I expect a hardrive from Manufacturer A to last compared to B?"
We might use tools from machine learning when clustering data because we want to make unbiased predictions and let the data speak for itself. e.g. "Are these datapoints somehow similar and can I use that to predict future data?"
The strength of the 'Statisticians approach' is the ability to learn about the effects of different inputs upon the model & results - defining behavioural rules for later manipulation.
The strength of the 'Informaticians approach' is to make fewer assumptions about the exact form of the model and let the data do the talking - making predictions about future data.
The best of both worlds: Bayesian Inference
Somewhere towards the middle of the continuum lies Bayesian Inference.
In a nutshell, this is the application of Bayesian probability theory to create powerful white-box models which require quite heavy black-box computation. The benefits are model introspection, high-quality predictions, hypothesis testing and model evaluation.
Background
We've discussed Bayesian analysis in a previous series of articles, but to recap:
We use Bayes Theorem in a principled way, to relate the posterior probability
To belabour the point, this is a very natural way of setting and evaluating expectations in the real world: we assume an initial hypothesis (the coffee looks hot), gather data (the coffee is actually cold), and update our beliefs (coffee that looks hot might actually be cold). It is different from Frequentist statistics in many ways, the largest of which is that Bayesian statistics allows for prior knowledge, and it also makes predictions based on the data available, rather than extrapolating beliefs to infinite data collection1.
Great, but how do we actually use this for modelling?
What do we want from a model? A function that maps input data to create an approximation of some other data.
Lets define:
X X is a vector ofn n input datapointsx1,x2...xn x1,x2...xn which may be each described by a single value, or several features (dimensiond d)y y is a vector ofn n values to be predicted, one per datapoint inX Xθ θ is a vector of parameters sized d, which define the distribution of each datapointx x i.ex∼p(x|θ) x∼p(x|θ)
Then we declare a model specification wherein:
p(θ) p(θ) is the prior distribution or belief ofθ θ before any data is observedp(X|θ) p(X|θ) is the likelihood of the observed data given the parametersθ θp(X) p(X) is the marginal likelihood which we ignore as above- The posterior distribution of the parameters
p(θ|X) p(θ|X) is:
Now in order to learn the correct values (or more properly, the distributions) for
Unfortunately, calculating the integral over all values of
However this isn't very generalisable, and multidimensional integration is hard. This is about as far as we can get without some method for estimating the parameter values.
MCMC Sampling
In the early nineties, an efficient technique for efficiently estimating the parameter values - Markov-Chain Monte-Carlo (MCMC) sampling - was brought into the mainstream by researchers including Alan Gelfand2 and Adrian Smith.
The general principle is to:
- assume a functional form (probability distribution) for parameters
θ θ according to hyperparametersα α - 'sample' this joint probability distribution to get a vector of single values for
θ θacrossd d - compute the posterior predictions
y^ y^ for observed datapoints and compare to the ground truthy y using a loss function - update the joint
θ θ distributions according to the new density of samples - repeat the sampling and evaluation many more times (typically 100x to 10,000x), seeking to take progressively better samples of
θ θ according to minimising the loss function, and arriving at convergence where any new samples don't improve the loss - the MCMC sampling takes the form of a Markov-Chain (the position of step
n+1 n+1 is dependent only upon the position of stepn n, and otherwise independent of all other steps) which moves around the joint distribution in a semi-random manner, the distance and direction decided according to specific rules of the sampling method, including randomness (the Monte-Carlo aspect), gradient-seeking and momentum.
This approach requires a lot of computational power, and some clever maths to ensure the sampling converges and the loss function is minimised well. There's apotted history of the development of MCMC which is great reading, and mentions some of the first software to be widely available from 1991: BUGS (Bayesian inference Using Gibbs Sampling).
This blog series focusses on a fairly new software library to perform MCMC in pure Python: PyMC3. Gladly, it is open source and under active development from a set of well-qualified, dedicated volunteers.
PyMC3 is an iteration upon the prior PyMC2, and comprises a comprehensive package of symbolic statistical modelling syntax and very efficient gradient-based samplers using theTheano library (of deep-learning fame) for gradient computation. Of particular interest is that it includes the Non U-Turn Sampler (NUTS) developed recently byHoffman & Gelman in 2014, which is only otherwise available in STAN.
STAN is a major3 open-source framework for Bayesian inference developed by Gelman, Carpenter, Hoffman and many others. STAN also has HMC and NUTS samplers, and recently, Variational Inference - which is a very efficient way to approximate the joint probability distribution. Models are specified in a custom syntax and compiled to C++ but it has has interfaces to Python, R etc. to make life easier. Since Mick Cooney is a fan, we'll almost certainly discuss it in future blogposts.
Ordinary Least Squares Regression with PyMC3
Finally we're ready to demonstrate Bayesian Inference using PyMC3. This is a potentially huge topic, and I'll leave plenty for future posts. For now, let's start really simple with a tiny dataset and the intention to find a linear model that explains it.
Theory of a basic Linear Regression Model:
STATED IN FREQUENTIST SYNTAX:
... where for datapoint
For ordinary least squares (OLS) regression, we can quite conveniently solve this (find the
For non-OLS regression, finding
STATED SLIGHTLY DIFFERENTLY IN A BAYESIAN SYNTAX:
... where for datapoint
This syntax4 indicates the concept that the realised datapoints are just samples from a distribution of many possible values, itself described by a linear model with variance.
In order to solve this model formulation, we can express the likelihood as the probability of
and maximise the overall likelihood
Usually one would maximise the log-likelihood
Enough maths for now! Lets create a toy dataset and fit it using firstly a Frequentist OLS model, and then a Bayesian OLS model.
Create a very simple dataset and plot it:
I'll use the usual suspects: numpy
and pandas
to create a dataset, and the ever-excellent seaborn
library for plotting.
import numpy as np import pandas as pd import seaborn as sns sns.set(style="darkgrid", palette="muted") rndst = np.random.RandomState(0)def generate_data(n=20, a=1, b=1, c=0, latent_error_y=10): ''' Create a toy dataset based on a very simple linear model that we might imagine is a noisy physical process Model form: y ~ a + bx + cx^2 + e ''' ## create linear or quadratic model df = pd.DataFrame({'x':rndst.choice(np.arange(100),n,replace=False)}) df['y'] = a + b*(df['x']) + c*(df['x'])**2 ## add latent error noise df['y'] += rndst.normal(0,latent_error_y,n) return dfdf = generate_data(a=5, b=2, latent_error_y=30)g = sns.lmplot(x='x', y='y', data=df, fit_reg=False ,size=6, scatter_kws={'alpha':0.8, 's':60})## NOTE: `lmplot()` will fit and plot a lin. reg. line by default. ## Not used here, but can greatly help data exploration in practice.
Observe
- We have 20 datapoints described by a random value in x and a linear relation to yplus some Gaussian noise:
- The parameters
β β of our data are[5,2] [5,2] - The variance
σ2 σ2 of the noise in the data is30 30
- The parameters
- The true 'clean' or non-noisy value of y is overplotted in green
Create and fit a Frequentist OLS model
I'll use the Python statsmodels
library which has nice output similar to R's lm
andglm
functions. Also using patsy
for symbolic model specification.
import patsy as pt import statsmodels.api as sm## first, encode model specification as design matricesfml = 'y ~ 1 + x' (mx_en, mx_ex) = pt.dmatrices(fml, df, return_type='dataframe' ,NA_action='raise')## fit OLS model and print resultssmfit = sm.OLS(endog=mx_en, exog=mx_ex, hasconst=True).fit() print(smfit.summary())
OLS Regression Results ==============================================================================Dep. Variable: y R-squared: 0.864 Model: OLS Adj. R-squared: 0.857 Method: Least Squares F-statistic: 114.6 Date: Tue, 09 Feb 2016 Prob (F-statistic): 3.10e-09 Time: 13:16:59 Log-Likelihood: -92.440 No. Observations: 20 AIC: 188.9 Df Residuals: 18 BIC: 190.9 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [95.0% Conf. Int.]------------------------------------------------------------------------------Intercept -2.0366 11.614 -0.175 0.863 -26.437 22.364 x 1.9984 0.187 10.705 0.000 1.606 2.391 ==============================================================================Omnibus: 0.384 Durbin-Watson: 3.042 Prob(Omnibus): 0.825 Jarque-Bera (JB): 0.520 Skew: 0.125 Prob(JB): 0.771 Kurtosis: 2.251 Cond. No. 125. ==============================================================================
Lets evaluate the model
1. MODEL PARAMETERS
The statsmodels
OLS class computes a menagerie of test statistics, but we're most interested in the estimates for our
- Intercept:
μ=−2.04,σ=11.61 μ=−2.04,σ=11.61 - x:
μ=2.00,σ=0.19 μ=2.00,σ=0.19
The Frequentist OLS has made a fairly good estimate of x, but the Intercept is quite imprecise under a linear model.
2. MODEL PREDICTION
Lets plot the posterior prediction:
## NOTE: I'll use `seaborn`'s `lmplot` again. ## This will actually run its own regression, but it uses statsmodels OLS, ## so while strictly speaking this won't plot exactly the same prediction ## as the results above, it will be more than close enough for our purposes.g = sns.lmplot(x='x', y='y', data=df, fit_reg=True ,size=6, scatter_kws={'alpha':0.8, 's':60})
Create and fit a Bayesian OLS model
Here, finally we'll use pymc3
to define a model specification and take samples of
import pymc3 as pm from scipy.optimize import fmin_powellwith pm.Model() as mdl_ols: ## Use GLM submodule for simplified patsy-like model spec ## Use Normal likelihood (which uses HalfCauchy for error prior) pm.glm.glm('y ~ 1 + x', df, family=pm.glm.families.Normal()) ## find MAP using Powell optimization start_MAP = pm.find_MAP(fmin=fmin_powell, disp=True) ## take samples using NUTS trc_ols = pm.sample(2000, start=start_MAP, step=pm.NUTS())
Applied log-transform to sd and added transformed sd_log to model. Optimization terminated successfully. Current function value: 126.077720 Iterations: 6 Function evaluations: 221 [-----------------100%-----------------] 2000 of 2000 complete in 3.4 sec
There's a lot of detail in the above:
- We created a
PyMC3
model usingpatsy
model specification syntax for convenience. In the background this created a model and a set of 'free' and 'observed' variables in atheano
compute graph. Compared to STAN, the model specification is written in pure Python, and no external C++ code is compiled. - We used an optimizer from
scipy
to find the Max A-Posteriori (MAP) estimate of the joint probability distribution, and thus a good starting point for the sampler. - We then used the NUTS sampler to take 2000 samples from the joint probability distribution of
β β and converge via evaluating and iteratively minimising the loss on the posterior predictive distributiony^ y^ with respect to the true valuesy y.
The results of the above are in the form of a 'trace' for each
Lets view the traces:
ax = pm.traceplot(trc_ols[-1000:], figsize=(12,len(trc_ols.varnames)*1.5), lines={k: v['mean'] for k, v in pm.df_summary(trc_ols[-1000:]).iterrows()})
We can use these traces in two very powerful ways:
The traces form a marginal distribution on each parameter (shown in the left-hand facets), letting us declare distributional statistics directly from the data with no assumptions about their functional form. We can quote the mean, variance, credible regions (CR) etc, and interrogate the full distribution to help determine & better understand unusual behaviours of the parameter values: skew, kurtosis, multi-modalities etc.
The raw trace values (shown in the right-hand facets) can be used to create 1000posterior predictions of
y y for any value ofx x, again without any assumptions about the functional form, and again, we can quote any distributional statistics we like from that distribution. The traces also offer a diagnostic check on the convergence (or lack thereof) of the model.
1. MODEL PARAMETERS
Lets take a look at the parameter values, using a convenience function:
print(pm.df_summary(trc_ols[-1000:]))
mean sd mc_error hpd_5 hpd_95Intercept -1.367470 13.444980 0.658985 -25.138135 26.942143 x 1.989229 0.215509 0.010461 1.572763 2.432014 sd_log 3.305677 0.186010 0.012743 2.975143 3.683130 sd 27.758150 5.457890 0.374594 19.563791 39.734219
We're most interested in the estimates for our
- Intercept:
μ=−1.37,σ=13.44 μ=−1.37,σ=13.44 - x:
μ=1.99,σ=0.22 μ=1.99,σ=0.22 - sd (the Gaussian noise):
μ=27.76,σ=5.46 μ=27.76,σ=5.46
The Bayesian OLS has made very similar estimates of the parameter values as the Frequentist OLS. This is good to see, since both linear models are actually very similar.
Note we now also have an estimate of the Gaussian noise parameter, which is quite close to the true value of
We also have real intervals a.k.a credible regions (CR) on the estimates of the parameter values. In the Frequentist OLS model the 'confidence intervals' are created by fitting a Normal distribution over the point-estimate MLE values of each parameter. In the Bayesian inferential method, we can simply use the distribution of samples (after convergence) to learn the uncertainty in the parameter values.
2. MODEL POSTERIOR PREDICTION
import matplotlib.pyplot as pltdef plot_posterior_cr(mdl, trc, rawdata, xlims, npoints=1000): ''' Convenience fn: plot the posterior predictions from mdl given trcs ''' ## extract traces trc_mu = pm.trace_to_dataframe(trc)[['Intercept','x']] trc_sd = pm.trace_to_dataframe(trc)['sd'] ## recreate the likelihood x = np.linspace(xlims[0], xlims[1], npoints).reshape((npoints,1)) X = x ** np.ones((npoints,2)) * np.arange(2) like_mu = np.dot(X,trc_mu.T) like_sd = np.tile(trc_sd.T,(npoints,1)) like = np.random.normal(like_mu, like_sd) ## Calculate credible regions and plot over the datapoints dfp = pd.DataFrame(np.percentile(like,[2.5, 25, 50, 75, 97.5], axis=1).T ,columns=['025','250','500','750','975']) dfp['x'] = x pal = sns.color_palette('Purples') f, ax1d = plt.subplots(1,1, figsize=(7,7)) ax1d.fill_between(dfp['x'], dfp['025'], dfp['975'], alpha=0.5 ,color=pal[1], label='CR 95%') ax1d.fill_between(dfp['x'], dfp['250'], dfp['750'], alpha=0.4 ,color=pal[4], label='CR 50%') ax1d.plot(dfp['x'], dfp['500'], alpha=0.5, color=pal[5], label='Median') _ = plt.legend() _ = ax1d.set_xlim(xlims) _ = sns.regplot(x='x', y='y', data=rawdata, fit_reg=False ,scatter_kws={'alpha':0.8,'s':80, 'lw':2,'edgecolor':'w'}, ax=ax1d)xlims = (df['x'].min() - np.ptp(df['x'])/10 ,df['x'].max() + np.ptp(df['x'])/10)plot_posterior_cr(mdl_ols, trc_ols, df, xlims)
In the above code, we've used the parameter trace values to generate new predictions
Next steps
This was something of a whirlwind tour, but we've only just scratched the surface. My intention has been to start slowly from the basics, and demonstrate the differences and advantages of Bayesian inference vs Frequentist statistics - at least for very simple OLS Linear Regression.
In the next blogpost I'll use a real dataset and demonstrate the flexibility of Bayesian inference - and PyMC3 in particular - regarding model specification and validation.
Finally
I'd like to mention a new meetup group in London that I'm helping to organise alongside Markus Gesmann: it's called Bayesian Mixer, and will be dedicated to open discussion of Bayesian statistics in a normal presentation & social format on a semi-regular basis.
The first meeting is the evening of Friday 12 Feb in Old Street, London and you can read much more (and sign up) here on Markus' blog.
Jake Vanderplas has an excellent blog series on Bayesian vs Frequentist statistics. Essential reading. ↩
With whom we've had the pleasure of working, albeit briefly. Hello if you're reading! This appears to be the paper that kickstarted it all. ↩
You could say STAN is the gold-standard at present, although there's several frameworks available (JAGS, STAN, PyMC2/3) and specific samplers (emcee) and they all have various tradeoffs. I really like the ease-of-use of PyMC3, and there's a handful of comparisons available online to give you a background. ↩
Hat-tip to Thomas Wiecki for his blogposts on this very subject, from which I've borrowed one or two things.↩
- Bayesian Inference01
- Bayesian Network
- Bayesian Networks
- Bayesian models
- Bayesian Filtering
- bayesian classifier
- Bayesian举例
- Bayesian Inference
- Naive Bayesian
- Bayesian Inference02
- Bayesian statistics
- Bayesian statistics
- Mahout Bayesian
- mahout bayesian
- Bayesian formula
- Bayesian Network
- [Bayesian] “我是bayesian我怕谁”系列
- [Bayesian] “我是bayesian我怕谁”系列
- [BZOJ3196]Tyvj 1730 二逼平衡树
- 序列号和设备标识UDID码手机直接获取方式
- 机器学习:贝叶斯总结_1:概述
- NGINX学习笔记——传递请求头
- OpenGL实践3之第一个着色器程序2
- Bayesian Inference01
- 剑指offer-0x01
- android 入门 二
- [BZOJ3211]花神游历各国
- 242. Valid Anagram | Java最短代码实现
- [BZOJ3038]上帝造题的七分钟2
- Bayesian Inference02
- 启用window Server 2008 Aero主题
- 引用作为函数返回值