Bayesian Inference02
来源:互联网 发布:python os.exit 编辑:程序博客网 时间:2024/05/16 17:14
THE SAMPLER
Bayesian Inference with PyMC3 - Part 2
In the second article of this technical series we demonstrate the flexible syntax of PyMC3 with regularized linear modelling of car emissions data and model evaluation.
In the previous blogpost I described where I see Bayesian inference fitting into the data scientist's toolkit, and described some of the theory with a worked example of linear regression on a dummy dataset.
In this post I'll focus on a little more theory and few more features of PyMC3 which allow for feature selection and model validation. In the final post I'll look into hierarchical linear models.
Of course, Bayesian inference is a huge topic and this short series of three blogposts is hardly the best medium for conveying the full complexity: I won't cover more than a few things here. However they will hopefully serve as a reference with links to help you the reader go off and do your own research.
Reading List
A full reading list for Bayesian inference is worthy of an article by itself, and in fact that's exactly what Alex Etz et al. have very recently published (online and in print) in a comprehensive overview of 36 papers and books. Looks like a good place to start.
Curiously they don't mention several books in the canon, I'll note some here:
- The Elements of Statistical Learning, Hastie et al. A great primer, available for free download
- Doing Bayesian Data Analysis, John Kruschke. Future classic, very applied and practical, more info online, see also this notebook containing PyMC3 ports of all the code examples
- Data Analysis using Regression and Multilevel / Hierarchical Models, Gelman & Hill. This one is quite applied and practical, more info online
- Bayesian Data Analysis, Gelman et al. This is very comprehensive with a steep learning curve, more info online
- Information Theory, Inference, and Learning Algorithms, David MacKay. Getting a little old now, but it covers a lot of ground and is very well written, available for free download
For PyMC3-related material, you might want to read:
- Blogs from contributors and maintainers inc. John Salvatier, Chris Fonnesbeck,Thomas Wiecki, Jake van der Plas, Austin Rochford, Andrew Clegg, Peadar Coyle
- The 'Getting Started' section of the PyMC3 online documentation contains links to several guides and a variety of worked examples
- Written for PyMC2, but very useful: Bayesian Methods for Hackers online book by Cameron Davidson-Pilon.
If I've missed anything above, let me know and I'll add to the list.1
Let's find an interesting dataset
In a throwback to a previous technical series on survival analysis I'm looking for data again.
We're currently quite interested in road traffic and vehicle insurance, so I've dug into the UK VCA (Vehicle Type Approval) to find their Car Fuel and Emissions Informationfor August 2015. The raw dataset is available for direct download and is small but varied enough for our use here.
I will investigate the car emissions data from the point-of-view of the Volkswagen Emissions Scandal which seems to have meaningfully damaged their sales. Perhaps we can find unusual results in the emissions data for Volkswagen.
Regression, Feature Selection and Model Evaluation
For brevity, I'll skip the Data Engineering steps (Acquisition, Cleaning, Sampling, Feature & Item Engineering, Quality Control) and launch right into...
- Data exploration and creating an OLS Linear Regression
- Basic feature selection using Lasso Regression - a good excuse to demonstrate the ease of changing model specifications
- Further discuss constraints and demonstrate a Ridge Regression
- Basic model evaluation and comparison using the built-in Deviance Information Criterion (DIC) and Posterior Predictive Checks
Summary & Next Steps
That was rather a long Notebook! I set out to determine if there's anything strange about Volkswagen's NOx emissions results, and I'll freely admit to getting somewhat distracted by demonstrating some of the flexibility of PyMC3 and methods for evaluating results.
I covered quite a lot of ground:
- Quick data exploration, inc 1d, 2d and Nd distribution plotting
- OLS Linear Regression, comparing Frequentist and Bayesian models
- Lasso L1 Regularized Linear Regression, comparing Frequentist and Bayesian models
- Ridge L2 Regularized Linear Regression, comparing Frequentist and Bayesian models
- Naive Model Comparison, inc DIC and Posterior Predictive Checks (MSE, R2 and KS-test)
I don't think we're at all ready to pass judgement on whether Volkswagen's emissions data looks odd, but there are two interesting clues in the Notebook:
- The first is that the feature-value
mfr_owner_is_vw[T.True]
was selected by the Lasso model (both Frequentist and Bayesian versions) indicating that it has a non-negligible effect - In the Ridge model (both versions) this feature-value
mfr_owner_is_vw[T.True]
coefficient receives a value quite far above zero, and indeed the vast majority (>>95%) of its distribution lies above zero.
In the next blogpost and Notebook, I'll demonstrate using hierarchical a.k.a. mixed a.k.a partial-pooled models to directly compare the NOx emissions amongst all the manufacturers and the parent companies too.
Finally
In the previous blogpost I mentioned a new meetup group in London called Bayesian Mixer which I'm helping to organise alongside Markus Gesmann. On Friday 12th Feb we hosted our first meeting upstairs at The Artillery Arms near Old St with 2 short talks from Markus and myself2, and great conversation amongst the ~20 attendees about tools, techniques, interesting Bayesian statistics and just plain old socialising. Markus wrote up more on his blog here.
It was such a success that we've decided to create a community here on meetup.comto help us organise and publicise the next events: the next is on Friday 15th April in Central London. Please feel free to join up and come along for open discussion of Bayesian statistics in a presentation & social format on a semi-regular basis. We hope the group will be a useful place for academics, practitioners and learners alike. We're particularly interested to hear stories about real-world projects, worked examples, and new or unconventional techniques
- Bayesian Inference02
- Bayesian Network
- Bayesian Networks
- Bayesian models
- Bayesian Filtering
- bayesian classifier
- Bayesian举例
- Bayesian Inference
- Naive Bayesian
- Bayesian Inference01
- Bayesian statistics
- Bayesian statistics
- Mahout Bayesian
- mahout bayesian
- Bayesian formula
- Bayesian Network
- [Bayesian] “我是bayesian我怕谁”系列
- [Bayesian] “我是bayesian我怕谁”系列
- 剑指offer-0x01
- android 入门 二
- [BZOJ3211]花神游历各国
- 242. Valid Anagram | Java最短代码实现
- [BZOJ3038]上帝造题的七分钟2
- Bayesian Inference02
- 启用window Server 2008 Aero主题
- 引用作为函数返回值
- css判断不同分辨率显示不同宽度布局实现自适应宽度
- VAGRANT初体验.md
- linux vi命令的使用方法
- mysql不能用localhost链接,只能用127.0.0.1链接的解决方法
- jsp或servlet中删除Cookie
- 五大常用算法——分治法,动态规划,回溯法,分支界限法,贪心算法