In the final article of this technical series we demonstrate hierarchical linear regression using PyMC3 to compare vehicle NOx emissions for a range of car manufacturers.

In the previous blogpost I spent some time demonstrating the use of PyMC3 for regularized linear regression and model evaluation using criteria such as the DIC and posterior predictive checks.

In this post I'll look into using hierarchical linear models, demonstrating how the pure python PyMC3 syntax makes all this quite straightforward. I'll use the same UK VCA(Vehicle Type Approval) Car Fuel and Emissions Information dataset as last time, and for the sake of direction, will look for hints of unusual NOx emissions from Volkswagen cars.

As I noted last time, Bayesian inference is a huge topic and this short series of three blogposts is not the best medium for conveying the full complexity. Please see the reading list on my previous post for more details, and if you're in London and interested in this stuff, I highly recommend you drop into the Bayesian Mixer meetup group for more discussions, demos and general statistics chat.


Hierarchical Linear Regression

Hierarchical aka. mixed-effect a.k.a partial-pooled models let us more accurately reflect the real-world where a dataset may contain nested features.

In our case, the dataset contains two obvious nestings:

1. Each car model is made by a manufacturer.

There are 38 manufacturers, and 2593 models: each manufacturer is responsible for between 1 and 539 unique car models, median of 27. If we were to include car models in the linear modelling, then we are likely well-served to partially-pool the models according to their manufacturer.

2. Each car manufacturer has a parent-company or owner

Those 38 manufacturers could perhaps better be called 'brands', since there are 20 top-level holding companies owning them. We might also be well-served to reflect this real-world ownership structure in the modelling.

For simplicity, I won't include the car models, but I will demonstrate the effects of partial-pooling the manufacturers and owners in the following Notebook:



Summary

The Notebook demonstrates the effects of pooling parameter values and the possibilities for using a (fairly simple) linear regression to infer differences between car manufacturers. Obviously this technique can be applied to other cases where there are 'classes' of items which share common features.

We evaluated several different types of pooling and found that there are material differences in NOx emissions between different car manufacturers and their parent owners.

It would appear that in our dataset, Volkswagen as a mfr (and moreso as amfr_owner) does have a higher and narrower influence on NOx emissions than the average, and would be worthy of more investigation. However, this exercise was really aimed at demonstrating Bayesian inferential techniques and PyMC3 in particular, and I won't say anything libellous about Volkswagen!

Other thoughts

In this three part mini-series we covered: linear regression, regularization, model evaluation and hierarchical modelling, and still only scratched the surface of what's possible! There's a huge amount to cover in Bayesian inference and through I wanted to jump to the end results, it's important (and I think very interesting) to cover various intricacies along the way: including model specification, regularization, evaluation, pooling and visual and testable interpretation.

In future I'll likely return to this subject and cover generalised linear modelling, more detailed model evaluation inc cross-validation and counting-based problems. I may also compare PyMC3 with PyStan. PyMC3 is a capable and flexible package1 - in particular the model specification has a really intuitive syntax.

A few readers asked for the Notebooks in non-rendered executable form, so I'll post this series in a more concise form on a public github site in a few weeks so you can more easily use the code.


Finally

I will use some of this material during my forthcoming talk at the PyData London 2016 conference at the start of May, titled "Hierarchical Bayesian Modelling with PyMC3 and PySTAN". It's a great conference for Pythonistas and data science practitioners: and I highly recommend it to readers of this blog.