THE SAMPLER
Probabilistic Graphical Models for Fraud Detection - Part 3
We finish our series on Bayesian networks by discussing conditional probability, more complex models, missing data and other real-world issues in their application to insurance modelling.
In the previous article we introduced a very simple model for medical non-disclosure, set up the network in R using the gRain package, then used it to estimate the conditional probability that a medical exam would be necessary.
We also discussed flaws of the model in its current form, observing how changes in declared health status affect the likelihood that a medical exam will discover issues impacting the underwriting decision for the policy.
In this final post, we investigate further problems with the model, focusing on improving its outputs, and look into other potential uses for the model. In particular, we will look at ways in which we could deal with missing data and investigate potential iterations and improvements.
The Flow of Conditional Probability
A useful way to aid understanding of Bayesian networks is to consider the flow of conditional probability. The idea is that as we assert evidence - setting nodes on the graph to specific values - we affect the conditional probability of nodes elsewhere on the network.
As always, this is easiest done by visualising the network, so I will show it again:
Readers may wish to refer to the previous article for explanations of the nodes.
So how does this work? We have built a model, so we play with the network and try to understand the behaviour we observe.
In the previous article we took a similar approach, but focused entirely on how such a toy model could be used in practice: setting values for the declared variables of each condition and observing the effect on the conditional probability for M, the medical exam node.
This time we take a more holistic approach, observing the effect of evidence on all nodes on the network. For brevity here we will focus on just a handful of examples.
We start with the smoking condition, in particular the interplay between the true status TS, the declared status DS and the honesty node HN:
hn <- cptable(~HN
             ,values = c(0.01, 0.99)
             ,levels = c("Dishonest", "Honest"));

ts <- cptable(~TS
             ,values = c(0.60, 0.20, 0.20)
             ,levels = c("Nonsmoker", "Quitter", "Smoker"));

ds <- cptable(~DS | HN + TS
             ,values = c(1.00, 0.00, 0.00   # (HN = D, TS = N)
                        ,1.00, 0.00, 0.00   # (HN = H, TS = N)
                        ,0.50, 0.40, 0.10   # (HN = D, TS = Q)
                        ,0.05, 0.80, 0.15   # (HN = H, TS = Q)
                        ,0.30, 0.40, 0.30   # (HN = D, TS = S)
                        ,0.00, 0.10, 0.90   # (HN = H, TS = S)
                        )
             ,levels = c("Nonsmoker", "Quitter", "Smoker"));
For clarity, according to the above CPT definitions, if a person is dishonest and has quit smoking, the probabilities of declaring as a non-smoker, quitter or smoker are 0.50, 0.40 and 0.10 respectively.1
We run a query on the network and calculate the two marginal distributions2 for TS and DS:
> querygrain(underwriting.grain
            ,nodes = c("DS", "TS")
            ,type  = "marginal");
$TS
TS
Nonsmoker   Quitter    Smoker
      0.6       0.2       0.2

$DS
DS
Nonsmoker   Quitter    Smoker
   0.6115    0.1798    0.2087
As expected, the marginal for TS is just its prior, while the marginal for DS is close to it but not identical.
What if we have evidence that the applicant is dishonest, setting HN to Dishonest?
> querygrain(underwriting.grain
            ,nodes    = c("DS", "TS")
            ,evidence = list(HN = 'Dishonest')
            ,type     = "marginal");
$TS
TS
Nonsmoker   Quitter    Smoker
      0.6       0.2       0.2

$DS
DS
Nonsmoker   Quitter    Smoker
     0.76      0.16      0.08
This means that knowing the value of HN changes the distribution of the declared status DS, but has no effect at all on the true status TS: with no other evidence on the network, HN and TS are independent.
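This result can be checked by hand. The sketch below (in Python rather than R, purely for illustration; every number is copied from the ts and ds CPTs above) marginalises TS out of P(DS | HN = Dishonest):

```python
# Hand-check of P(DS | HN = Dishonest), summing TS out.
# All probabilities are copied from the cptable() definitions above.
p_ts = {"Nonsmoker": 0.60, "Quitter": 0.20, "Smoker": 0.20}

# Rows of the ds CPT for HN = Dishonest: P(DS | HN = D, TS)
p_ds_given_d = {
    "Nonsmoker": {"Nonsmoker": 1.00, "Quitter": 0.00, "Smoker": 0.00},
    "Quitter":   {"Nonsmoker": 0.50, "Quitter": 0.40, "Smoker": 0.10},
    "Smoker":    {"Nonsmoker": 0.30, "Quitter": 0.40, "Smoker": 0.30},
}

# P(DS = ds | HN = D) = sum over ts of P(TS = ts) * P(DS = ds | HN = D, TS = ts)
p_ds = {ds: sum(p_ts[ts] * p_ds_given_d[ts][ds] for ts in p_ts)
        for ds in ("Nonsmoker", "Quitter", "Smoker")}
print(p_ds)
```

The result is 0.76 / 0.16 / 0.08, matching the querygrain output exactly.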
What if we have evidence that the applicant is honest, setting HN to Honest?
> querygrain(underwriting.grain
            ,nodes    = c("DS", "TS")
            ,evidence = list(HN = 'Honest')
            ,type     = "marginal");
$TS
TS
Nonsmoker   Quitter    Smoker
      0.6       0.2       0.2

$DS
DS
Nonsmoker   Quitter    Smoker
     0.61      0.18      0.21
Once again, TS is unaffected, and this time DS barely moves from its unconditional marginal: since 99% of applicants are honest under the prior, learning that this one is honest adds very little information.
It is obvious from the graph that to connect HN and TS we need evidence on their common child DS, so the natural experiment is to assert evidence on both DS and TS together.
Before we do this, what is the effect of declaring as a Quitter on its own?
> querygrain(underwriting.grain
            ,nodes    = c("TS", "HN")
            ,evidence = list(DS = 'Quitter')
            ,type     = "marginal");
$HN
HN
 Dishonest     Honest
0.00889878 0.99110122

$TS
TS
Nonsmoker   Quitter    Smoker
 0.000000  0.885428  0.114572
Interesting. There is little effect on HN, but a large effect on TS: declaring as a Quitter drives the probability of truly being a Nonsmoker to zero, because no entry in the ds CPT allows a nonsmoker to declare as a quitter.
Zero probabilities are problematic. Once a zero appears, any product involving it is also zero, and this propagates through the model. A first iteration of this model might make these conditional probabilities a little less extreme, which is why smoothing to small but non-zero values is helpful.
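As an aside, such smoothing amounts to adding a small pseudo-count to every cell of a CPT before normalising. A minimal sketch in Python for illustration (the alpha value and counts here are made up, and this is only analogous in spirit to gRain's smooth argument):

```python
# Additive (Laplace) smoothing: add a small pseudo-count alpha to every
# cell before normalising, so no conditional probability is exactly zero.
def smooth_cpt(counts, alpha=0.1):
    total = sum(counts.values()) + alpha * len(counts)
    return {level: (n + alpha) / total for level, n in counts.items()}

# e.g. raw counts in which a nonsmoker never declares as a smoker...
raw = {"Nonsmoker": 600, "Quitter": 0, "Smoker": 0}
smoothed = smooth_cpt(raw)
print(smoothed)  # "Smoker" now has a small but non-zero probability
```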
What if we know that the applicant really did quit, asserting TS = Quitter alongside the Quitter declaration?
> querygrain(underwriting.grain
            ,nodes    = c("HN")
            ,evidence = list(DS = 'Quitter', TS = 'Quitter')
            ,type     = "marginal");
$HN
HN
 Dishonest     Honest
0.00502513 0.99497487
We already had the probability of dishonesty at about 0.0089 given the Quitter declaration alone; adding the knowledge that the applicant truly is a quitter lowers it further, to about 0.0050.
Put another way, a declaration that is consistent with the truth is itself mild evidence of honesty.
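Both posterior probabilities for HN can be reproduced with Bayes' rule directly from the CPT values. A Python sketch for illustration:

```python
# Checking the two HN queries by hand with Bayes' rule, using the CPT
# values defined earlier. P(DS = Quitter | HN) is obtained by summing out TS.
p_hn = {"Dishonest": 0.01, "Honest": 0.99}
p_ts = {"Nonsmoker": 0.60, "Quitter": 0.20, "Smoker": 0.20}
# P(DS = Quitter | HN, TS) entries from the ds CPT
p_q = {("Dishonest", "Nonsmoker"): 0.00, ("Honest", "Nonsmoker"): 0.00,
       ("Dishonest", "Quitter"):   0.40, ("Honest", "Quitter"):   0.80,
       ("Dishonest", "Smoker"):    0.40, ("Honest", "Smoker"):    0.10}

# P(HN = Dishonest | DS = Quitter): the likelihood sums TS out
lik = {hn: sum(p_ts[ts] * p_q[(hn, ts)] for ts in p_ts) for hn in p_hn}
post1 = (p_hn["Dishonest"] * lik["Dishonest"]
         / sum(p_hn[h] * lik[h] for h in p_hn))

# P(HN = Dishonest | DS = Quitter, TS = Quitter): TS is fixed,
# so use the CPT entry directly
num = p_hn["Dishonest"] * p_q[("Dishonest", "Quitter")]
post2 = num / (num + p_hn["Honest"] * p_q[("Honest", "Quitter")])
print(post1, post2)  # roughly 0.0089 and 0.0050, matching querygrain
```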
Physical analogies are helpful. Imagine that gravity works in the direction of the arrows, evidence acting as a block in the pipe at that node. To consider the effect of a single node on other nodes, imagine water being poured into the network at that node. The water (the conditional probability) flows down the pipes as the arrows dictate. However, if the node is blocked by evidence, the water will start to fill up against the direction of the arrow.
Similarly, if the water approaches a node from 'below' (that is, against the arrow direction) evidence will also block it from flowing further upwards. We will show this shortly.
In our previous example, with no evidence on the network, the effect of evidence at HN flowed down the arrows to the declared variables, but could not flow up against the arrow into TS. Once we asserted evidence at DS, that block allowed conditional probability to back up from DS to both HN and TS.
Creating Data from the Network
The most tedious aspect of creating a Bayesian network is constructing all the CPTs, with plenty of potential for error. In many cases this is unavoidable, but when we have a sufficient amount of complete data there is an alternative: gRain provides functionality to use data alongside a graph specification to construct the Bayesian network automatically.3
This also works in reverse: we can use a Bayesian network to generate data. The simulate() method works here:
> underwriting.data.dt <- simulate(underwriting.grain, nsim = 100000);
> setDT(underwriting.data.dt);
> print(underwriting.data.dt);
            HN        TS         TB   TH        DS         DB           DH         SS         SB         SH         M
     1: Honest    Smoker       None None    Smoker     Normal         None    Serious NotSerious NotSerious NoMedical
     2: Honest Nonsmoker       None None Nonsmoker     Normal         None NotSerious NotSerious NotSerious NoMedical
     3: Honest Nonsmoker      Obese None Nonsmoker      Obese         None    Serious NotSerious NotSerious   Medical
     4: Honest   Quitter       None None   Quitter     Normal         None NotSerious NotSerious NotSerious NoMedical
     5: Honest Nonsmoker       None None Nonsmoker     Normal         None NotSerious NotSerious NotSerious NoMedical
    ---
 99996: Honest   Quitter Overweight None Nonsmoker Overweight         None NotSerious NotSerious NotSerious NoMedical
 99997: Honest   Quitter       None None   Quitter     Normal         None NotSerious NotSerious NotSerious NoMedical
 99998: Honest Nonsmoker       None None Nonsmoker     Normal         None NotSerious NotSerious NotSerious NoMedical
 99999: Honest    Smoker       None None   Quitter Overweight HeartDisease NotSerious NotSerious NotSerious   Medical
100000: Honest Nonsmoker       None None Nonsmoker     Normal HeartDisease NotSerious NotSerious NotSerious NoMedical
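One simple way to implement this kind of simulation is ancestral sampling: visit the nodes in topological order and sample each from its CPT given the already-sampled parents. (I have not checked whether this is exactly what simulate() does internally.) A Python sketch for just the HN/TS/DS fragment of the network, using the CPT values defined earlier:

```python
import random

random.seed(42)

def draw(dist):
    """Sample one level from a {level: probability} dict."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

p_hn = {"Dishonest": 0.01, "Honest": 0.99}
p_ts = {"Nonsmoker": 0.60, "Quitter": 0.20, "Smoker": 0.20}
p_ds = {  # P(DS | HN, TS), rows copied from the ds CPT
    ("Dishonest", "Nonsmoker"): {"Nonsmoker": 1.00, "Quitter": 0.00, "Smoker": 0.00},
    ("Honest",    "Nonsmoker"): {"Nonsmoker": 1.00, "Quitter": 0.00, "Smoker": 0.00},
    ("Dishonest", "Quitter"):   {"Nonsmoker": 0.50, "Quitter": 0.40, "Smoker": 0.10},
    ("Honest",    "Quitter"):   {"Nonsmoker": 0.05, "Quitter": 0.80, "Smoker": 0.15},
    ("Dishonest", "Smoker"):    {"Nonsmoker": 0.30, "Quitter": 0.40, "Smoker": 0.30},
    ("Honest",    "Smoker"):    {"Nonsmoker": 0.00, "Quitter": 0.10, "Smoker": 0.90},
}

# Ancestral sampling: parents first (HN, TS), then the child (DS)
samples = []
for _ in range(100_000):
    hn, ts = draw(p_hn), draw(p_ts)
    samples.append((hn, ts, draw(p_ds[(hn, ts)])))

# The empirical DS marginal should be close to the exact one
# (0.6115, 0.1798, 0.2087 from the earlier querygrain call)
freq_ds = {lvl: sum(s[2] == lvl for s in samples) / len(samples)
           for lvl in ("Nonsmoker", "Quitter", "Smoker")}
print(freq_ds)
```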
With this data, we can go the other way: specify the network structure and use the data to recreate the network. The DAG is specified using a list. Each entry in the list specifies a node in the network; if the node depends on others, the entry is a character vector with the parent nodes following the defined node.
We could specify the DAG for the underwriting network as follows:
underwriting.dag <- dag(list(
     "HN"
    ,"TS"
    ,c("DS", "HN", "TS")
    ,"TB"
    ,c("DB", "HN", "TB")
    ,"TH"
    ,c("DH", "HN", "TH")
    ,c("SS", "DS", "TS")
    ,c("SB", "DB", "TB")
    ,c("SH", "DH", "TH")
    ,c("M", "SS", "SB", "SH")
));
To create a network using the DAG and the data, we do the following:
underwriting.sim.grain <- grain(underwriting.dag
                               ,data   = underwriting.data.dt
                               ,smooth = 0.1
                               );
The unconditional output for M from the original network and the recreated one match closely:
> print(querygrain(underwriting.grain, nodes = c("M"))$M);
M
  Medical NoMedical
  0.17793   0.82207

> print(querygrain(underwriting.sim.grain, nodes = c("M"))$M);
M
 Medical NoMedical
0.176546  0.823454
So far, so good. What about applications with a clean bill of health?
> print(querygrain(underwriting.grain
                  ,nodes    = 'M'
                  ,evidence = list(DS = 'Nonsmoker'
                                  ,DB = 'Normal'
                                  ,DH = 'None'))$M);
M
  Medical NoMedical
  0.14649   0.85351

> print(querygrain(underwriting.sim.grain
                  ,nodes    = 'M'
                  ,evidence = list(DS = 'Nonsmoker'
                                  ,DB = 'Normal'
                                  ,DH = 'None'))$M);
M
 Medical NoMedical
0.145173  0.854827
There are slight differences between the calculations, which is as expected. Most of this is likely due to sample noise, but the smoothing of zero probabilities will also have a small effect.
We could use bootstrapping techniques to estimate this variability and gauge the effect of sampling error on our probabilities. This is straightforward to implement in R, but I will leave this be for now.4
Missing Data
One major issue often encountered in this situation is missing data.
Consider our non-disclosure model. The majority of policy applications are not referred for medical exams, and so for most applications we never discover the true values of the health variables.
We could just reduce our dataset to complete cases, but this means removing a lot of data, and potentially biasing results: there is no guarantee incomplete data has the same statistical properties as the complete data.
In this situation, how can we construct the network?
My preferred method is to mirror the structure of the Bayesian network: use complete data for each node to calculate its CPT, and build the network node by node. It is time-consuming and tedious for large networks, but it does make full use of the data.
The benefit of this approach is that it helps maximise the use of your data: you take subsets of the variables in the data, then use the complete cases within each subset to calculate the CPTs.
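A minimal sketch of this idea in Python, with made-up records: to estimate the CPT for a node, keep only the rows where that node and its parents are all observed, then count and normalise. Rows missing unrelated variables still contribute to other nodes' CPTs.

```python
from collections import Counter, defaultdict

def estimate_cpt(rows, node, parents):
    """Estimate P(node | parents) from rows, skipping only the rows
    that are incomplete for THIS node and its parents."""
    counts = defaultdict(Counter)
    for row in rows:
        if row.get(node) is None or any(row.get(p) is None for p in parents):
            continue  # incomplete for this particular CPT -- skip just here
        counts[tuple(row[p] for p in parents)][row[node]] += 1
    return {cfg: {lvl: n / sum(c.values()) for lvl, n in c.items()}
            for cfg, c in counts.items()}

# Hypothetical records; None marks a missing value
rows = [
    {"HN": "Honest", "TS": "Smoker", "DS": "Smoker"},
    {"HN": "Honest", "TS": "Smoker", "DS": "Quitter"},
    {"HN": "Honest", "TS": None,     "DS": "Smoker"},   # TS unobserved: skipped
    {"HN": "Honest", "TS": "Smoker", "DS": "Smoker"},
]
cpt = estimate_cpt(rows, "DS", ["HN", "TS"])
print(cpt)
```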
Alternatively, we can use more traditional missing data techniques such as imputation to complete the missing data, and then construct the Bayesian network as before. Such imputation has another level of complexity.
Finally, there is also the area of semi-supervised learning: techniques designed to help in situations with very small amounts of labelled data in otherwise unlabelled datasets. The idea is to make the best use of the small amount of labelled data to either transduct the missing labels (unsupervised learning / pattern recognition), or induct the result dependent on these labels (supervised learning). This is a very interesting area, but we will not discuss it any further here.
Iterating and Improving the Model
Our proposed model is far from perfect. In fact at this point, it is little more than a toy model - helpful for developing a familiarity with the approach and for identifying areas we need to improve upon, but far from the finished product.
Altering the CPT values
First we assume the network is acceptable, focussing on improving the outputs of the model. We do this by investigating the values set in the CPTs and seeing if they could be improved.
We already mentioned that the unconditional output of the model for M (the probability that a medical is required, with no other evidence on the network) is a little high, at around 18%.
Fixing this is not trivial. As discussed, the interaction of the different CPTs makes it tricky to affect output values. This seems annoying but is a feature of the Bayesian network approach, rather than a bug.
First of all, surprising results do not mean wrong results; our intuition can be unreliable. It is best to ask questions of the model and see if the outputs make sense. If they do not, we then investigate our data and ensure there is a problem. If there is, we focus on the CPTs involved in the calculation and check them.
Suppose the unconditional probability for M really is too high and we want to reduce it.
Looking at the network, the nodes with the most influence on the calculation of M are the 'seriousness' nodes SS, SB and SH, together with the CPT for M itself.
Let's try reducing the probabilities of requiring a medical in the CPT for M:
m2 <- cptable(~ M | SS + SB + SH
             ,values = c(0.99, 0.01   # (SS = S, SB = S, SH = S)
                        ,0.80, 0.20   # (SS = N, SB = S, SH = S)
                        ,0.85, 0.15   # (SS = S, SB = N, SH = S)
                        ,0.75, 0.25   # (SS = N, SB = N, SH = S)
                        ,0.80, 0.20   # (SS = S, SB = S, SH = N)
                        ,0.40, 0.60   # (SS = N, SB = S, SH = N)
                        ,0.70, 0.30   # (SS = S, SB = N, SH = N)
                        ,0.05, 0.95   # (SS = N, SB = N, SH = N)
                        )
             ,levels = c("Medical", "NoMedical"));

underwriting.iteration.grain <- grain(compileCPT(list(hn
                                                     ,ts, tb, th
                                                     ,ds, db, dh
                                                     ,ss, sb, sh
                                                     ,m2)));

querygrain(underwriting.iteration.grain, nodes = c("M"))$M;
M
 Medical NoMedical
0.133279  0.866721
We see that reducing the strength of the effect of the 'seriousness' variables on M brings the unconditional probability of a medical down to about 13%.
Altering the Network
Often the model requires improvement more drastic than the tweaking of CPT values: altering the network itself, adding or redefining variables or their relationships.
One quick improvement to the model might involve adding more medical conditions that affect pricing. Currently we consider three medical conditions, and adding more is straightforward; the one issue is that each additional 'seriousness' variable multiplies the number of conditioning levels needed to specify the CPT for M.
There is a potential problem with adding conditions. Going back for a moment to the physical analogy, we are adding extra channels into the M node: each new condition is another route for conditional probability to flow towards requiring a medical, so the unconditional probability of M will tend to creep upwards.
Ultimately though, we may need to move beyond a Bayesian network, as they work best for discrete-value variables.6 Many of the medical readings are continuous in nature, such as weight, height, blood pressure, glucose levels etc, and models should reflect this as much as possible.
To be clear, it is always possible to discretise the continuous variables in a network. This can be effective, but it loses information in the data; it is better to use the continuous variables directly where we can.
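Discretisation itself is trivial; the hard part is choosing cut points that do not throw away too much information. A Python sketch for a hypothetical BMI reading (the bands are the usual WHO-style cut-offs, used purely for illustration):

```python
import bisect

# Cut points and the discrete levels a network node would use.
# These bands are illustrative, not a recommendation.
CUTS = [18.5, 25.0, 30.0]
LEVELS = ["Underweight", "Normal", "Overweight", "Obese"]

def discretise_bmi(bmi):
    """Map a continuous BMI value onto a discrete level."""
    return LEVELS[bisect.bisect_right(CUTS, bmi)]

print(discretise_bmi(22.0), discretise_bmi(27.5), discretise_bmi(33.0))
```

Every reading in a band maps to the same level, so the model can no longer distinguish, say, a BMI of 25.1 from 29.9; this is exactly the information loss mentioned above.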
DAGs may prove too limiting, as many biometric readings are likely to be interdependent, which is more complicated to capture with a Bayesian network. We could add hidden variables to model these mutual dependencies via conditional dependence, but this often involves questionable assumptions. Using cyclical networks for this may prove fruitful.
In short, the current model is really just the start: progress could be made along multiple approaches, all worthy of exploration at this point.
Conclusions and Summary
In this series of articles we looked at Bayesian networks and how they might be used in the area of fraud detection.
Despite obvious limitations, the approach seems sound and deals with major issues such as severe class imbalance and missing data in a natural way.
The 11-variable model we discussed, though simple, provides adequate scope for exploration and introspection, and is an interesting example of the approach.
I created the model on pen and paper in perhaps an hour, after a few aborted attempts, and we went through its implementation in R: nothing too complex. Despite this, it produced a number of surprising results and allowed us to learn some non-intuitive facts about the model.
I am happy with the outcomes produced, and I think Bayesian networks are an excellent way to start learning about probabilistic graphical models in general.
While not lacking for further topics, we will leave it here. A fit and proper treatment of the extra topics would take at least half a post each, and they may well be the subject of future posts as I explore the area further.
As always, if you have any comments, queries, corrections or criticisms, please get in touch with us; we would love to hear what you have to say.
The 0.10 probability of declaring as a smoker is possibly too high, but this goes back to the importance of precise definitions for variables. I consider dishonesty here as a personality trait that creates a tendency to under-declare health conditions, rather than an absolute. My thinking is that even dishonest people can tell the truth or get forms wrong. This is definitely where subject matter expertise is hugely useful to direct modifications. ↩
In this article I use marginal distributions for illustration rather than the joint or conditional distributions. Whilst less direct, I think marginal distributions are easier to understand: each is just a discrete univariate distribution of probabilities, and avoids the need to do arithmetic. I would rather execute more code and have the answer stare right at me than open up the possibility of misinterpreting the output. This stuff is subtle enough as it is. ↩
It is also possible to use this data to derive the structure of the Bayesian network itself. This is known as structural learning or model selection. Code to do this requires additional packages and comes with quite a health warning - silly outputs are all too common. We will not discuss this further. ↩
The bootstrap, and resampling techniques in general, is a fascinating approach that I will probably cover at some point with a blogpost in the near future. It is a hugely useful tool in almost all areas of statistical modelling. ↩
As always, the code used in writing this series is available on BitBucket. Get in touch with us if you would like access. ↩
I may be wrong about this limitation with discrete variables. Please correct me if this is wrong. ↩
Applied AI Ltd © 2016.