Implementing Bayesian Networks with the bnlearn Package in R


Reposted from: http://f.dataguru.cn/thread-301701-1-1.html


1. Loading the Packages and Importing the Data

library(bnlearn)  # available on CRAN; install with install.packages("bnlearn"), or download the package and copy it into the library folder.

library(Rgraphviz)  # used for plotting. This package is not on CRAN; download it from http://www.bioconductor.org/pack ... ws.html#___Software



data(learning.test)  # load the example data; every variable in the data frame must be either a factor (discrete) or numeric (continuous).

lear.test = read.csv("***.csv", colClasses = "factor")  # data can also be imported directly from a CSV file. Note that 0-1 Boolean columns or 1-3 ordinal codes must be coerced to factors explicitly, otherwise many BN functions will fail: the read functions only convert character columns to factors automatically.



The package covers the three main tasks for Bayesian networks: structure learning (via constraint-based, score-based and hybrid algorithms), parameter learning (via maximum likelihood and Bayesian estimators) and inference.

This package implements some algorithms for learning the structure of Bayesian networks.

Constraint-based algorithms, also known as conditional independence learners, are all optimized derivatives of the Inductive Causation algorithm (Verma and Pearl, 1991). These algorithms use conditional independence tests to detect the Markov blankets of the variables, which in turn are used to compute the structure of the Bayesian network.

Score-based learning algorithms are general purpose heuristic optimization algorithms which rank network structures with respect to a goodness-of-fit score.

Hybrid algorithms combine aspects of both constraint-based and score-based algorithms, as they use conditional independence tests (usually to reduce the search space) and network scores (to find the optimal network in the reduced space) at the same time.

Several functions for parameter estimation, parametric inference, bootstrap, cross-validation and stochastic simulation are available. Furthermore, advanced plotting capabilities are implemented on top of the Rgraphviz and lattice packages. 
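
As a quick sketch of these capabilities (the function choices and argument values below are illustrative, using the bundled learning.test data; hc is the hill-climbing search introduced in Section 3):

library(bnlearn)
data(learning.test)

net = hc(learning.test)                              # structure learning (score-based)
fitted = bn.fit(net, learning.test, method = "mle")  # parameter learning; method = "bayes" also works
sim = rbn(fitted, n = 1000)                          # stochastic simulation: draw 1000 new observations
cv = bn.cv(learning.test, bn = "hc", k = 10)         # 10-fold cross-validation of the whole workflow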

2. Constraint-Based Algorithms

The constraint-based algorithms available in bnlearn are gs, iamb, fast.iamb and inter.iamb.

Calling them is simple: just pass the data frame as the argument to the function. During structure learning you can also define blacklists and whitelists of arcs to bring expert knowledge into the learning process, as the sketch after the code below shows.

res = gs(learning.test)
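
Blacklists and whitelists are passed as data frames of arcs; a minimal sketch, where the specific constraints (forbid A -> F, require B -> E) are hypothetical examples of expert knowledge:

bl = data.frame(from = "A", to = "F")  # hypothetical: forbid the arc A -> F
wl = data.frame(from = "B", to = "E")  # hypothetical: require the arc B -> E
res.expert = gs(learning.test, blacklist = bl, whitelist = wl)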



Grow-Shrink (gs): based on the Grow-Shrink Markov blanket algorithm, the first (and simplest) Markov blanket detection algorithm (Margaritis, 2003) used in a structure learning algorithm.

Incremental Association (iamb): based on the Markov blanket detection algorithm of the same name (Tsamardinos et al., 2003), which relies on a two-phase selection scheme (a forward selection followed by an attempt to remove false positives).

Fast Incremental Association (fast.iamb): a variant of IAMB which uses speculative stepwise forward selection to reduce the number of conditional independence tests (Yaramakala and Margaritis, 2005).

Interleaved Incremental Association (inter.iamb): another variant of IAMB which uses forward stepwise selection (Tsamardinos et al., 2003) to avoid false positives in the Markov blanket detection phase.



The computational complexity of these algorithms is polynomial in the number of tests, usually O(N^2) (O(N^4) in the worst case), where N is the number of variables. Execution time scales linearly with the size of the data set.



Available (conditional) independence tests

The conditional independence tests used by the constraint-based algorithms are, in practice, statistical tests on the data set. Available tests (and the respective labels) are:



discrete case (multinomial distribution)

mutual information: an information-theoretic distance measure. It is proportional to the log-likelihood ratio (they differ by a factor of 2n) and is related to the deviance of the tested models. The asymptotic chi-square test (mi), the Monte Carlo permutation test (mc-mi), the sequential Monte Carlo permutation test (smc-mi) and the semiparametric test (sp-mi) are implemented.

shrinkage estimator for the mutual information (mi-sh): an improved asymptotic chi-square test based on the James-Stein estimator for the mutual information.

Pearson's X^2: the classical Pearson's X^2 test for contingency tables. The asymptotic chi-square test (x2), the Monte Carlo permutation test (mc-x2), the sequential Monte Carlo permutation test (smc-x2) and the semiparametric test (sp-x2) are implemented.



continuous case (multivariate normal distribution)

linear correlation: the linear correlation coefficient. The exact Student's t test (cor), the Monte Carlo permutation test (mc-cor) and the sequential Monte Carlo permutation test (smc-cor) are implemented.

Fisher's Z: a transformation of the linear correlation with asymptotic normal distribution. Used by commercial software (such as TETRAD II) for the PC algorithm (an R implementation is present in the pcalg package on CRAN). The asymptotic normal test (zf), the Monte Carlo permutation test (mc-zf) and the sequential Monte Carlo permutation test (smc-zf) are implemented.

mutual information: the same information-theoretic distance measure as in the discrete case; again it is proportional to the log-likelihood ratio (they differ by a factor of 2n). The asymptotic chi-square test (mi-g), the Monte Carlo permutation test (mc-mi-g) and the sequential Monte Carlo permutation test (smc-mi-g) are implemented.

shrinkage estimator for the mutual information (mi-g-sh): as in the discrete case, an improved asymptotic chi-square test based on the James-Stein estimator for the mutual information.
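
These labels are passed to the learning functions through the test argument, and can also be used directly with ci.test; a minimal sketch (the choice of tests, alpha and the number of permutations B is illustrative):

ci.test("B", "A", data = learning.test, test = "mi")   # standalone test: is B independent of A?
res.mi = gs(learning.test, test = "mi", alpha = 0.05)  # asymptotic chi-square test
res.mc = gs(learning.test, test = "mc-mi", B = 200)    # Monte Carlo permutation test, 200 permutations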



3. Score-Based Algorithms



Hill-Climbing (hc): a greedy hill climbing search on the space of directed graphs. The optimized implementation uses score caching, score decomposability and score equivalence to reduce the number of duplicated tests.

Tabu Search (tabu): a modified hill climbing able to escape local optima by selecting a network that minimally decreases the score function.



Random restart with a configurable number of perturbing operations is implemented for both algorithms.
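
A minimal sketch of both searches (the score and the restart/tabu settings are illustrative):

res.hc = hc(learning.test, score = "bic", restart = 5, perturb = 10)  # 5 random restarts, 10 perturbed arcs each
res.tabu = tabu(learning.test, score = "bic", tabu = 15)              # tabu list of the last 15 structures
all.equal(res.hc, res.tabu)                                           # do the two searches agree?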



Available network scores (and the respective labels) are:

discrete case (multinomial distribution)

the multinomial log-likelihood (loglik) score, which is equivalent to the entropy measure used in Weka.

the Akaike Information Criterion score (aic).

the Bayesian Information Criterion score (bic), which is equivalent to the Minimum Description Length (MDL) and is also known as the Schwarz Information Criterion.

the logarithm of the Bayesian Dirichlet equivalent score (bde), a score equivalent Dirichlet posterior density.

the logarithm of the modified Bayesian Dirichlet equivalent score (mbde) for mixtures of experimental and observational data (not score equivalent).

the logarithm of the K2 score (k2), a Dirichlet posterior density (not score equivalent; a K2 implementation is also available in the Matlab Bayes Net Toolbox).



continuous case (multivariate normal distribution)



the multivariate Gaussian log-likelihood (loglik-g) score.

the corresponding Akaike Information Criterion score (aic-g).

the corresponding Bayesian Information Criterion score (bic-g).

a score equivalent Gaussian posterior density (bge).
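
These labels go into the score argument of the search functions, or into score() to evaluate a fixed network; a minimal sketch (the scores compared and the iss value are illustrative):

net = hc(learning.test, score = "bic")
score(net, learning.test, type = "loglik")
score(net, learning.test, type = "aic")
score(net, learning.test, type = "bic")
score(net, learning.test, type = "bde", iss = 10)  # iss = imaginary sample size of the Dirichlet prior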



***Note*****************

The log-likelihood of the model is the value that is maximized by the procedure that computes the maximum likelihood estimates of the parameters.

The Deviance is equal to -2*log-likelihood.

Akaike's Information Criterion (AIC) is -2*log-likelihood + 2*k, where k is the number of estimated parameters.

The Bayesian Information Criterion (BIC) is -2*log-likelihood + k*log(n), where k is the number of estimated parameters and n is the sample size. The Bayesian Information Criterion is also known as the Schwarz criterion.

The Akaike Information Criterion (AIC), created and developed by the Japanese statistician Hirotugu Akaike, is a standard measure of the goodness of fit of a statistical model. It is built on the concept of entropy and weighs the complexity of the estimated model against how well the model fits the data.

In the general case AIC can be written as AIC = (2k - 2L)/n, under the assumption that the model errors are independent and normally distributed, where k is the number of parameters, L is the log-likelihood and n is the number of observations.

The size of AIC depends on L and k: the smaller k is, the smaller AIC becomes, and the larger L is, the smaller AIC becomes. A small k means a parsimonious model and a large L means an accurate fit, so AIC, much like the adjusted coefficient of determination, balances parsimony against accuracy when evaluating models.

Concretely, for a linear model L = -(n/2)*ln(2*pi) - (n/2)*ln(SSE/n) - n/2, where n is the sample size and SSE is the residual sum of squares.

This shows that adding free parameters always improves the fit: AIC rewards goodness of fit but penalizes model size to avoid overfitting, so the preferred model is the one with the smallest AIC. In other words, the AIC approach looks for the model that explains the data best with the fewest free parameters.
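
A small sketch verifying these formulas in R on an ordinary linear model (the built-in cars data set is just an illustrative choice):

fit = lm(dist ~ speed, data = cars)
n = nrow(cars)
k = length(coef(fit)) + 1   # estimated parameters, counting the error variance
sse = sum(residuals(fit)^2)

ll = -(n/2)*log(2*pi) - (n/2)*log(sse/n) - n/2  # the log-likelihood formula above
all.equal(ll, as.numeric(logLik(fit)))          # TRUE

c(manual = -2*ll + 2*k, builtin = AIC(fit))       # AIC = -2*log-likelihood + 2*k
c(manual = -2*ll + k*log(n), builtin = BIC(fit))  # BIC = -2*log-likelihood + k*log(n)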



4. Hybrid Algorithms



Max-Min Hill-Climbing (mmhc): a hybrid algorithm which combines the Max-Min Parents and Children algorithm (to restrict the search space) and the Hill-Climbing algorithm (to find the optimal network structure in the restricted space).

Restricted Maximization (rsmax2): a more general implementation of the Max-Min Hill-Climbing, which can use any combination of constraint-based and score-based algorithms.
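
A minimal sketch of both hybrid algorithms (the restrict/maximize combination passed to rsmax2 is an illustrative choice):

res.mmhc = mmhc(learning.test)                                      # MMPC restrict phase + hill-climbing maximize phase
res.rs = rsmax2(learning.test, restrict = "mmpc", maximize = "hc")  # the same combination, spelled out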



5. Other (Constraint-Based) Local Discovery Algorithms

These algorithms learn the structure of the undirected graph underlying the Bayesian network, which is known as the skeleton of the network or the (partial) correlation graph. All the arcs are therefore undirected, and no attempt is made to detect their orientation. They are often used inside hybrid learning algorithms.

Max-Min Parents and Children (mmpc): a forward selection technique for neighbourhood detection based on the maximization of the minimum association measure observed with any subset of the nodes selected in the previous iterations (Tsamardinos, Brown and Aliferis, 2006).

Hiton Parents and Children (si.hiton.pc): a fast forward selection technique for neighbourhood detection, designed to exclude nodes early based on the marginal association. The implementation follows the Semi-Interleaved variant of the algorithm described in Aliferis et al. (2010).

Chow-Liu (chow.liu): an application of the minimum-weight spanning tree and the information inequality. It learns the tree structure closest to the true one in the probability space (Chow and Liu, 1968).

ARACNE (aracne): an improved version of the Chow-Liu algorithm that is able to learn polytrees (Margolin et al., 2006).

6. Bayesian Network Classifiers

The algorithms in this family are aimed at classification, and favour predictive power over the ability to recover the correct network structure. The implementation in bnlearn assumes that all variables, including the class variable, are discrete.

Naive Bayes (naive.bayes): a very simple algorithm assuming that all the explanatory variables are independent, and using the posterior probability of the target variable for classification.

Tree-Augmented Naive Bayes (tree.bayes): an improvement over naive Bayes, this algorithm uses the Chow-Liu algorithm to approximate the dependence structure of the explanatory variables.
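
A minimal sketch of training and using a naive Bayes classifier (picking A of learning.test as the class variable is purely illustrative):

nb = naive.bayes(learning.test, training = "A")  # A is the class variable
pred = predict(nb, learning.test)                # parameters are estimated on the fly from the data
table(pred, learning.test$A)                     # confusion matrix on the training data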


