Rough Set Theory


I have been working in data mining for about a year now. I often read articles on CSDN and on blogs abroad, but I have never written up summaries of my own, and things fade from memory quickly, so starting now I want to use this blog to record what I learn at work.


This write-up on Rough Sets is organized mainly from the Wikipedia article. I looked through several other tutorials on Rough Sets and none of them felt very clear, while the Wikipedia entry is clear and well written, so I decided to summarize its content here.

Original article: https://en.wikipedia.org/wiki/Rough_set


1. Definitions

1.1 Information system framework

Let I = (\mathbb{U},\mathbb{A}) be an information system (attribute-value system), where:

       \mathbb{U} is a non-empty, finite set of objects (the universe), e.g. \mathbb{U} = \{O_{1}, \ldots, O_{10}\}

       \mathbb{A} is a non-empty, finite set of attributes, e.g. \mathbb{A} = \{P_{1}, \ldots, P_{5}\}

such that a:\mathbb{U} \rightarrow V_a for every a \in \mathbb{A}, where:

      V_a is the set of values that attribute a may take, e.g. V_a = \{0, 1, 2\} if a = P_{1}

      a(x) is the value that object x takes on attribute a, e.g. a(x) = 1 if a = P_{1} and x = O_{1}


An example information system table (reproduced from the linked Wikipedia article):

Object   P1   P2   P3   P4   P5
O1       1    2    0    1    1
O2       1    2    0    1    1
O3       2    0    0    1    0
O4       0    0    1    2    1
O5       2    1    0    2    1
O6       0    0    1    2    2
O7       2    0    0    1    0
O8       0    1    2    2    1
O9       2    1    0    2    2
O10      2    0    0    1    0

With any P \subseteq \mathbb{A} there is an associated equivalence relation \mathrm{IND}(P):

  \mathrm{IND}(P) = \left\{(x,y) \in \mathbb{U}^2 \mid \forall a \in P, a(x)=a(y)\right\}


The relation \mathrm{IND}(P) is called a P-indiscernibility relation.

The partition of \mathbb{U} is a family of all equivalence classes of \mathrm{IND}(P) and is denoted by \mathbb{U}/\mathrm{IND}(P) (or \mathbb{U}/P).

If (x,y)\in \mathrm{IND}(P), then x and y are indiscernible (or indistinguishable) by attributes from P .
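The P-indiscernibility partition \mathbb{U}/P can be computed by simply grouping objects on their attribute values. A minimal sketch; the tiny table below is a made-up illustration, not the article's table:

```python
def partition(table, attrs):
    """Return the equivalence classes of IND(attrs) as a list of sets.

    table: dict mapping object id -> {attribute: value}
    attrs: the attribute subset P
    """
    classes = {}
    for obj, row in table.items():
        # objects with identical values on every attribute in P are indiscernible
        key = tuple(row[a] for a in attrs)
        classes.setdefault(key, set()).add(obj)
    return list(classes.values())

# Hypothetical mini-table with two attributes.
table = {
    "O1": {"P1": 1, "P2": 2},
    "O2": {"P1": 1, "P2": 2},
    "O3": {"P1": 2, "P2": 0},
    "O4": {"P1": 0, "P2": 0},
}

print(sorted(sorted(c) for c in partition(table, ["P1", "P2"])))
# [['O1', 'O2'], ['O3'], ['O4']]
```

Passing a smaller attribute subset can only merge classes, never split them, since dropping attributes coarsens the key each object is grouped by.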


For example:

if P = \{P_{1},P_{2},P_{3},P_{4},P_{5}\}, then the equivalence classes are:

  \{O_{1},O_{2}\},\ \{O_{3},O_{7},O_{10}\},\ \{O_{4}\},\ \{O_{5}\},\ \{O_{6}\},\ \{O_{8}\},\ \{O_{9}\}


if attribute P = \{P_{1}\} alone is selected, then the equivalence classes are:

  \{O_{1},O_{2}\},\ \{O_{3},O_{5},O_{7},O_{9},O_{10}\},\ \{O_{4},O_{6},O_{8}\}


1.2 Definition of a Rough Set

Let X \subseteq \mathbb{U} be a target set that we wish to represent using attribute subset P.

For example

consider the target set X = \{O_{1},O_{2},O_{3},O_{4}\}, and let attribute subset P = \{P_{1}, P_{2}, P_{3}, P_{4}, P_{5}\}, the full available set of features. The set X cannot be expressed exactly, because in [x]_P, objects \{O_{3}, O_{7}, O_{10}\} are indiscernible. Thus, there is no way to represent any set X which includes O_{3} but excludes objects O_{7} and O_{10}.

However, the target set X can be approximated using only the information contained within P by constructing the P-lower and P-upper approximations of X:

  {\underline P}X= \{x \mid [x]_P \subseteq X\}
  {\overline P}X = \{x \mid [x]_P \cap X \neq \emptyset \}


Lower approximation and positive region


{\underline P}X = \{O_{1}, O_{2}\} \cup \{O_{4}\}  
is the union of all equivalence classes in [x]_P which are contained by (i.e., are subsets of) the target set.

The lower approximation is the complete set of objects in \mathbb{U}/P that can be positively (i.e., unambiguously) classified as belonging to target set X.

Upper approximation and negative region


{\overline P}X = \{O_{1}, O_{2}\} \cup \{O_{4}\} \cup \{O_{3}, O_{7}, O_{10}\}
is the union of all equivalence classes in [x]_P which have non-empty intersection with the target set.

The upper approximation is the complete set of objects in \mathbb{U}/P that cannot be positively (i.e., unambiguously) classified as belonging to the complement (\overline X) of the target set X. In other words, the upper approximation is the complete set of objects that are possibly members of the target set X. The set \mathbb{U}-{\overline P}X therefore represents the negative region, containing the set of objects that can be definitely ruled out as members of the target set.

Boundary region

The boundary region, given by set difference {\overline P}X - {\underline P}X, consists of those objects that can neither be ruled in nor ruled out as members of the target set X.

The rough set

The tuple \langle{\underline P}X,{\overline P}X\rangle composed of the lower and upper approximation is called a rough set; thus, a rough set is composed of two crisp sets, one representing a lower boundary of the target set X, and the other representing an upper boundary of the target set X.



The accuracy of the rough-set representation of the set X is defined as:

  \alpha_{P}(X) = \frac{\left|{\underline P}X\right|}{\left|{\overline P}X\right|}

That is, the accuracy \alpha_{P}(X), with 0 \leq \alpha_{P}(X) \leq 1, is the ratio of the number of objects which can positively be placed in X to the number of objects that can possibly be placed in X – this provides a measure of how closely the rough set approximates the target set. In the example above, \alpha_{P}(X) = 3/6 = 0.5.
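The two approximations and the accuracy ratio fall out of the partition directly. A sketch, using the equivalence classes and target set from this running example:

```python
def approximations(classes, X):
    """P-lower and P-upper approximations of X from the classes of U/P."""
    lower, upper = set(), set()
    for c in classes:
        if c <= X:      # class entirely inside X: certainly in X
            lower |= c
        if c & X:       # class meets X: possibly in X
            upper |= c
    return lower, upper

# Equivalence classes of U/P and the target set from the running example.
classes = [{"O1", "O2"}, {"O3", "O7", "O10"}, {"O4"},
           {"O5"}, {"O6"}, {"O8"}, {"O9"}]
X = {"O1", "O2", "O3", "O4"}

lower, upper = approximations(classes, X)
accuracy = len(lower) / len(upper)
print(sorted(lower))   # ['O1', 'O2', 'O4']
print(sorted(upper))   # ['O1', 'O10', 'O2', 'O3', 'O4', 'O7']
print(accuracy)        # 0.5
```

The boundary region is then simply `upper - lower`, here \{O_{3}, O_{7}, O_{10}\}.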


1.3 Definability


If {\overline P}X = {\underline P}X, we say that X is definable on attribute set P; otherwise, X is undefinable.


  • Set X is internally undefinable if {\underline P}X \neq \emptyset and {\overline P}X = \mathbb{U}. This means that on attribute set P, there are objects which we can be certain belong to target set X, but there are no objects which we can definitively exclude from set X.
  • Set X is externally undefinable if {\underline P}X = \emptyset and {\overline P}X \neq \mathbb{U}. This means that on attribute set P, there are no objects which we can be certain belong to target set X, but there are objects which we can definitively exclude from set X.
  • Set X is totally undefinable if {\underline P}X = \emptyset and {\overline P}X = \mathbb{U}. This means that on attribute set P, there are no objects which we can be certain belong to target set X, and there are no objects which we can definitively exclude from set X. Thus, on attribute set P, we cannot decide whether any object is, or is not, a member of X.
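These cases, together with the definable case and the remaining one (which Wikipedia calls roughly definable: non-empty lower approximation and an upper approximation short of \mathbb{U}), can be folded into one small classifier. A sketch:

```python
def definability(lower, upper, universe):
    """Classify a target set X from its P-approximations, per the cases above."""
    if lower == upper:
        return "definable"
    if lower and upper == universe:
        return "internally undefinable"
    if not lower and upper != universe:
        return "externally undefinable"
    if not lower and upper == universe:
        return "totally undefinable"
    return "roughly definable"   # non-empty lower, upper a proper subset of U

U = {"O1", "O2", "O3", "O4"}
print(definability({"O1"}, {"O1", "O2"}, U))  # roughly definable
print(definability(set(), U, U))              # totally undefinable
```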

1.4 Reduct and core

Formally, a reduct is a subset of attributes \mathrm{RED} \subseteq P such that

  • [x]_{\mathrm{RED}} = [x]_P, that is, the equivalence classes induced by the reduced attribute set \mathrm{RED} are the same as the equivalence class structure induced by the full attribute set P.
  • the attribute set \mathrm{RED} is minimal, in the sense that [x]_{(\mathrm{RED}-\{a\})} \neq [x]_P for any attribute a \in \mathrm{RED}; in other words, no attribute can be removed from set \mathrm{RED} without changing the equivalence classes [x]_P.

So, for example:

attribute set \{P_3,P_4,P_5\} is a reduct, and the equivalence-class structure it induces is:

  \{O_{1},O_{2}\},\ \{O_{3},O_{7},O_{10}\},\ \{O_{4}\},\ \{O_{5}\},\ \{O_{6}\},\ \{O_{8}\},\ \{O_{9}\}

which is the same as that of P = \{P_{1}, P_{2}, P_{3}, P_{4}, P_{5}\}. So we can say that the former is a reduct of the latter. Moreover, a reduct is not unique: in this instance, \{P_1,P_2,P_5\} is also a reduct of P = \{P_{1}, P_{2}, P_{3}, P_{4}, P_{5}\}.



The set of attributes which is common to all reducts is called the core.

For the two reducts \{P_1,P_2,P_5\} and \{P_3,P_4,P_5\}, the common attribute is P_{5}, which is therefore the core of this equivalence-class structure: if P_{5} is dropped from the attribute set, the equivalence-class structure changes.


Note that it is possible for the core to be empty, which means that there is no indispensable attribute: any single attribute in such an information system can be deleted without altering the equivalence-class structure. In such cases, there is no essential or necessary attribute which is required for the class structure to be represented.
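For small tables, reducts and the core can be found by brute force: check every attribute subset, smallest first, and keep those that reproduce the full equivalence-class structure without containing an already-found reduct. A sketch; the three-object table is hypothetical, and the search is exponential, so this is only for tiny examples:

```python
from itertools import combinations

def classes_of(table, attrs):
    """Equivalence-class structure induced by an attribute subset, as a hashable value."""
    classes = {}
    for obj, row in table.items():
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(obj)
    return frozenset(frozenset(c) for c in classes.values())

def reducts(table, attrs):
    """All minimal attribute subsets inducing the same classes as the full set."""
    full = classes_of(table, attrs)
    found = []
    for r in range(1, len(attrs) + 1):
        for subset in combinations(attrs, r):
            # a proper superset of a reduct already found cannot be minimal
            if any(set(subset) > set(f) for f in found):
                continue
            if classes_of(table, subset) == full:
                found.append(subset)
    return found

def core(reds):
    """Attributes common to all reducts."""
    return set.intersection(*map(set, reds)) if reds else set()

# Hypothetical three-object table.
table = {
    "O1": {"a": 1, "b": 0, "c": 1},
    "O2": {"a": 1, "b": 1, "c": 1},
    "O3": {"a": 0, "b": 1, "c": 0},
}
reds = reducts(table, ["a", "b", "c"])
print(reds)        # [('a', 'b'), ('b', 'c')]
print(core(reds))  # {'b'}
```

Because subsets are visited in order of increasing size, any subset that matches the full structure and contains no earlier reduct is automatically minimal.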


The article goes on to cover Attribute dependency, Rule extraction, and Incomplete data; I'll come back to those three parts later.

————————————————————————————————————————

After a long break, I can finally get back to my own little research. I recently read a paper that combines Rough Set theory with DS (Dempster–Shafer) evidence theory, using rough sets to compute weights. Let me organize that here:

Basic definitions:


Since the formulas used in the paper differ slightly from those on the wiki, let's restate the definitions.

Define an information system S = (\mathbb{U}, C, D), where:

\mathbb{U} is a non-empty set of objects, i.e. the O_1, O_2, \ldots above;

C = \{C_1, C_2, \ldots, C_m\} is the set of condition attributes, m in total;

D = \{D_1, D_2, \ldots, D_n\} is the set of decision attributes, n in total, though usually we have only one.




The equivalence relation over the full condition-attribute set C is then defined as:

  \mathrm{IND}(C) = \left\{(x,y) \in \mathbb{U}^2 \mid \forall c \in C,\ c(x) = c(y)\right\}

That is, two objects x and y are equivalent (i.e. indiscernible) if their values agree on every condition attribute in C.




Similarly, the equivalence relation after removing C_j (the j-th condition attribute) from the condition-attribute set C is defined as:

  \mathrm{IND}(C - \{C_j\}) = \left\{(x,y) \in \mathbb{U}^2 \mid \forall c \in C - \{C_j\},\ c(x) = c(y)\right\}

That is, ignoring the j-th condition attribute, if x and y agree on all the remaining condition attributes, they are equivalent, i.e. indiscernible.



For the decision attributes D there is likewise an equivalence relation, defined as:

  \mathrm{IND}(D) = \left\{(x,y) \in \mathbb{U}^2 \mid \forall d \in D,\ d(x) = d(y)\right\}

This one is also simple: two objects x and y are equivalent as long as all their decision-attribute values agree.




From the equivalence relations above, we obtain the knowledge systems for the different settings:

\mathbb{U}/\mathrm{IND}(C) is the knowledge system of the objects \mathbb{U} with respect to the full condition-attribute set C, equivalent to P = \{P_{1},P_{2},P_{3},P_{4},P_{5}\} above;

\mathbb{U}/\mathrm{IND}(C - \{C_j\}) is the knowledge system of \mathbb{U} after the j-th condition attribute has been removed, and \mathbb{U}/\mathrm{IND}(D) is the knowledge system of \mathbb{U} with respect to the decision attributes.



Weight calculation:

With the basic definitions above in place, we can compute each condition attribute's weight (its importance to the decision attributes):

Definition 1: the dependency of the decision attributes D on the full condition-attribute set C.

The formula is defined as:

  E(D \mid C) = -\sum_{[x] \in \mathbb{U}/\mathrm{IND}(C)} \frac{|[x]|}{|\mathbb{U}|} \sum_{[y] \in \mathbb{U}/\mathrm{IND}(D)} \frac{|[x] \cap [y]|}{|[x]|} \ln \frac{|[x] \cap [y]|}{|[x]|}

This formula is based on entropy theory, where:

|[x]| / |\mathbb{U}| is the fraction of all objects that fall into the class [x] of the knowledge system \mathbb{U}/\mathrm{IND}(C);

|[x] \cap [y]| / |[x]| is, when relating the decision attributes to the condition attributes, the fraction of a class [x] \in \mathbb{U}/\mathrm{IND}(C) that falls inside a class [y] \in \mathbb{U}/\mathrm{IND}(D).



Definition 2: the significance of the j-th condition attribute.

The significance of the j-th condition attribute is computed by removing it from the condition-attribute set and observing how the entropy of the system changes. The formula is defined as:

  \mathrm{sig}(C_j) = E(D \mid C - \{C_j\}) - E(D \mid C)

That is, the significance of the j-th condition attribute is obtained as the difference between the entropy without it and the entropy with it.



Definition 3: computing the (normalized) weights.

Definition 2 lets us compute the significance of every condition attribute; normalizing these significances yields each condition attribute's weight. The formula is defined as:

  w_j = \frac{\mathrm{sig}(C_j)}{\sum_{k=1}^{m} \mathrm{sig}(C_k)}
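Definitions 1–3 can be wired together in a few lines, reading Definition 1 as the conditional entropy E(D|C): each class of \mathbb{U}/\mathrm{IND}(C), weighted by its share of the objects, contributes the entropy of its overlap fractions with the decision classes, with 0 \times \ln 0 taken as 0. A sketch under that reading; the column names E1, E2, E3, D are stand-ins of my own:

```python
from math import log

def classes_of(table, attrs):
    """Equivalence classes of IND(attrs)."""
    classes = {}
    for obj, row in table.items():
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(obj)
    return list(classes.values())

def cond_entropy(n, cond_classes, dec_classes):
    """Definition 1: E(D|C) = -sum p([x]) sum p([y]|[x]) ln p([y]|[x])."""
    e = 0.0
    for cx in cond_classes:
        px = len(cx) / n                 # |[x]| / |U|
        for dy in dec_classes:
            p = len(cx & dy) / len(cx)   # |[x] ∩ [y]| / |[x]|
            if p > 0:                    # 0 * ln 0 is taken to be 0
                e -= px * p * log(p)
    return e

def weights(table, cond_attrs, dec_attrs):
    """Definitions 2 and 3: significances by attribute removal, then normalize."""
    n = len(table)
    dec = classes_of(table, dec_attrs)
    base = cond_entropy(n, classes_of(table, cond_attrs), dec)
    sig = [cond_entropy(n, classes_of(table, [c for c in cond_attrs if c != a]), dec) - base
           for a in cond_attrs]
    total = sum(sig)
    return [s / total if total else 0.0 for s in sig]

# The seven-company expert data from the example that follows.
data = {
    1: {"E1": 1, "E2": 0, "E3": 1, "D": 1},
    2: {"E1": 1, "E2": 1, "E3": 1, "D": 1},
    3: {"E1": 0, "E2": 1, "E3": 1, "D": 0},
    4: {"E1": 1, "E2": 1, "E3": 1, "D": 1},
    5: {"E1": 0, "E2": 0, "E3": 0, "D": 0},
    6: {"E1": 0, "E2": 0, "E3": 0, "D": 0},
    7: {"E1": 0, "E2": 0, "E3": 0, "D": 1},
}
print(weights(data, ["E1", "E2", "E3"], ["D"]))  # [1.0, 0.0, 0.0]
```

Note that removing an attribute can only coarsen the partition, so the conditional entropy can only stay the same or grow, and the significances are never negative.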


Likewise, let's work through an example:

Suppose three economic experts each predict whether seven companies will go bankrupt (condition attributes, |C| = 3), and a retrospective study gives us the companies' actual outcomes (decision attribute, |D| = 1). The data are shown in the table below:

company | Expert 1 | Expert 2 | Expert 3 | Actual financial condition
   1    |    1     |    0     |    1     |    1
   2    |    1     |    1     |    1     |    1
   3    |    0     |    1     |    1     |    0
   4    |    1     |    1     |    1     |    1
   5    |    0     |    0     |    0     |    0
   6    |    0     |    0     |    0     |    0
   7    |    0     |    0     |    0     |    1

OK, from the data in the table we can now compute the weight of each expert's predictions.


1. First, following the basic definitions, we compute the equivalence relations and knowledge systems we will need:

  \mathbb{U}/\mathrm{IND}(C) = \{\{1\},\ \{2,4\},\ \{3\},\ \{5,6,7\}\}
  \mathbb{U}/\mathrm{IND}(D) = \{\{1,2,4,7\},\ \{3,5,6\}\}

Since our weighting principle is entropy-based, we first compute the entropy of the system in which all the experts participate. By Definition 1, only the class \{5,6,7\} is mixed with respect to the decision attribute, so:

  E(D \mid C) = -\frac{3}{7}\left(\frac{1}{3}\ln\frac{1}{3} + \frac{2}{3}\ln\frac{2}{3}\right) \approx 0.273

By Definition 2, removing Expert 1 merges \{2,4\} and \{3\} into one class, raising the entropy to E(D \mid C - \{C_1\}) \approx 0.546, so \mathrm{sig}(C_1) \approx 0.273; removing Expert 2 or Expert 3 leaves the entropy unchanged, so \mathrm{sig}(C_2) = \mathrm{sig}(C_3) = 0.

One point to note here is the value of 0 \times \ln 0 in these computations: in information-entropy theory, 0 \times \ln 0 is taken to be 0 (see the Wikipedia article on information entropy).

By Definition 3, we get:

  w_1 = 1, \quad w_2 = w_3 = 0

From the computations above we conclude that Expert 1 carries a decision weight of 1, while Experts 2 and 3 both carry weight zero (in other words, Expert 1's predictions are reliable, while Experts 2 and 3 are not, at least compared with Expert 1).


OK, that's all.

Thanks for reading!


