6. How to deal with noise and incomplete logs: Heuristic approaches

The formal approach presented in the preceding section presupposes perfect information: (i) the log must be complete (i.e., if a task can follow another task directly, the log should contain an example of this behavior) and (ii) there is no noise in the log (i.e., everything that is registered in the log is correct). However, in practical situations logs are rarely complete and/or noise free. It therefore becomes more difficult to decide whether one of the three basic relations (i.e., a -->W b, a #W b, and a ||W b) holds between two events, say a and b. For instance, the causality relation (a -->W b) between two tasks a and b holds if and only if the log contains a trace in which a is directly followed by b (i.e., the relation a >W b holds) and contains no trace in which b is directly followed by a (i.e., not b >W a). In a noisy situation, however, a single erroneous example can completely derail the derivation of the right conclusion. For this reason we try to develop heuristic mining techniques that are less sensitive to noise and to the incompleteness of logs. Moreover, we try to overcome some other limitations of the α algorithm (e.g., certain kinds of loops and non-free-choice constructs).

In our heuristic approaches [43,61,62] we distinguish three mining steps: Step (i) the construction of a dependency/frequency table (D/F-table), Step (ii) the mining of the basic relations out of the D/F-table (the mining of the R-table), and Step (iii) the reconstruction of the WF-net out of the R-table.

6.1. Construction of the dependency/frequency table

The starting point in our workflow mining techniques is the construction of a D/F-table. For each task a the following information is abstracted out of the workflow log: (i) the overall frequency of task a (notation #A[3]), (ii) the frequency of task a directly preceded by task b (notation #B < A), (iii) the frequency of a directly followed by task b (notation #A > B), (iv) the frequency of a directly or indirectly preceded by task b but before the previous appearance of b (notation #B<<<A), (v) the frequency of a directly or indirectly followed by task b but before the next appearance of a (notation #A>>>B), and finally (vi) a metric that indicates the strength of the causal relation between task a and another task b (notation #A -->B).
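As an illustration of metrics (i) to (iii), the following minimal Python sketch counts the overall task frequencies and the direct-succession frequencies. It assumes that the workflow log is available as a list of traces, each a list of task names. The function names and this log representation are illustrative assumptions, not part of the original approach. The sketch also assumes the reading that #B < A counts adjacent pairs in which b comes immediately before a.

```python
from collections import Counter


def basic_frequencies(traces):
    """Count #A and #A>B over a list of traces (each a list of task names)."""
    task_count = Counter()  # task_count[a] = #A, overall frequency of task a
    follows = Counter()     # follows[(a, b)] = #A>B, a directly followed by b
    for trace in traces:
        task_count.update(trace)
        # zip pairs every event with its direct successor in the same trace
        follows.update(zip(trace, trace[1:]))
    return task_count, follows


def directly_preceded(follows, b, a):
    """#B<A: frequency of a directly preceded by b.

    Assumption: this is the number of adjacent pairs (b, a) in the log.
    """
    return follows[(b, a)]


if __name__ == "__main__":
    log = [["a", "b", "c"], ["a", "c", "b"]]
    task_count, follows = basic_frequencies(log)
    print(task_count["a"], follows[("a", "b")], directly_preceded(follows, "a", "b"))
```

Metrics (iv) and (v) would need an additional scan per task occurrence and are left out of this sketch.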

Metrics (i) through (v) seem clear without extra explanation. The underlying intuition of metric (vi) is as follows. If it is always the case that, when task a occurs, task b also occurs shortly afterwards, then it is plausible that task a causes the occurrence of task b. On the other hand, if task b occurs (shortly) before task a, it is implausible that task a is the cause of task b. Below we formalize this intuition. If, in an event stream, task a occurs before task b and n is the number of intermediary events between them, the #A-->B causality counter is incremented by δ^n (δ is a causality fall factor with δ ∈ [0.0, 1.0]). In our experiments δ is set to 0.8. The effect is that the contribution to the causality metric is maximal (namely 1) if task b appears directly after task a (then n = 0 and δ^n = 1) and decreases as the distance increases. The process of looking forward from task a to the occurrence of task b stops after the first occurrence of task a or task b. If task b occurs before task a and n is again the number of intermediary events between them, the #A-->B causality counter is decreased by δ^n.

After processing the whole workflow log, the #A-->B causality counter is divided by the minimum of the overall frequencies of tasks a and b (i.e., min(#A,#B)). Note that the value of #A-->B can be relatively high even when there is no trace in the log in which a is directly followed by b (i.e., the log is not complete).
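The computation of metric (vi) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the log is assumed to be a list of traces (lists of task names), the function name is hypothetical, and the decrement for the reverse direction is assumed to use the same forward scan and the same stopping rule as the increment, which the text does not spell out. Each contribution is δ raised to the power n, where n is the number of intermediary events.

```python
from collections import Counter, defaultdict


def causality_metric(traces, delta=0.8):
    """Return {(a, b): normalized #A-->B value} for all observed task pairs."""
    task_count = Counter()        # overall frequencies #A
    counter = defaultdict(float)  # raw causality counters
    for trace in traces:
        task_count.update(trace)
        for i, x in enumerate(trace):
            seen = set()  # tasks already reached from this occurrence of x
            for j in range(i + 1, len(trace)):
                y = trace[j]
                if y == x:
                    break      # stop at the next occurrence of x
                if y in seen:
                    continue   # only the first occurrence of y counts
                seen.add(y)
                weight = delta ** (j - i - 1)  # n = j - i - 1 intermediary events
                counter[(x, y)] += weight      # x occurred before y
                counter[(y, x)] -= weight      # assumption: symmetric decrement
    # Divide each counter by min(#A, #B)
    return {
        (a, b): value / min(task_count[a], task_count[b])
        for (a, b), value in counter.items()
    }


if __name__ == "__main__":
    for pair, value in sorted(causality_metric([["a", "b", "c"]]).items()):
        print(pair, value)
```

For the single trace a, b, c the sketch yields 1.0 for the pair (a, b), 0.8 for (a, c), and -1.0 for (b, a). The pair (a, c) receives a positive value even though a is never directly followed by c.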

6.2. The basic relations table (R-table) out of the D/F-table

Using relatively simple heuristics, we can determine the basic relations (a -->W b, a #W b, and a ||W b) out of the D/F-table. As an example we look at a heuristic rule for the a -->W b relation as presented in the previous section and [61].

IF ((#A-->B ≥ N) AND (#A>B ≥ θ) AND (#B<A ≤ θ)) THEN a -->W b

The first condition (#A-->B ≥ N) uses the noise factor N (default value 0.05). If we expect more noise, we can increase this factor. The first condition calls for a higher positive causality between tasks a and b than the value of the noise factor. The second condition (#A>B ≥ θ) contains a threshold value θ. If we know that the workflow log is totally noise free, then every task-pattern-occurrence is informative. However, to protect our induction process against inferences based on noise, only task-pattern-occurrences above a threshold frequency N are reliable enough for our induction process. To limit the number of parameters, the value θ is automatically calculated using the following equation: θ = 1 + (Round(N × #L) / #T). In this expression N is the noise factor, #L is the number of trace lines in the workflow log, and #T is the number of elements (different tasks). Using these heuristic rules we can build the a -->W b relation and group the results in the so-called relations table (R-table).
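The threshold and the rule can be written out as a small Python sketch. The function and parameter names are hypothetical, and the grouping of the threshold formula follows the equation exactly as printed above. The three D/F-table values are passed in as plain numbers. Note that Python's built-in round rounds exact halves to the nearest even integer, which may differ from the Round used in the original experiments.

```python
def threshold(noise_factor, num_traces, num_tasks):
    """theta = 1 + (Round(N x #L) / #T), grouped as printed in the text."""
    return 1 + round(noise_factor * num_traces) / num_tasks


def is_causal(causality_ab, a_followed_by_b, a_preceded_by_b, noise_factor, theta):
    """Heuristic rule for a -->W b.

    causality_ab    : value of #A-->B
    a_followed_by_b : value of #A>B
    a_preceded_by_b : value of #B<A
    """
    return (
        causality_ab >= noise_factor
        and a_followed_by_b >= theta
        and a_preceded_by_b <= theta
    )


if __name__ == "__main__":
    # Hypothetical log: 1000 traces over 10 different tasks, default N = 0.05
    theta = threshold(0.05, 1000, 10)
    print(theta)                                # 6.0
    print(is_causal(0.9, 40, 2, 0.05, theta))   # True
    print(is_causal(0.9, 40, 30, 0.05, theta))  # False
```

With these hypothetical numbers, Round(0.05 × 1000) = 50 and θ = 1 + 50/10 = 6, so a pair observed 40 times in the order a, b and only twice in the order b, a satisfies the rule.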

6.3. The reconstruction of the WF-net out of the R-table

In step (iii) of our heuristic approaches, we can use the same α algorithm as in the formal approach. The result is a process model (i.e., a Petri net). In a possible extra step, we use the task frequencies to check whether the number of task occurrences is consistent with the resulting Petri net.

To test the approach we use Petri-net-representations of different free-choice workflow models.

All models contain concurrent processes and loops. For each model we generated three random workflow logs with 1000 event sequences: (i) a workflow log without noise, (ii) one with 5% noise, and (iii) one with 10% noise. Below we explain what we mean by noise. To incorporate noise in our workflow logs we define four different types of noise-generating operations: (i) delete the head of an event sequence, (ii) delete the tail of a sequence, (iii) delete a part of the body, and finally (iv) interchange two randomly chosen events. During the deletion operations at least one event and at most one third of the sequence is deleted. The first step in generating a workflow log with 5% noise is to generate a normal random workflow log. The next step is to randomly select 5% of the original event sequences and apply one of the four noise-generating operations described above to each of them (each operation with an equal chance of 1/4). Applying the method presented in this section to the material without noise, we found exact copies of the underlying WF-nets.
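The noise-generation procedure described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the generator used in the experiments: the function names and the seed parameter are hypothetical, every event sequence is assumed to contain at least three events, and the exact position of a body deletion is an assumption because the text does not specify it.

```python
import random


def _deletion_length(seq, rng):
    # At least one event and at most one third of the sequence
    return rng.randint(1, max(1, len(seq) // 3))


def delete_head(seq, rng):
    return seq[_deletion_length(seq, rng):]


def delete_tail(seq, rng):
    return seq[:len(seq) - _deletion_length(seq, rng)]


def delete_body(seq, rng):
    k = _deletion_length(seq, rng)
    # Assumption: the deleted part starts after the first event
    start = rng.randint(1, max(1, len(seq) - k - 1))
    return seq[:start] + seq[start + k:]


def swap_two(seq, rng):
    i, j = rng.sample(range(len(seq)), 2)  # two distinct positions
    noisy = list(seq)
    noisy[i], noisy[j] = noisy[j], noisy[i]
    return noisy


OPERATIONS = [delete_head, delete_tail, delete_body, swap_two]


def add_noise(log, fraction, seed=None):
    """Return a copy of log in which a fraction of the sequences is distorted."""
    rng = random.Random(seed)
    noisy_log = [list(seq) for seq in log]
    chosen = rng.sample(range(len(log)), round(fraction * len(log)))
    for index in chosen:
        operation = rng.choice(OPERATIONS)  # each operation with chance 1/4
        noisy_log[index] = operation(noisy_log[index], rng)
    return noisy_log


if __name__ == "__main__":
    clean = [list("abcdefghi") for _ in range(1000)]
    noisy = add_noise(clean, 0.05, seed=42)
    print(sum(1 for c, n in zip(clean, noisy) if c != n))
```

Passing a fixed seed makes a generated noisy log reproducible, which is convenient when comparing mining results across noise levels.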

If we add 5% noise to the workflow logs, the resulting WF-nets are still perfect. However, if we add 10% noise to the workflow logs, the WF-nets contain errors. All errors are caused by the low threshold value. If we increase the noise factor (N = 0.10), all errors disappear. For more details we refer to [61].

The use of a threshold value is a disadvantage of the first approach. We are working on two possible solutions: (1) the use of machine learning techniques for the automatic induction of an optimal threshold [43], and (2) the formulation of other measurements and rules without thresholds. Some of these heuristics are implemented in the heuristic workflow mining tool Little Thumb. (The tool is named after the fairy tale “Little Thumb”, in which a boy, not taller than a thumb, first leaves small stones to find his way back. The stones refer to mining using complete logs without noise. Then the boy leaves bread crusts that are partially eaten by birds. The latter situation refers to mining with incomplete logs with noise. Another analogy is the observation that the tool uses “rules of thumb” to extract causal relations.) Little Thumb follows the XML-input standard presented in Section 4.

[3] Note that we use a capital letter when referring to the number of occurrences of some task.
