Workflow mining: A survey of issues and approaches (8)

Source: Internet | Published by: 分类信息系统 php源码 | Editor: 程序博客网 | Time: 2024/04/28 10:49

6. How to deal with noise and incomplete logs: heuristic approaches

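The formal approach described in the preceding section assumes ideal information: (i) the log must be complete (i.e., if one task can directly follow another, the log should contain an example of that behavior), and (ii) the log contains no noise (i.e., everything recorded in the log is correct). In practice, logs are rarely complete or noise free, so it becomes harder to decide whether one of the three basic relations (a →_W b, a #_W b, or a ||_W b) holds between two events a and b. For example, the causal relation a →_W b holds if and only if the log contains a trace in which a is directly followed by b (i.e., a >_W b) and no trace in which b is directly followed by a (i.e., not b >_W a). With noise, a single erroneous trace can completely derail an otherwise correct conclusion. We therefore try to develop heuristic mining techniques that are less sensitive to noise and to incomplete logs. We also try to overcome some other limitations of the α algorithm (e.g., certain kinds of loops and non-free-choice constructs).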

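Our heuristic approach [43,61,62] distinguishes three mining steps: step (i), the construction of a dependency/frequency table (D/F-table); step (ii), the mining of the basic relations from the D/F-table (the R-table); and step (iii), the reconstruction of the WF-net from the R-table.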

6.1 Construction of the D/F-table

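Constructing a D/F-table is the starting point of our workflow mining technique. For each task a, the following information is extracted from the workflow log: (i) the overall frequency of task a (notation #A; see note [3]), (ii) the frequency of task a directly preceded by task b (notation #B<A), (iii) the frequency of task a directly followed by task b (notation #A>B), (iv) the frequency of a directly or indirectly preceded by task b but before the previous appearance of b (notation #B<<<A), (v) the frequency of a directly or indirectly followed by task b but before the next appearance of a (notation #A>>>B), and (vi) a metric that indicates the strength of the causal relation between task a and task b (notation #A→B).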

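Metrics (i) through (v) seem clear without further explanation. The intuition behind metric (vi) is as follows. If task b always occurs shortly after task a occurs, it is plausible that a causes b. Conversely, if b occurs (shortly) before a, it is implausible that a is the cause of b. We formalize this as follows. If, in an event stream, task a occurs before task b and n is the number of events between them, the #A→B causality counter is incremented by δⁿ (δ is the causality fall factor, δ ∈ [0.0, 1.0]). In our experiments δ = 0.8. The contribution to the causality metric is at most 1: if task b appears directly after task a, then n = 0 and δⁿ = 1, and δⁿ decreases as the distance between a and b increases. Looking forward from task a to an occurrence of task b stops after the first occurrence of task a or task b. If task b occurs before task a and n is again the number of events between them, the #A→B causality counter is decreased by δⁿ.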

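After the whole workflow log has been processed, the #A→B causality counter is divided by the minimum of the overall frequencies of tasks a and b (i.e., min(#A,#B)). Note that the value of #A→B can be relatively high even when the log contains no trace in which a is directly followed by b (i.e., the log is incomplete).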

6.2 The basic relations table (R-table) from the D/F-table

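Using relatively simple heuristics, we can determine the basic relations (a →_W b, a #_W b, and a ||_W b) from the D/F-table. As an example, we look at a heuristic rule for the a →_W b relation, as presented in the previous section and in [61].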

IF ((#A→B ≥ N) AND (#A>B ≥ θ) AND (#B<A ≤ θ)) THEN a →_W b

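The first condition (#A→B ≥ N) uses the noise factor N (default value 0.05). If we expect more noise, we can increase this factor. The first condition requires a positive causality between a and b that is higher than the noise factor. The second condition (#A>B ≥ θ) contains a threshold value θ. If we know that the workflow log is completely noise free, then every task-pattern occurrence is informative. However, to protect the induction process against inferences based on noise, only task-pattern occurrences above a threshold frequency N are reliable enough. To limit the number of parameters, the value of θ is calculated automatically with the equation θ = 1 + (Round(N × #L) / #T). In this expression, N is the noise factor, #L is the number of trace lines in the workflow log, and #T is the number of elements (different tasks). Using these heuristic rules, we can build the a →_W b relation and group the results in the so-called relations table (R-table).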

6.3 Reconstruction of the WF-net from the R-table

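In step (iii) of our heuristic approach, we can use the same α algorithm as in the formal approach. The result is a process model (i.e., a Petri net). In a possible extra step, we use the task frequencies to check whether the number of task occurrences is consistent with the resulting Petri net.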

To test the approach, we use Petri-net representations of different free-choice workflow models.

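All models contain concurrent processes and loops. For each model we generated three random workflow logs with 1000 event sequences: (i) a log without noise, (ii) a log with 5% noise, and (iii) a log with 10% noise. Below we explain what we mean by noise. To incorporate noise in the workflow logs, we define four types of noise-generating operations: (i) delete the head of an event sequence, (ii) delete the tail of a sequence, (iii) delete a part of the body, and (iv) interchange two randomly chosen events. A deletion removes at least one event and at most one third of the sequence. The first step in generating a workflow log with 5% noise is to generate a normal random workflow log. The next step is to randomly select 5% of the original event sequences and apply one of the four noise-generating operations to each of them (each operation has an equal chance of 1/4). Applying the method presented in this section to the material without noise, we found exact copies of the underlying WF-nets.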

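If we add 5% noise to the workflow logs, the resulting WF-nets are still perfect (free of errors). However, if we add 10% noise, the WF-nets contain errors. All errors are caused by the low threshold value. If we increase the noise factor to a higher value (N = 0.10), all errors disappear. For more details we refer to [61].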

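The use of a threshold value is a disadvantage of the first approach. We are working on two possible solutions: (1) the use of machine learning techniques to induce an optimal threshold automatically [43], and (2) the formulation of other measurements and rules that do not use thresholds. Some of these heuristics are implemented in the heuristic workflow mining tool Little Thumb. (The tool is named after the fairy tale "Little Thumb", in which a boy no taller than a thumb first leaves small stones to find his way back, and later leaves bread crusts that are partially eaten by birds. The stones refer to mining with complete logs without noise; the latter situation refers to mining with incomplete logs with noise. Another analogy is that the tool uses "rules of thumb" to extract causal relations.) Little Thumb follows the XML input standard presented in Section 4.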

[3] Note: we use a capital letter when referring to the number of occurrences of a task.

6. How to deal with noise and incomplete logs: Heuristic approaches

The formal approach presented in the preceding section presupposes perfect information: (i) the log must be complete (i.e., if a task can follow another task directly, the log should contain an example of this behavior) and (ii) we assume that there is no noise in the log (i.e., everything that is registered in the log is correct). However, in practical situations logs are rarely complete and/or noise free. Therefore, it becomes more difficult to decide whether one of the three basic relations (i.e., a →_W b, a #_W b, and a ||_W b) holds between two events a and b. For instance, the causality relation (a →_W b) between two tasks a and b holds if and only if the log contains a trace in which a is directly followed by b (i.e., the relation a >_W b holds) and there is no trace in which b is directly followed by a (i.e., not b >_W a). However, in a noisy situation one erroneous example can completely mess up the derivation of a right conclusion. For this reason we try to develop heuristic mining techniques which are less sensitive to noise and to the incompleteness of logs. Moreover, we try to overcome some other limitations of the α algorithm (e.g., certain kinds of loops and non-free-choice constructs).
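The sensitivity to a single erroneous trace can be illustrated with a minimal Python sketch. It assumes the standard α-algorithm definitions of the basic relations (a ||_W b when both a >_W b and b >_W a hold, a #_W b when neither holds); the function name `basic_relations` and the toy logs are hypothetical and serve only as an illustration.

```python
from itertools import product

def basic_relations(log):
    """Derive the basic relations from a log given as a list of traces.

    Assumes the usual alpha-algorithm definitions; this is an illustrative
    sketch, not the authors' implementation.
    """
    tasks = sorted({t for trace in log for t in trace})
    # a >_W b: a is directly followed by b in at least one trace
    follows = {(x, y) for trace in log for x, y in zip(trace, trace[1:])}
    relations = {}
    for a, b in product(tasks, repeat=2):
        ab, ba = (a, b) in follows, (b, a) in follows
        if ab and not ba:
            relations[(a, b)] = "->"   # causal: a ->_W b
        elif ab and ba:
            relations[(a, b)] = "||"   # parallel
        elif not ab and not ba:
            relations[(a, b)] = "#"    # unrelated
        else:
            relations[(a, b)] = "<-"   # inverse causal: b ->_W a
    return relations

# Hypothetical logs: 100 clean traces, then the same log plus one noisy trace.
clean_log = [list("abcd"), list("acbd")] * 50
noisy_log = clean_log + [list("bacd")]

print(basic_relations(clean_log)[("a", "b")])  # ->
print(basic_relations(noisy_log)[("a", "b")])  # ||
```

One swapped pair of events in a single trace out of 101 is enough to turn the causal relation between a and b into a parallel one, which is the fragility the heuristic approach tries to avoid.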

In our heuristic approaches [43,61,62] we distinguish three mining steps: Step (i) the construction of a dependency/frequency table (D/F-table), Step (ii) the mining of the basic relations out of the D/F-table (the mining of the R-table), and Step (iii) the reconstruction of the WF-net out of the R-table.

6.1. Construction of the dependency/frequency table

The starting point in our workflow mining techniques is the construction of a D/F-table. For each task a the following information is abstracted out of the workflow log: (i) the overall frequency of task a (notation #A; see note [3]), (ii) the frequency of task a directly preceded by task b (notation #B<A), (iii) the frequency of a directly followed by task b (notation #A>B), (iv) the frequency of a directly or indirectly preceded by task b but before the previous appearance of b (notation #B<<<A), (v) the frequency of a directly or indirectly followed by task b but before the next appearance of a (notation #A>>>B), and finally (vi) a metric that indicates the strength of the causal relation between task a and another task b (notation #A→B).
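As a rough illustration, the direct counts (i) through (iii) can be gathered in one pass over the log. The sketch below is only one possible reading: the names `direct_counts` and `preceded_by` and the toy log are hypothetical, and the indirect counts (iv) and (v) are left out.

```python
from collections import Counter

def direct_counts(log):
    """Return (freq, followed) for a log given as a list of traces.

    freq[a]          corresponds to #A   (overall frequency of task a)
    followed[(a, b)] corresponds to #A>B (a directly followed by b)
    """
    freq = Counter()
    followed = Counter()
    for trace in log:
        freq.update(trace)
        followed.update(zip(trace, trace[1:]))
    return freq, followed

def preceded_by(followed, a, b):
    """#B<A: how often task a is directly preceded by task b."""
    return followed[(b, a)]

# Hypothetical three-trace log.
log = [["a", "b", "c"], ["a", "c", "b"], ["a", "b", "c"]]
freq, followed = direct_counts(log)

print(freq["a"])                        # 3
print(followed[("a", "b")])             # 2
print(preceded_by(followed, "a", "b"))  # 0
print(preceded_by(followed, "c", "b"))  # 2
```

Storing only the direct-succession pairs is a design choice of this sketch: #B<A for task a is the same number as the count of the pair (b, a), so a single counter serves both columns.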

Metrics (i) through (v) seem clear without extra explanation. The underlying intuition of metric (vi) is as follows. If it is always the case that, shortly after task a occurs, task b also occurs, then it is plausible that task a causes the occurrence of task b. On the other hand, if task b occurs (shortly) before task a, it is implausible that task a is the cause of task b. We formalize this intuition as follows. If, in an event stream, task a occurs before task b and n is the number of intermediary events between them, the #A→B causality counter is incremented by δⁿ (δ is a causality fall factor, δ ∈ [0.0, 1.0]). In our experiments δ is set to 0.8. The effect is that the contribution to the causality metric is at most 1 (if task b appears directly after task a, then n = 0 and δⁿ = 1) and decreases as the distance increases. The process of looking forward from task a to the occurrence of task b stops after the first occurrence of task a or task b. If task b occurs before task a and n is again the number of intermediary events between them, the #A→B causality counter is decreased by δⁿ.

After processing the whole workflow log, the #A→B causality counter is divided by the minimum overall frequency of tasks a and b (i.e., min(#A,#B)). Note that the value of #A→B can be relatively high even when there is no trace in the log in which a is directly followed by b (i.e., the log is not complete).
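The causality metric (vi) can be sketched in Python as follows. This is one plausible reading of the description, not the authors' code: the function name `causality_metric` is hypothetical, the forward scan from an occurrence of a stops at the next a, only the first later occurrence of each other task contributes, and pairs of a task with itself are skipped (their increment and decrement would cancel anyway).

```python
from collections import Counter, defaultdict

def causality_metric(log, delta=0.8):
    """Approximate the normalized causality value for every task pair.

    log is a list of traces; delta is the causality fall factor.
    """
    freq = Counter(t for trace in log for t in trace)
    score = defaultdict(float)
    for trace in log:
        for i, a in enumerate(trace):
            seen = set()
            for j in range(i + 1, len(trace)):
                b = trace[j]
                if b == a:       # stop at the next occurrence of a
                    break
                if b in seen:    # only the first occurrence of b counts
                    continue
                seen.add(b)
                weight = delta ** (j - i - 1)  # n = number of events in between
                score[(a, b)] += weight        # a before b supports a -> b
                score[(b, a)] -= weight        # and counts against b -> a
    # Normalize by the smaller of the two overall task frequencies.
    return {(a, b): s / min(freq[a], freq[b]) for (a, b), s in score.items()}

# Hypothetical log in which a is never directly followed by b.
log = [["a", "x", "b"], ["a", "y", "b"]] * 5
metric = causality_metric(log)

print(round(metric[("a", "b")], 3))  # 0.8
print(round(metric[("b", "a")], 3))  # -0.8
print(round(metric[("a", "x")], 3))  # 1.0
```

In this toy log the value for the pair (a, b) is 0.8 although a is never directly followed by b, which matches the remark that the metric can be relatively high for an incomplete log.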

6.2. The basic relations table (R-table) out of the D/F-table

Using relatively simple heuristics, we can determine the basic relations (a →_W b, a #_W b, and a ||_W b) out of the D/F-table. As an example we look at a heuristic rule for the a →_W b relation as presented in the previous section and [61].

IF ((#A→B ≥ N) AND (#A>B ≥ θ) AND (#B<A ≤ θ)) THEN a →_W b

The first condition (#A→B ≥ N) uses the noise factor N (default value 0.05). If we expect more noise, we can increase this factor. The first condition calls for a higher positive causality between tasks a and b than the value of the noise factor. The second condition (#A>B ≥ θ) contains a threshold value θ. If we know that we have a workflow log that is totally noise free, then every task-pattern occurrence is informative. However, to protect our induction process against inferences based on noise, only task-pattern occurrences above a threshold frequency N are reliable enough for our induction process. To limit the number of parameters, the value θ is automatically calculated using the following equation: θ = 1 + (Round(N × #L) / #T). In this expression N is the noise factor, #L is the number of trace lines in the workflow log, and #T is the number of elements (different tasks). Using these heuristic rules we can build the a →_W b relation and group the results in the so-called relations table (R-table).
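The rule and the threshold equation translate directly into a few lines of Python. The sketch below keeps the parenthesization of the equation exactly as printed above; the function names and the sample numbers (in particular #T = 10) are hypothetical and not taken from the experiments.

```python
def threshold(noise, n_traces, n_tasks):
    """theta = 1 + (Round(N x #L) / #T), parenthesized as in the text.

    Note: Python's round() rounds halves to the nearest even integer,
    which may differ from the Round function the authors used.
    """
    return 1 + round(noise * n_traces) / n_tasks

def is_causal(causality_ab, a_followed_by_b, a_preceded_by_b, noise, theta):
    """Heuristic rule for a ->_W b.

    causality_ab     corresponds to #A->B
    a_followed_by_b  corresponds to #A>B
    a_preceded_by_b  corresponds to #B<A
    """
    return (causality_ab >= noise
            and a_followed_by_b >= theta
            and a_preceded_by_b <= theta)

# Hypothetical setting: N = 0.05, 1000 traces, 10 different tasks.
theta = threshold(0.05, 1000, 10)
print(theta)                                 # 6.0
print(is_causal(0.8, 40, 2, 0.05, theta))    # True
print(is_causal(0.8, 40, 9, 0.05, theta))    # False
```

With these numbers, a pair that is seen 40 times in the order a, b and only twice in the order b, a is accepted as causal, while nine occurrences of the reverse order exceed θ and the rule no longer fires.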

6.3. The reconstruction of the WF-net out of the R-table

In step (iii) of our heuristic approaches, we can use the same α algorithm as in the formal approach. The result is a process model (i.e., a Petri net). In a possible extra step, we use the task frequency to check if the number of task occurrences is consistent with the resulting Petri net.

To test the approach we use Petri-net-representations of different free-choice workflow models.

All models contain concurrent processes and loops. For each model we generated three random workflow logs with 1000 event sequences: (i) a workflow log without noise, (ii) one with 5% noise, and (iii) a log with 10% noise. Below we explain what we mean by noise. To incorporate noise in our workflow logs we define four different types of noise-generating operations: (i) delete the head of an event sequence, (ii) delete the tail of a sequence, (iii) delete a part of the body, and finally (iv) interchange two randomly chosen events. During the deletion operations at least one event and at most one third of the sequence is deleted. The first step in generating a workflow log with 5% noise is to generate a normal random workflow log. The next step is to randomly select 5% of the original event sequences and apply one of the four noise-generating operations described above to each of them (each operation with an equal chance of 1/4). Applying the method presented in this section to the material without noise, we found exact copies of the underlying WF-nets.
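The noise procedure can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the generator used in the experiments: the names `add_noise` and `make_noisy_log` are hypothetical, a body deletion is assumed to leave the first and last event in place, and very short traces may come back unchanged.

```python
import random

def add_noise(trace, rng):
    """Apply one of the four noise operations, each with probability 1/4."""
    trace = list(trace)
    n = len(trace)
    op = rng.choice(["head", "tail", "body", "swap"])
    if op == "swap":
        if n >= 2:
            i, j = rng.sample(range(n), 2)  # two distinct positions
            trace[i], trace[j] = trace[j], trace[i]
        return trace
    # Delete at least one event and at most one third of the sequence.
    k = rng.randint(1, max(1, n // 3))
    if op == "head":
        return trace[k:]
    if op == "tail":
        return trace[:n - k]
    # Body: remove k consecutive events (assumption: keep first and last).
    start = rng.randint(1, max(1, n - k - 1))
    return trace[:start] + trace[start + k:]

def make_noisy_log(log, fraction, seed=0):
    """Return a copy of log with noise applied to a random fraction of traces."""
    rng = random.Random(seed)
    noisy = [list(t) for t in log]
    chosen = rng.sample(range(len(noisy)), round(fraction * len(noisy)))
    for idx in chosen:
        noisy[idx] = add_noise(noisy[idx], rng)
    return noisy

# Hypothetical log of 1000 identical nine-event sequences.
log = [list("abcdefghi") for _ in range(1000)]
noisy = make_noisy_log(log, 0.05)
changed = sum(1 for old, new in zip(log, noisy) if old != new)
print(changed)  # 50
```

Seeding a local `random.Random` instance keeps the generated logs reproducible, which makes it easier to compare mining results across noise levels.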

If we add 5% noise to the workflow logs, the resulting WF-nets are still perfect. However, if we add 10% noise to the workflow logs, the WF-nets contain errors. All errors are caused by the low threshold value. If we increase the noise factor to a higher value (N = 0.10), all errors disappear. For more details we refer to [61].

The use of a threshold value is a disadvantage of the first approach. We are working on two possible solutions: (1) the use of machine learning techniques for automatic induction of an optimal threshold [43], and (2) the formulation of other measurements and rules without thresholds. Some of these heuristics are implemented in the heuristic workflow mining tool Little Thumb. (The tool is named after the fairy tale "Little Thumb", where a boy, not taller than a thumb, first leaves small stones to find his way back. The stones refer to mining using complete logs without noise. Then the boy leaves bread crusts that are partially eaten by birds. The latter situation refers to mining with incomplete logs with noise. Another analogy is the observation that the tool uses "rules of thumb" to extract causal relations.) Little Thumb follows the XML input standard presented in Section 4.

[3] Note that we use a capital letter when referring to the number of occurrences of some task.