Workflow mining: A survey of issues and approaches (12)

Source: Internet. Published by: cult3d软件下载. Editor: 程序博客网. Time: 2024/05/12 04:40

10. Comparison and open problems

As indicated in Sections 5–9, tools such as EMiT, Little Thumb, InWoLvE, and Process Miner are driven by different problems. In this section, we compare these approaches. To refer to each approach we use the name of the corresponding tool. EMiT [3] was introduced in Section 5 to explore the limits of mining (which class of workflow processes can be rediscovered?). Little Thumb [9,63] was introduced in Section 6 to illustrate how heuristics can be used to tackle the problem of noise. Section 8 presented the concepts on which the tool InWoLvE [26] is based. One of the striking features of this tool is the ability to deal with duplicate tasks. Section 9 introduced the Process Miner [57], which exploits the properties of block-structured workflows through rewriting rules. In the remainder, we will compare the approaches represented by EMiT, Little Thumb, InWoLvE, and Process Miner. Note that we do not include the tool ExperDiTo (described in Section 7) in this comparison because it builds on EMiT and Little Thumb and does not offer alternative mining techniques.

To compare the approaches represented by EMiT, Little Thumb, InWoLvE, and Process Miner, we focus on nine aspects: Structure, Time, Basic parallelism, Non-free choice, Basic loops, Arbitrary loops, Hidden tasks, Duplicate tasks, and Noise. For each of these nine aspects we compare the four tools as indicated in Table 4 and described as follows.

Structure. The first aspect refers to the structure of the target language. Languages such as Petri nets [50] are graph-based while textual languages such as π-calculus [45] are block-oriented. EMiT and Little Thumb are based on Petri nets and therefore graph-oriented. InWoLvE is also graph-based, and Process Miner is the only tool whose target language is block-oriented.

Time. Many logs also record time stamps of events. This information can be used to calculate performance indicators such as waiting/synchronization times, flow times, utilization, etc.
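A minimal sketch of how such indicators might be derived from time stamps is given below. It assumes a log in which every event is a tuple (case id, task, start, end), with times given as plain numbers such as seconds. This layout and the function name are illustrative assumptions, not the log format or API of any of the four tools. Flow time is taken here as the span from the first start to the last end within a case, and waiting time as the gap between the end of one task and the start of the next, which is only meaningful when the tasks of a case do not overlap.

```python
from collections import defaultdict

def performance_indicators(events):
    """Compute per-case flow times and per-task waiting times.

    events: iterable of (case_id, task, start, end); the times can be any
    values that support subtraction (hypothetical layout, see lead-in).
    """
    by_case = defaultdict(list)
    for case_id, task, start, end in events:
        by_case[case_id].append((start, end, task))

    flow_times = {}
    waiting_times = defaultdict(list)
    for case_id, evs in by_case.items():
        evs.sort()  # order the events of a case by start time
        first_start = min(e[0] for e in evs)
        last_end = max(e[1] for e in evs)
        flow_times[case_id] = last_end - first_start
        # Waiting time of a task: gap since the previous task ended.
        for (_, prev_end, _), (start, _, task) in zip(evs, evs[1:]):
            waiting_times[task].append(start - prev_end)
    return flow_times, dict(waiting_times)
```

Utilization is left out because it would also need resource information, which this simplified event layout does not carry.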

Basic parallelism. All the tools are able to detect and handle parallelism. Simple processes where each AND-split corresponds to an AND-join can be mined by EMiT, Little Thumb, InWoLvE, and Process Miner. However, each of the four tools imposes requirements on the process in order to correctly extract the right model.
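One common way to detect basic parallelism from a log is through ordering relations over direct succession: two tasks are treated as parallel if each is observed directly after the other, and as causally related if only one order is observed. The sketch below illustrates this idea under the assumption that a log is a list of traces, each a list of task names. The function name is hypothetical, and the sketch does not claim to reproduce how any of the four tools works internally.

```python
def ordering_relations(traces):
    """Derive direct-succession, causal, and parallel relations from traces."""
    succ = set()
    for trace in traces:
        # (a, b) is in succ if b directly follows a in some trace.
        succ.update(zip(trace, trace[1:]))

    # Causal: a is followed by b, but b is never followed by a.
    causal = {(a, b) for (a, b) in succ if (b, a) not in succ}
    # Parallel: both orders are observed for two distinct tasks.
    parallel = {(a, b) for (a, b) in succ if (b, a) in succ and a != b}
    return succ, causal, parallel
```

For the two traces A B C D and A C B D, tasks B and C come out as parallel. This also shows the kind of requirement such an approach places on the log: both interleavings must actually have been observed, otherwise the pair is classified as causal.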

Non-free choice. The Non-free choice (NFC) construct was mentioned in Section 5 as an example of a workflow pattern that is difficult to mine. NFC processes mix synchronization and choice in one construct as described in [18]. None of the four tools can deal with such constructs. Nevertheless, they are highly relevant as indicated in [6,38].

Basic loops. Each of the four tools can deal with loops. However, just like with parallelism, each of the tools imposes restrictions on the structure of these loops in order to guarantee the correctness of the discovered model.

Arbitrary loops. None of the tools supports arbitrary loops. For example, the tool Process Miner can only handle loops with a clear block structure. Note that not every loop can be modeled like this (cf. [38]). EMiT and Little Thumb initially had problems with loops of length 1 or 2. These problems have been (partially) solved by a preprocessing step. Note that to detect “short loops” more observations are required.
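To illustrate why short loops need extra observations, note that a loop of length 2 such as A B A B produces both "A directly followed by B" and "B directly followed by A", which a method based only on direct succession cannot tell apart from parallelism. The sketch below scans traces for the patterns "a a" and "a b a" as candidates for loops of length 1 and 2. It is an illustrative assumption about what such a check could look like, not the preprocessing step actually used by EMiT or Little Thumb, which is not described here.

```python
def short_loop_candidates(traces):
    """Return tasks that repeat directly (a a) and pairs seen as a b a."""
    length_one = set()
    length_two = set()
    for trace in traces:
        # Length-1 loop candidate: a task directly followed by itself.
        for a, b in zip(trace, trace[1:]):
            if a == b:
                length_one.add(a)
        # Length-2 loop candidate: the pattern a b a with a != b.
        for a, b, c in zip(trace, trace[1:], trace[2:]):
            if a == c and a != b:
                length_two.add((a, b))
    return length_one, length_two
```

A pair reported in `length_two` is only a candidate: the same pattern can also arise from other constructs, so a real miner would need further evidence before concluding that a loop exists.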

Hidden tasks. Occurrences of specific tasks may not be recorded in the log. This is a fundamental problem since without this information processes are incomplete. Despite the fact that it will never be possible to detect task occurrences that are not recorded, there could be facilities to indicate the presence of a so-called “hidden task”. Suppose that a workflow language has special control tasks to model AND-splits and AND-joins. Even if these control tasks are not logged, one could still deduce their presence. None of the four tools supports the detection of hidden tasks in a structured manner.

Duplicate tasks. EMiT, Little Thumb and Process Miner assume that each task appears only once in the workflow, i.e., the same task cannot be used in two different parts of the process. (Note that this does not refer to loops. In a loop, the same part of the process is repeatedly executed.) InWoLvE is the only tool dealing with this issue.

Noise. The term noise is used to refer to the situation where the log is incomplete or contains errors. A similar situation occurs if a rare sequence of events takes place which is not representative of the typical flow of work (i.e., an exception). In both cases the resulting model can be incorrect (i.e., not representing the typical flow of work). EMiT and Process Miner do not offer features for dealing with noise. Little Thumb is able to deal with noise by using a set of heuristics which can be fine-tuned to tackle specific types of noise. InWoLvE uses a stochastic model which allows common behavior to be distinguished from rare behavior.
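As a rough illustration of a frequency-based heuristic, the sketch below counts how often one task directly follows another and scores each pair by how strongly one order dominates the other. A pair is accepted only if its score reaches a tunable threshold, so a few erroneous or exceptional traces do not change the result. The scoring formula, the threshold value, and the function names are assumptions chosen for illustration and are not necessarily the heuristics Little Thumb implements.

```python
from collections import Counter

def dependency_scores(traces):
    """Score each observed pair (a, b) in the range (-1, 1)."""
    counts = Counter()
    for trace in traces:
        counts.update(zip(trace, trace[1:]))

    scores = {}
    for (a, b), ab in counts.items():
        ba = counts[(b, a)]  # a Counter returns 0 for unseen pairs
        # Close to 1 when a->b is frequent and b->a is rare or absent.
        scores[(a, b)] = (ab - ba) / (ab + ba + 1)
    return scores

def reliable_dependencies(traces, threshold=0.8):
    """Keep only pairs whose score reaches the (tunable) threshold."""
    scores = dependency_scores(traces)
    return {pair for pair, score in scores.items() if score >= threshold}
```

With 19 traces A B and a single noisy trace B A, the pair (A, B) scores 18/21 and is kept, while (B, A) gets a negative score and is dropped. Raising or lowering `threshold` corresponds to the kind of fine-tuning mentioned above.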

Table 4 shows that there are still a number of open problems. Few of the tools exploit timing information. Although EMiT extracts information on waiting times, flow times, and utilization from the log, time stamps are not used to improve the mining result. Similarly, other pieces of information like the data objects being changed or the identity of the person executing a task are not exploited by the existing approaches. Existing approaches can deal with basic routing constructs such as basic parallelism and basic loops. However, these approaches fail when facing advanced routing constructs involving non-free choice constructs, hidden tasks, or duplicate tasks. Last but not least, there is the problem of noise. Although tools such as Little Thumb and InWoLvE can deal with some noise, empirical research is needed to evaluate and improve the heuristics being used.

The comparison shown in Table 4 can be used to position the various approaches. However, to truly compare the results there should be a number of benchmark examples. Section 7 discussed a number of small experiments that can be used for this purpose. A real benchmark, however, needs larger and more realistic examples. Clearly, the XML format presented in Section 4 can be used to store these benchmark examples.