Introduction to the PrefixSpan Algorithm for Sequential Pattern Mining

Source: Internet  Editor: 程序博客网  Time: 2024/06/04 19:49

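Sequences

A sequence is an ordered list of itemsets. The elements of a pattern need not be directly adjacent in the data, as long as their relative order is preserved. Each element of a sequence can itself be an itemset, for example a set of pages within a page-visit sequence. Because it takes order into account, sequential pattern mining can reveal knowledge that association mining, which ignores order, cannot.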


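Sequential patterns

Sequential pattern mining looks for frequent sequences. Typical applications are still limited to discrete-valued sequences. It captures the happens-after relationship, not just consecutive subsequences.

It can be used for purchase behavior prediction, fraud detection, failure prediction, predicting Web users' page visits, discovering regularities in human behavior, and so on.

The main algorithms are the Apriori-like AprioriAll, AprioriSome and GSP (Generalized Sequential Patterns), the vertical-format SPADE (Sequential PAttern Discovery using Equivalence classes), and the pattern-growth PrefixSpan.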



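Difference from time series

Sequential pattern mining differs from time series mining. A time series (also called a dynamic series) is the sequence of values of one statistical indicator arranged in the order of the times at which they occurred. The main goal of time series analysis is to forecast the future from existing historical data. Common models include MA, AR, ARMA and GARCH.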


Example

<a(abc)(ac)d(cf)> - 9 items, 5 itemsets, 1 sequence

<a(abc)(ac)d(cf)> = <a(cba)(ac)d(cf)> (the order of items inside an itemset does not matter)

<a(abc)(ac)d(cf)> ≠ <a(ac)(abc)d(cf)> (the order of the itemsets does matter)

Min support (minimum support threshold): find all the frequent subsequences, i.e. the subsequences whose occurrence frequency in the set of sequences is no less than min_support.

Supersequence: <a(abc)(ac)d(cf)>

Subsequence: <aa(ac)d(c)>

Subsequence: <(ac)(ac)d(cf)>

Subsequence: <ac>

 

Sequence database S (four sequences):

<a(abc)(ac)d(cf)>
<(ad)c(bc)(ae)>
<(ef)(ab)(df)cb>
<eg(af)cbc>

Support of three example patterns in S:

α1 = <a>      support(α1) = 4
α2 = <ac>     support(α2) = 4
α3 = <(ab)c>  support(α3) = 2
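
The support counts in this example can be checked with a few lines of Python. This is a minimal sketch and not the code that produced the output later in this article. The helper names parse, is_subsequence and support are made up for illustration, items are assumed to be single characters, and each itemset is stored as a frozenset so that the order of items inside it is ignored.

```python
import re

def parse(text):
    # Assumption: single-character items; parentheses group an itemset.
    return [frozenset(a or b) for a, b in re.findall(r"\((\w+)\)|(\w)", text)]

def is_subsequence(pattern, sequence):
    """True if the itemsets of pattern fit, in order, into distinct itemsets of sequence."""
    pos = 0
    for itemset in pattern:
        # Greedily take the earliest itemset of sequence that contains this one.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, database):
    """Number of sequences in database that contain pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)

db = [parse(s) for s in ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]
print(support(parse("a"), db))      # 4
print(support(parse("ac"), db))     # 4
print(support(parse("(ab)c"), db))  # 2
```

A pattern is frequent when its support is at least min_support, so with min_support = 2 all three patterns qualify.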


=================

PrefixSpan

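Prefix

<a(abc)a> is a prefix of <a(abc)(ac)d(cf)>, but <a(abc)c> is NOT.

<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of the sequence <a(abc)(ac)d(cf)>, whereas <ab> and <a(bc)> are not. Items inside an itemset are taken in alphabetical order, so a prefix may cut an itemset only after its leading items.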
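
The prefix rule can be written down as a small check. The sketch below is only an illustration: parse and is_prefix are hypothetical helper names, items are assumed to be single characters, and "alphabetical order" is taken to be plain string comparison.

```python
import re

def parse(text):
    # Assumption: single-character items; parentheses group an itemset.
    return [frozenset(a or b) for a, b in re.findall(r"\((\w+)\)|(\w)", text)]

def is_prefix(beta, alpha):
    """Check whether beta is a prefix of alpha (items within an itemset sorted alphabetically)."""
    m = len(beta)
    if not 0 < m <= len(alpha):
        return False
    if beta[:m - 1] != alpha[:m - 1]:   # all itemsets but the last must be identical
        return False
    last, full = beta[-1], alpha[m - 1]
    if not last <= full:                # the last itemset must be contained in alpha's m-th one
        return False
    # Every item left over in alpha's m-th itemset must come after the items of beta's last one.
    return all(x > max(last) for x in full - last)

alpha = parse("a(abc)(ac)d(cf)")
for text in ["a", "aa", "a(ab)", "a(abc)", "a(abc)a", "ab", "a(bc)", "a(abc)c"]:
    print(text, is_prefix(parse(text), alpha))
```

The first five candidates print True and the last three print False, matching the examples above.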


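Postfix (suffix)

Sequence β = <a(abc)a> is a prefix and sequence γ = <(_c)d(cf)> is a postfix of sequence α = <a(abc)(ac)d(cf)>. This is written α = β⋅γ, or γ = α/β.


For the sequence <a(abc)(ac)d(cf)>:

<(abc)(ac)d(cf)> is the postfix with respect to prefix <a>;

<(_bc)(ac)d(cf)> is the postfix with respect to prefix <aa>;

<(_c)(ac)d(cf)> is the postfix with respect to prefix <a(ab)>.

The underscore "_" is a placeholder for the part of an itemset already consumed by the prefix: (_bc) means that b and c share an itemset with the last item of the prefix.


Projection

The projection, or projected database, of a prefix α is the set of postfixes with respect to α of all sequences in the sequence database S that contain α.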
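
Postfix and projection can be sketched together in Python. This is an illustrative sketch under a few assumptions: items are single characters, postfix expects its second argument to be a valid prefix of the first, and project only handles a single-item prefix. The names parse, show, postfix and project are hypothetical helpers. The output uses the same spacing as the program output at the end of this article.

```python
import re

def parse(text):
    # Assumption: single-character items; parentheses group an itemset.
    return [frozenset(a or b) for a, b in re.findall(r"\((\w+)\)|(\w)", text)]

def show(elements):
    return " ".join(e if len(e) == 1 else "(" + e + ")" for e in elements)

def postfix(alpha, beta):
    """Postfix of alpha with respect to beta, assuming beta really is a prefix of alpha."""
    m = len(beta)
    rest = sorted(alpha[m - 1] - beta[-1])          # items left in the itemset where beta ends
    head = ["_" + "".join(rest)] if rest else []    # "_" marks the consumed part
    return head + ["".join(sorted(e)) for e in alpha[m:]]

def project(database, item):
    """Projected database for the single-item prefix <item>."""
    projected = []
    for seq in database:
        for i, itemset in enumerate(seq):
            if item in itemset:                     # first occurrence of the item
                consumed = frozenset(x for x in itemset if x <= item)
                projected.append(postfix(seq[i:], [consumed]))
                break                               # sequences without the item are dropped
    return projected

alpha = parse("a(abc)(ac)d(cf)")
print(show(postfix(alpha, parse("aa"))))     # (_bc) (ac) d (cf)
print(show(postfix(alpha, parse("a(ab)"))))  # (_c) (ac) d (cf)

db = [parse(s) for s in ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]
for row in project(db, "a"):
    print(show(row))
```

For prefix <a> the loop prints the four postfixes (abc) (ac) d (cf), (_d) c (bc) (ae), (_b) (df) c b and (_f) c b c.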


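Algorithm

Subroutine: PrefixSpan(α, l, S|α)

Parameters:

α is a sequential pattern used as the prefix;

l is the length of α;

S|α is the α-projected database.

Method:

1. Scan S|α once and find every frequent item b such that either:
  a) b can be appended to α as a new last element (e.g. ab + c => abc), or
  b) b can be merged into the last element of α (e.g. ab + _c => a(bc));

2. For each frequent item b, extend α with it to form a new sequential pattern α’ (e.g. abc or a(bc)) and output α’;

3. For each α’, construct the α’-projected database S|α’ and call PrefixSpan(α’, l+1, S|α’).

The procedure is a depth-first search.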
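
The three steps can be sketched in Python as follows. This is a compact illustration under stated assumptions, not the implementation that produced the output later in this article. Items are single characters, and the names parse, fmt, grow and mine are hypothetical. Instead of copying postfixes, the sketch keeps for every sequence a pointer to the itemset where the last element of the prefix was matched, which is essentially the pseudo-projection idea.

```python
import re
from collections import defaultdict

def parse(text):
    # Assumption: single-character items; parentheses group an itemset.
    return [frozenset(a or b) for a, b in re.findall(r"\((\w+)\)|(\w)", text)]

def fmt(pattern):
    parts = ["".join(sorted(e)) for e in pattern]
    return "".join(p if len(p) == 1 else "(" + p + ")" for p in parts)

def prefixspan(database, min_support):
    """Return {pattern: support} for every frequent sequential pattern."""
    results = {}

    def grow(pattern, pointers, needed, offset):
        # Step 3: "project" by moving each pointer to the first itemset,
        # at or after j + offset, that contains `needed`.
        moved = []
        for sid, j in pointers:
            seq = database[sid]
            k = next((k for k in range(j + offset, len(seq)) if needed <= seq[k]), None)
            if k is not None:
                moved.append((sid, k))
        results[fmt(pattern)] = len(moved)
        mine(pattern, moved)

    def mine(pattern, pointers):
        last = pattern[-1] if pattern else frozenset()
        seq_ext, item_ext = defaultdict(set), defaultdict(set)
        for sid, j in pointers:                    # step 1: one scan of the projection
            seq = database[sid]
            for k in range(j + 1, len(seq)):       # a) b starts a new last element
                for x in seq[k]:
                    seq_ext[x].add(sid)
            for k in range(max(j, 0), len(seq)):   # b) b joins the last element
                if last and last <= seq[k]:
                    for x in seq[k]:
                        if x > max(last):
                            item_ext[x].add(sid)
        for x in sorted(seq_ext):                  # step 2: extend the prefix, then recurse
            if len(seq_ext[x]) >= min_support:
                grow(pattern + [frozenset([x])], pointers, frozenset([x]), 1)
        for x in sorted(item_ext):
            if len(item_ext[x]) >= min_support:
                grow(pattern[:-1] + [last | {x}], pointers, last | {x}, 0)

    mine([], [(sid, -1) for sid in range(len(database))])
    return results

db = [parse(s) for s in ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]
patterns = prefixspan(db, 2)
print(len(patterns))      # 53
print(patterns["(ab)c"])  # 2
```

On the four example sequences with min_support = 2, the sketch finds the same 53 patterns that are listed under fullPrefixDb at the end of the program output, although it visits them in a slightly different order.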

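Advantages:

1) No candidate sequences are generated, which saves space;

2) The projected databases keep shrinking, because a projection keeps only the postfix part that follows the prefix;

3) The divide-and-conquer approach improves efficiency, and memory usage is more stable than with SPADE and GSP.

Disadvantages:

1) The main cost lies in constructing the projected databases. If there are many sequential patterns and a projected database is built for each of them, the cost is considerable. It can be reduced by a. pseudo-projection, which cuts the number and size of the projected databases, and b. bi-level projection;

2) It is relatively hard to implement.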


Example



Partial walkthrough



Final result



Program output (useful as a reference when debugging)

MIN_SUPPORT: 2


Input Sequence:a (abc) (ac) d (cf)
Input Sequence: (ad) c (bc) (ae)
Input Sequence: (ef) (ab) (df) c b
Input Sequence: e g (af) c b c
frequence: a=4 b=4 c=4 d=3 e=3 f=3 g=1 
support: a=4 b=4 c=4 d=3 e=3 f=3 
fullPrefix~~~: a
lastPrefix: a, postFix: (abc) (ac) d (cf)
lastPrefix: a, postFix: (_d) c (bc) (ae)
lastPrefix: a, postFix: (_b) (df) c b
lastPrefix: a, postFix: (_f) c b c


Input Sequence:(abc) (ac) d (cf)
Input Sequence:(_d) c (bc) (ae)
Input Sequence:(_b) (df) c b
Input Sequence:(_f) c b c
frequence: a=2 b=4 _b=2 c=4 _c=1 d=2 _d=1 e=1 f=2 _e=1 _f=1 
support: a=2 b=4 _b=2 c=4 d=2 f=2 
fullPrefix~~~: aa
lastPrefix: a, postFix: (_bc) (ac) d (cf)
lastPrefix: a, postFix: (_e)


Input Sequence:(_bc) (ac) d (cf)
Input Sequence:(_e)
frequence: a=1 _b=1 c=1 _c=1 d=1 f=1 _e=1 
support: 
fullPrefix~~~: ab
lastPrefix: b, postFix: (_c) (ac) d (cf)
lastPrefix: b, postFix: (_c) (ae)
lastPrefix: b, postFix: 
lastPrefix: b, postFix: c


Input Sequence:(_c) (ac) d (cf)
Input Sequence:(_c) (ae)
Input Sequence:
Input Sequence:c
frequence: a=2 c=2 _c=2 d=1 e=1 f=1 
support: a=2 c=2 _c=2 
fullPrefix~~~: aba
lastPrefix: a, postFix: (_c) d (cf)
lastPrefix: a, postFix: (_e)


Input Sequence:(_c) d (cf)
Input Sequence:(_e)
frequence: c=1 _c=1 d=1 f=1 _e=1 
support: 
fullPrefix~~~: abc
lastPrefix: c, postFix: d (cf)
lastPrefix: c, postFix: 


Input Sequence:d (cf)
Input Sequence:
frequence: c=1 d=1 f=1 _f=1 
support: 
fullPrefix~~~: a(bc)
lastPrefix: _c, postFix: (ac) d (cf)
lastPrefix: _c, postFix: (ae)


Input Sequence:(ac) d (cf)
Input Sequence:(ae)
frequence: a=2 c=1 d=1 e=1 f=1 
support: a=2 
fullPrefix~~~: a(bc)a
lastPrefix: a, postFix: (_c) d (cf)
lastPrefix: a, postFix: (_e)


Input Sequence:(_c) d (cf)
Input Sequence:(_e)
frequence: c=1 _c=1 d=1 f=1 _e=1 
support: 
fullPrefix~~~: (ab)
lastPrefix: _b, postFix: (_c) (ac) d (cf)
lastPrefix: _b, postFix: (df) c b


Input Sequence:(_c) (ac) d (cf)
Input Sequence:(df) c b
frequence: a=1 b=1 c=2 _c=1 d=2 f=2 
support: c=2 d=2 f=2 
fullPrefix~~~: (ab)c
lastPrefix: c, postFix: d (cf)
lastPrefix: c, postFix: b


Input Sequence:d (cf)
Input Sequence:b
frequence: b=1 c=1 d=1 f=1 _f=1 
support: 
fullPrefix~~~: (ab)d
lastPrefix: d, postFix: (cf)
lastPrefix: d, postFix: (_f) c b


Input Sequence:(cf)
Input Sequence:(_f) c b
frequence: b=1 c=2 f=1 _f=1 
support: c=2 
fullPrefix~~~: (ab)dc
lastPrefix: c, postFix: (_f)
lastPrefix: c, postFix: b


Input Sequence:(_f)
Input Sequence:b
frequence: b=1 _f=1 
support: 
fullPrefix~~~: (ab)f
lastPrefix: f, postFix: 
lastPrefix: f, postFix: c b


Input Sequence:
Input Sequence:c b
frequence: b=1 c=1 
support: 
fullPrefix~~~: ac
lastPrefix: c, postFix: (ac) d (cf)
lastPrefix: c, postFix: (bc) (ae)
lastPrefix: c, postFix: b
lastPrefix: c, postFix: b c


Input Sequence:(ac) d (cf)
Input Sequence:(bc) (ae)
Input Sequence:b
Input Sequence:b c
frequence: a=2 b=3 c=3 d=1 e=1 f=1 _f=1 
support: a=2 b=3 c=3 
fullPrefix~~~: aca
lastPrefix: a, postFix: (_c) d (cf)
lastPrefix: a, postFix: (_e)


Input Sequence:(_c) d (cf)
Input Sequence:(_e)
frequence: c=1 _c=1 d=1 f=1 _e=1 
support: 
fullPrefix~~~: acb
lastPrefix: b, postFix: (_c) (ae)
lastPrefix: b, postFix: 
lastPrefix: b, postFix: c


Input Sequence:(_c) (ae)
Input Sequence:
Input Sequence:c
frequence: a=1 c=1 _c=1 e=1 
support: 
fullPrefix~~~: acc
lastPrefix: c, postFix: d (cf)
lastPrefix: c, postFix: (ae)
lastPrefix: c, postFix: 


Input Sequence:d (cf)
Input Sequence:(ae)
Input Sequence:
frequence: a=1 c=1 d=1 e=1 f=1 _f=1 
support: 
fullPrefix~~~: ad
lastPrefix: d, postFix: (cf)
lastPrefix: d, postFix: (_f) c b


Input Sequence:(cf)
Input Sequence:(_f) c b
frequence: b=1 c=2 f=1 _f=1 
support: c=2 
fullPrefix~~~: adc
lastPrefix: c, postFix: (_f)
lastPrefix: c, postFix: b


Input Sequence:(_f)
Input Sequence:b
frequence: b=1 _f=1 
support: 
fullPrefix~~~: af
lastPrefix: f, postFix: 
lastPrefix: f, postFix: c b


Input Sequence:
Input Sequence:c b
frequence: b=1 c=1 
support: 
fullPrefix~~~: b
lastPrefix: b, postFix: (_c) (ac) d (cf)
lastPrefix: b, postFix: (_c) (ae)
lastPrefix: b, postFix: (df) c b
lastPrefix: b, postFix: c


Input Sequence:(_c) (ac) d (cf)
Input Sequence:(_c) (ae)
Input Sequence:(df) c b
Input Sequence:c
frequence: a=2 b=1 c=3 _c=2 d=2 e=1 f=2 
support: a=2 c=3 _c=2 d=2 f=2 
fullPrefix~~~: ba
lastPrefix: a, postFix: (_c) d (cf)
lastPrefix: a, postFix: (_e)


Input Sequence:(_c) d (cf)
Input Sequence:(_e)
frequence: c=1 _c=1 d=1 f=1 _e=1 
support: 
fullPrefix~~~: bc
lastPrefix: c, postFix: d (cf)
lastPrefix: c, postFix: b
lastPrefix: c, postFix: 


Input Sequence:d (cf)
Input Sequence:b
Input Sequence:
frequence: b=1 c=1 d=1 f=1 _f=1 
support: 
fullPrefix~~~: (bc)
lastPrefix: _c, postFix: (ac) d (cf)
lastPrefix: _c, postFix: (ae)


Input Sequence:(ac) d (cf)
Input Sequence:(ae)
frequence: a=2 c=1 d=1 e=1 f=1 
support: a=2 
fullPrefix~~~: (bc)a
lastPrefix: a, postFix: (_c) d (cf)
lastPrefix: a, postFix: (_e)


Input Sequence:(_c) d (cf)
Input Sequence:(_e)
frequence: c=1 _c=1 d=1 f=1 _e=1 
support: 
fullPrefix~~~: bd
lastPrefix: d, postFix: (cf)
lastPrefix: d, postFix: (_f) c b


Input Sequence:(cf)
Input Sequence:(_f) c b
frequence: b=1 c=2 f=1 _f=1 
support: c=2 
fullPrefix~~~: bdc
lastPrefix: c, postFix: (_f)
lastPrefix: c, postFix: b


Input Sequence:(_f)
Input Sequence:b
frequence: b=1 _f=1 
support: 
fullPrefix~~~: bf
lastPrefix: f, postFix: 
lastPrefix: f, postFix: c b


Input Sequence:
Input Sequence:c b
frequence: b=1 c=1 
support: 
fullPrefix~~~: c
lastPrefix: c, postFix: (ac) d (cf)
lastPrefix: c, postFix: (bc) (ae)
lastPrefix: c, postFix: b
lastPrefix: c, postFix: b c


Input Sequence:(ac) d (cf)
Input Sequence:(bc) (ae)
Input Sequence:b
Input Sequence:b c
frequence: a=2 b=3 c=3 d=1 e=1 f=1 _f=1 
support: a=2 b=3 c=3 
fullPrefix~~~: ca
lastPrefix: a, postFix: (_c) d (cf)
lastPrefix: a, postFix: (_e)


Input Sequence:(_c) d (cf)
Input Sequence:(_e)
frequence: c=1 _c=1 d=1 f=1 _e=1 
support: 
fullPrefix~~~: cb
lastPrefix: b, postFix: (_c) (ae)
lastPrefix: b, postFix: 
lastPrefix: b, postFix: c


Input Sequence:(_c) (ae)
Input Sequence:
Input Sequence:c
frequence: a=1 c=1 _c=1 e=1 
support: 
fullPrefix~~~: cc
lastPrefix: c, postFix: d (cf)
lastPrefix: c, postFix: (ae)
lastPrefix: c, postFix: 


Input Sequence:d (cf)
Input Sequence:(ae)
Input Sequence:
frequence: a=1 c=1 d=1 e=1 f=1 _f=1 
support: 
fullPrefix~~~: d
lastPrefix: d, postFix: (cf)
lastPrefix: d, postFix: c (bc) (ae)
lastPrefix: d, postFix: (_f) c b


Input Sequence:(cf)
Input Sequence:c (bc) (ae)
Input Sequence:(_f) c b
frequence: a=1 b=2 c=3 e=1 f=1 _f=1 
support: b=2 c=3 
fullPrefix~~~: db
lastPrefix: b, postFix: (_c) (ae)
lastPrefix: b, postFix: 


Input Sequence:(_c) (ae)
Input Sequence:
frequence: a=1 _c=1 e=1 
support: 
fullPrefix~~~: dc
lastPrefix: c, postFix: (_f)
lastPrefix: c, postFix: (bc) (ae)
lastPrefix: c, postFix: b


Input Sequence:(_f)
Input Sequence:(bc) (ae)
Input Sequence:b
frequence: a=1 b=2 c=1 e=1 _f=1 
support: b=2 
fullPrefix~~~: dcb
lastPrefix: b, postFix: (_c) (ae)
lastPrefix: b, postFix: 


Input Sequence:(_c) (ae)
Input Sequence:
frequence: a=1 _c=1 e=1 
support: 
fullPrefix~~~: e
lastPrefix: e, postFix: 
lastPrefix: e, postFix: (_f) (ab) (df) c b
lastPrefix: e, postFix: g (af) c b c


Input Sequence:
Input Sequence:(_f) (ab) (df) c b
Input Sequence:g (af) c b c
frequence: a=2 b=2 c=2 d=1 f=2 _f=1 g=1 
support: a=2 b=2 c=2 f=2 
fullPrefix~~~: ea
lastPrefix: a, postFix: (_b) (df) c b
lastPrefix: a, postFix: (_f) c b c


Input Sequence:(_b) (df) c b
Input Sequence:(_f) c b c
frequence: b=2 _b=1 c=2 d=1 f=1 _f=1 
support: b=2 c=2 
fullPrefix~~~: eab
lastPrefix: b, postFix: 
lastPrefix: b, postFix: c


Input Sequence:
Input Sequence:c
frequence: c=1 
support: 
fullPrefix~~~: eac
lastPrefix: c, postFix: b
lastPrefix: c, postFix: b c


Input Sequence:b
Input Sequence:b c
frequence: b=2 c=1 
support: b=2 
fullPrefix~~~: eacb
lastPrefix: b, postFix: 
lastPrefix: b, postFix: c


Input Sequence:
Input Sequence:c
frequence: c=1 
support: 
fullPrefix~~~: eb
lastPrefix: b, postFix: (df) c b
lastPrefix: b, postFix: c


Input Sequence:(df) c b
Input Sequence:c
frequence: b=1 c=2 d=1 f=1 
support: c=2 
fullPrefix~~~: ebc
lastPrefix: c, postFix: b
lastPrefix: c, postFix: 


Input Sequence:b
Input Sequence:
frequence: b=1 
support: 
fullPrefix~~~: ec
lastPrefix: c, postFix: b
lastPrefix: c, postFix: b c


Input Sequence:b
Input Sequence:b c
frequence: b=2 c=1 
support: b=2 
fullPrefix~~~: ecb
lastPrefix: b, postFix: 
lastPrefix: b, postFix: c


Input Sequence:
Input Sequence:c
frequence: c=1 
support: 
fullPrefix~~~: ef
lastPrefix: f, postFix: c b
lastPrefix: f, postFix: c b c


Input Sequence:c b
Input Sequence:c b c
frequence: b=2 c=2 
support: b=2 c=2 
fullPrefix~~~: efb
lastPrefix: b, postFix: 
lastPrefix: b, postFix: c


Input Sequence:
Input Sequence:c
frequence: c=1 
support: 
fullPrefix~~~: efc
lastPrefix: c, postFix: b
lastPrefix: c, postFix: b c


Input Sequence:b
Input Sequence:b c
frequence: b=2 c=1 
support: b=2 
fullPrefix~~~: efcb
lastPrefix: b, postFix: 
lastPrefix: b, postFix: c


Input Sequence:
Input Sequence:c
frequence: c=1 
support: 
fullPrefix~~~: f
lastPrefix: f, postFix: 
lastPrefix: f, postFix: (ab) (df) c b
lastPrefix: f, postFix: c b c


Input Sequence:
Input Sequence:(ab) (df) c b
Input Sequence:c b c
frequence: a=1 b=2 c=2 d=1 f=1 
support: b=2 c=2 
fullPrefix~~~: fb
lastPrefix: b, postFix: (df) c b
lastPrefix: b, postFix: c


Input Sequence:(df) c b
Input Sequence:c
frequence: b=1 c=2 d=1 f=1 
support: c=2 
fullPrefix~~~: fbc
lastPrefix: c, postFix: b
lastPrefix: c, postFix: 


Input Sequence:b
Input Sequence:
frequence: b=1 
support: 
fullPrefix~~~: fc
lastPrefix: c, postFix: b
lastPrefix: c, postFix: b c


Input Sequence:b
Input Sequence:b c
frequence: b=2 c=1 
support: b=2 
fullPrefix~~~: fcb
lastPrefix: b, postFix: 
lastPrefix: b, postFix: c


Input Sequence:
Input Sequence:c
frequence: c=1 
support: 
fullPrefixDb:
a, aa, ab, aba, abc, a(bc), a(bc)a, (ab), (ab)c, (ab)d, (ab)dc, (ab)f, ac, aca, acb, acc, ad, adc, af, b, ba, bc, (bc), (bc)a, bd, bdc, bf, c, ca, cb, cc, d, db, dc, dcb, e, ea, eab, eac, eacb, eb, ebc, ec, ecb, ef, efb, efc, efcb, f, fb, fbc, fc, fcb, 

