GSP序列模式分析算法

来源：互联网发布：淘宝网店店名怎么改编辑：程序博客网时间：2024/06/06 07:30

参考资料：http://blog.csdn.net/zone_programming/article/details/42032309

更多数据挖掘代码：https://github.com/linyiqun/DataMiningAlgorithm

介绍

GSP算法是序列模式挖掘算法的一种，他是一种类Apriori的一种，整个过程与Apriori算法比较类似，不过在细节上会略有不同，在下面的描述中，将会有所描述。GSP在原有的频繁模式定义的概念下，增加了3个的概念。

1、加入时间约束min_gap，max_gap，要求原来的连续变为只要满足在规定的min_gap到max_gap之间即可。

2、加入time_windows_size，只要在windows_size内的item，都可以被认为是同一ItemSet。

3、加入分类标准。

以上3点新的中的第一条特征将会在后面的算法中着重展现。

算法原理

1、根据所输入的序列，找出所有的单项集，即1频繁模式，这里会经过最小支持度阈值的判断。

2、根据1频繁模式进行连接运算，产生2频繁模式，这里会有进行最小阈值的判断。

3、根据2频繁模式连接产生3频繁模式，会经过最小支持度判断和剪枝操作，剪枝操作的原理在于判断他的所有子集是否也全是频繁模式。

4、3频繁模式不断的挖掘知道不能够产生出候选集为止。

连接操作的原理

2个序列，全部变为item列表的形式，如果a序列去掉第1个元素后，b序列去掉最后1个序列，2个序列的item完全一致，则代表可以连接，由b的最后一个元素加入到a中，至于是以独立项集的身份加入还是加入到a中最后1个项集中取决于b中的最后一个元素所属项集是否为单项项集。

时间约束计算

这个是用在支持度计数使用的，GSP算法的支持度计算不是那么简单，比如序列判断<2, <3, 4>>是否在序列<(1,5), 2 , <3, 4>, 2>，这就不能仅仅判断序列中是否只包含2,<3, 4>就行了，还要满足时间间隔约束，这就要把2，和<3,4>的所有出现时间都找出来，然后再里面找出一条满足时间约束的路径就算包含。时间的定义是从左往右起1.2,3...继续，以1个项集为单位，所有2的时间有2个分别为t=2和t=4，然后同理，因为<3,4>在序列中只有1次，所以时间为t=3，所以问题就变为了下面一个数组的问题

2 4

从时间数组的上往下，通过对多个时间的组合，找出1条满足时间约束的方案，这里的方案只有2-3,4-3,然后判断时间间隔，如果存在这样的方式，则代表此序列支持所给定序列，支持度值加1,这个算法在程序的实现中是比较复杂的。

算法的代码实现

测试数据输入(格式：事务ID item数 item1 item2.....):

1 2 1 51 1 21 1 31 1 42 1 12 1 32 1 42 2 3 53 1 13 1 23 1 33 1 43 1 54 1 14 1 34 1 55 1 45 1 5

最后组成的序列为：

<(1,5) 2 3 4>

<1 3 4 (3,5)>

<1 2 3 4 5>

<1 3 5>

<4 5>

也就是说同一序列都是同事务的。下面是关键的类

Sequence.java:

package DataMining_GSP;import java.util.ArrayList;/** * 序列，每个序列内部包含多组ItemSet项集 *  * @author lyq *  */public class Sequence implements Comparable<Sequence>, Cloneable {// 序列所属事务IDprivate int trsanctionID;// 项集列表private ArrayList<ItemSet> itemSetList;public Sequence(int trsanctionID) {this.trsanctionID = trsanctionID;this.itemSetList = new ArrayList<>();}public Sequence() {this.itemSetList = new ArrayList<>();}public int getTrsanctionID() {return trsanctionID;}public void setTrsanctionID(int trsanctionID) {this.trsanctionID = trsanctionID;}public ArrayList<ItemSet> getItemSetList() {return itemSetList;}public void setItemSetList(ArrayList<ItemSet> itemSetList) {this.itemSetList = itemSetList;}/** * 取出序列中第一个项集的第一个元素 *  * @return */public Integer getFirstItemSetNum() {return this.getItemSetList().get(0).getItems().get(0);}/** * 获取序列中最后一个项集 *  * @return */public ItemSet getLastItemSet() {return getItemSetList().get(getItemSetList().size() - 1);}/** * 获取序列中最后一个项集的最后一个一个元素 *  * @return */public Integer getLastItemSetNum() {ItemSet lastItemSet = getItemSetList().get(getItemSetList().size() - 1);int lastItemNum = lastItemSet.getItems().get(lastItemSet.getItems().size() - 1);return lastItemNum;}/** * 判断序列中最后一个项集是否为单一的值 *  * @return */public boolean isLastItemSetSingleNum() {ItemSet lastItemSet = getItemSetList().get(getItemSetList().size() - 1);int size = lastItemSet.getItems().size();return size == 1 ? true : false;}@Overridepublic int compareTo(Sequence o) {// TODO Auto-generated method stubreturn this.getFirstItemSetNum().compareTo(o.getFirstItemSetNum());}@Overrideprotected Object clone() throws CloneNotSupportedException {// TODO Auto-generated method stubreturn super.clone();}/** * 拷贝一份一模一样的序列 */public Sequence copySeqence(){Sequence copySeq = new Sequence();for(ItemSet itemSet: this.itemSetList){copySeq.getItemSetList().add(new ItemSet(itemSet.copyItems()));}return copySeq;}/** * 比较2个序列是否相等，需要判断内部的每个项集是否完全一致 *  * @param seq *            比较的序列对象 * @return */public boolean compareIsSame(Sequence seq) {boolean result = true;ArrayList<ItemSet> itemSetList2 = seq.getItemSetList();ItemSet tempItemSet1;ItemSet tempItemSet2;if (itemSetList2.size() != this.itemSetList.size()) {return false;}for (int i = 0; i < itemSetList2.size(); i++) {tempItemSet1 = this.itemSetList.get(i);tempItemSet2 = itemSetList2.get(i);if (!tempItemSet1.compareIsSame(tempItemSet2)) {// 只要不相等，直接退出函数result = false;break;}}return result;}/** * 生成此序列的所有子序列 *  * @return */public ArrayList<Sequence> createChildSeqs() {ArrayList<Sequence> childSeqs = new ArrayList<>();ArrayList<Integer> tempItems;Sequence tempSeq = null;ItemSet tempItemSet;for (int i = 0; i < this.itemSetList.size(); i++) {tempItemSet = itemSetList.get(i);if (tempItemSet.getItems().size() == 1) {tempSeq = this.copySeqence();// 如果只有项集中只有1个元素，则直接移除tempSeq.itemSetList.remove(i);childSeqs.add(tempSeq);} else {tempItems = tempItemSet.getItems();for (int j = 0; j < tempItems.size(); j++) {tempSeq = this.copySeqence();// 在拷贝的序列中移除一个数字tempSeq.getItemSetList().get(i).getItems().remove(j);childSeqs.add(tempSeq);}}}return childSeqs;}}

ItemSet.java:

package DataMining_GSP;import java.util.ArrayList;/** * 序列中的子项集 *  * @author lyq *  */public class ItemSet {/** * 项集中保存的是数字项数组 */private ArrayList<Integer> items;public ItemSet(String[] itemStr) {items = new ArrayList<>();for (String s : itemStr) {items.add(Integer.parseInt(s));}}public ItemSet(int[] itemNum) {items = new ArrayList<>();for (int num : itemNum) {items.add(num);}}public ItemSet(ArrayList<Integer> itemNum) {this.items = itemNum;}public ArrayList<Integer> getItems() {return items;}public void setItems(ArrayList<Integer> items) {this.items = items;}/** * 判断2个项集是否相等 *  * @param itemSet *            比较对象 * @return */public boolean compareIsSame(ItemSet itemSet) {boolean result = true;if (this.items.size() != itemSet.items.size()) {return false;}for (int i = 0; i < itemSet.items.size(); i++) {if (this.items.get(i) != itemSet.items.get(i)) {// 只要有值不相等，直接算作不相等result = false;break;}}return result;}/** * 拷贝项集中同样的数据一份 *  * @return */public ArrayList<Integer> copyItems() {ArrayList<Integer> copyItems = new ArrayList<>();for (int num : this.items) {copyItems.add(num);}return copyItems;}}

GSPTool.java(算法工具类)：

package DataMining_GSP;import java.io.BufferedReader;import java.io.File;import java.io.FileReader;import java.io.IOException;import java.util.ArrayList;import java.util.Collections;import java.util.HashMap;import java.util.Map;/** * GSP序列模式分析算法 *  * @author lyq *  */public class GSPTool {// 测试数据文件地址private String filePath;// 最小支持度阈值private int minSupportCount;// 时间最小间隔private int min_gap;// 时间最大间隔private int max_gap;// 原始数据序列private ArrayList<Sequence> totalSequences;// GSP算法中产生的所有的频繁项集序列private ArrayList<Sequence> totalFrequencySeqs;// 序列项数字对时间的映射图容器private ArrayList<ArrayList<HashMap<Integer, Integer>>> itemNum2Time;public GSPTool(String filePath, int minSupportCount, int min_gap,int max_gap) {this.filePath = filePath;this.minSupportCount = minSupportCount;this.min_gap = min_gap;this.max_gap = max_gap;totalFrequencySeqs = new ArrayList<>();readDataFile();}/** * 从文件中读取数据 */private void readDataFile() {File file = new File(filePath);ArrayList<String[]> dataArray = new ArrayList<String[]>();try {BufferedReader in = new BufferedReader(new FileReader(file));String str;String[] tempArray;while ((str = in.readLine()) != null) {tempArray = str.split(" ");dataArray.add(tempArray);}in.close();} catch (IOException e) {e.getStackTrace();}HashMap<Integer, Sequence> mapSeq = new HashMap<>();Sequence seq;ItemSet itemSet;int tID;String[] itemStr;for (String[] str : dataArray) {tID = Integer.parseInt(str[0]);itemStr = new String[Integer.parseInt(str[1])];System.arraycopy(str, 2, itemStr, 0, itemStr.length);itemSet = new ItemSet(itemStr);if (mapSeq.containsKey(tID)) {seq = mapSeq.get(tID);} else {seq = new Sequence(tID);}seq.getItemSetList().add(itemSet);mapSeq.put(tID, seq);}// 将序列图加入到序列List中totalSequences = new ArrayList<>();for (Map.Entry entry : mapSeq.entrySet()) {totalSequences.add((Sequence) entry.getValue());}}/** * 生成1频繁项集 *  * @return */private ArrayList<Sequence> generateOneFrequencyItem() {int count = 0;int currentTransanctionID = 0;Sequence tempSeq;ItemSet tempItemSet;HashMap<Integer, Integer> itemNumMap = new HashMap<>();ArrayList<Sequence> seqList = new ArrayList<>();for (Sequence seq : totalSequences) {for (ItemSet itemSet : seq.getItemSetList()) {for (int num : itemSet.getItems()) {// 如果没有此种类型项，则进行添加操作if (!itemNumMap.containsKey(num)) {itemNumMap.put(num, 1);}}}}boolean isContain = false;int number = 0;for (Map.Entry entry : itemNumMap.entrySet()) {count = 0;number = (int) entry.getKey();for (Sequence seq : totalSequences) {isContain = false;for (ItemSet itemSet : seq.getItemSetList()) {for (int num : itemSet.getItems()) {if (num == number) {isContain = true;break;}}if(isContain){break;}}if(isContain){count++;}}itemNumMap.put(number, count);}for (Map.Entry entry : itemNumMap.entrySet()) {count = (int) entry.getValue();if (count >= minSupportCount) {tempSeq = new Sequence();tempItemSet = new ItemSet(new int[] { (int) entry.getKey() });tempSeq.getItemSetList().add(tempItemSet);seqList.add(tempSeq);}}// 将序列升序排列Collections.sort(seqList);// 将频繁1项集加入总频繁项集列表中totalFrequencySeqs.addAll(seqList);return seqList;}/** * 通过1频繁项集连接产生2频繁项集 *  * @param oneSeq *            1频繁项集序列 * @return */private ArrayList<Sequence> generateTwoFrequencyItem(ArrayList<Sequence> oneSeq) {Sequence tempSeq;ArrayList<Sequence> resultSeq = new ArrayList<>();ItemSet tempItemSet;int num1;int num2;// 假如将<a>,<b>2个1频繁项集做连接组合，可以分为<a a>，<a b>，<b a>,<b b>4个序列模式// 注意此时的每个序列中包含2个独立项集for (int i = 0; i < oneSeq.size(); i++) {num1 = oneSeq.get(i).getFirstItemSetNum();for (int j = 0; j < oneSeq.size(); j++) {num2 = oneSeq.get(j).getFirstItemSetNum();tempSeq = new Sequence();tempItemSet = new ItemSet(new int[] { num1 });tempSeq.getItemSetList().add(tempItemSet);tempItemSet = new ItemSet(new int[] { num2 });tempSeq.getItemSetList().add(tempItemSet);if (countSupport(tempSeq) >= minSupportCount) {resultSeq.add(tempSeq);}}}// 上面连接还有1种情况是每个序列中只包含有一个项集的情况，此时a,b的划分则是<(a,a)> <(a,b)> <(b,b)>for (int i = 0; i < oneSeq.size(); i++) {num1 = oneSeq.get(i).getFirstItemSetNum();for (int j = i; j < oneSeq.size(); j++) {num2 = oneSeq.get(j).getFirstItemSetNum();tempSeq = new Sequence();tempItemSet = new ItemSet(new int[] { num1, num2 });tempSeq.getItemSetList().add(tempItemSet);if (countSupport(tempSeq) >= minSupportCount) {resultSeq.add(tempSeq);}}}// 同样将2频繁项集加入到总频繁项集中totalFrequencySeqs.addAll(resultSeq);return resultSeq;}/** * 根据上次的频繁集连接产生新的侯选集 *  * @param seqList *            上次产生的候选集 * @return */private ArrayList<Sequence> generateCandidateItem(ArrayList<Sequence> seqList) {Sequence tempSeq;ArrayList<Integer> tempNumArray;ArrayList<Sequence> resultSeq = new ArrayList<>();// 序列数字项列表ArrayList<ArrayList<Integer>> seqNums = new ArrayList<>();for (int i = 0; i < seqList.size(); i++) {tempNumArray = new ArrayList<>();tempSeq = seqList.get(i);for (ItemSet itemSet : tempSeq.getItemSetList()) {tempNumArray.addAll(itemSet.copyItems());}seqNums.add(tempNumArray);}ArrayList<Integer> array1;ArrayList<Integer> array2;// 序列i,j的拷贝Sequence seqi = null;Sequence seqj = null;// 判断是否能够连接，默认能连接boolean canConnect = true;// 进行连接运算，包括自己与自己连接for (int i = 0; i < seqNums.size(); i++) {for (int j = 0; j < seqNums.size(); j++) {array1 = (ArrayList<Integer>) seqNums.get(i).clone();array2 = (ArrayList<Integer>) seqNums.get(j).clone();// 将第一个数字组去掉第一个，第二个数字组去掉最后一个，如果剩下的部分相等，则可以连接array1.remove(0);array2.remove(array2.size() - 1);canConnect = true;for (int k = 0; k < array1.size(); k++) {if (array1.get(k) != array2.get(k)) {canConnect = false;break;}}if (canConnect) {seqi = seqList.get(i).copySeqence();seqj = seqList.get(j).copySeqence();int lastItemNum = seqj.getLastItemSetNum();if (seqj.isLastItemSetSingleNum()) {// 如果j序列的最后项集为单一值，则最后一个数字以独立项集加入i序列ItemSet itemSet = new ItemSet(new int[] { lastItemNum });seqi.getItemSetList().add(itemSet);} else {// 如果j序列的最后项集为非单一值，则最后一个数字加入i序列最后一个项集中ItemSet itemSet = seqi.getLastItemSet();itemSet.getItems().add(lastItemNum);}// 判断是否超过最小支持度阈值if (isChildSeqContained(seqi)&& countSupport(seqi) >= minSupportCount) {resultSeq.add(seqi);}}}}totalFrequencySeqs.addAll(resultSeq);return resultSeq;}/** * 判断此序列的所有子序列是否也是频繁序列 *  * @param seq *            待比较序列 * @return */private boolean isChildSeqContained(Sequence seq) {boolean isContained = false;ArrayList<Sequence> childSeqs;childSeqs = seq.createChildSeqs();for (Sequence tempSeq : childSeqs) {isContained = false;for (Sequence frequencySeq : totalFrequencySeqs) {if (tempSeq.compareIsSame(frequencySeq)) {isContained = true;break;}}if (!isContained) {break;}}return isContained;}/** * 候选集判断支持度的值 *  * @param seq *            待判断序列 * @return */private int countSupport(Sequence seq) {int count = 0;int matchNum = 0;Sequence tempSeq;ItemSet tempItemSet;HashMap<Integer, Integer> timeMap;ArrayList<ItemSet> itemSetList;ArrayList<ArrayList<Integer>> numArray = new ArrayList<>();// 每项集对应的时间链表ArrayList<ArrayList<Integer>> timeArray = new ArrayList<>();for (ItemSet itemSet : seq.getItemSetList()) {numArray.add(itemSet.getItems());}for (int i = 0; i < totalSequences.size(); i++) {timeArray = new ArrayList<>();for (int s = 0; s < numArray.size(); s++) {ArrayList<Integer> childNum = numArray.get(s);ArrayList<Integer> localTime = new ArrayList<>();tempSeq = totalSequences.get(i);itemSetList = tempSeq.getItemSetList();for (int j = 0; j < itemSetList.size(); j++) {tempItemSet = itemSetList.get(j);matchNum = 0;int t = 0;if (tempItemSet.getItems().size() == childNum.size()) {timeMap = itemNum2Time.get(i).get(j);// 只有当项集长度匹配时才匹配for (int k = 0; k < childNum.size(); k++) {if (timeMap.containsKey(childNum.get(k))) {matchNum++;t = timeMap.get(childNum.get(k));}}// 如果完全匹配，则记录时间if (matchNum == childNum.size()) {localTime.add(t);}}}if (localTime.size() > 0) {timeArray.add(localTime);}}// 判断时间是否满足时间最大最小约束，如果满足，则此条事务包含候选事务if (timeArray.size() == numArray.size()&& judgeTimeInGap(timeArray)) {count++;}}return count;}/** * 判断事务是否满足时间约束 *  * @param timeArray *            时间数组，每行代表各项集的在事务中的发生时间链表 * @return */private boolean judgeTimeInGap(ArrayList<ArrayList<Integer>> timeArray) {boolean result = false;int preTime = 0;ArrayList<Integer> firstTimes = timeArray.get(0);timeArray.remove(0);if (timeArray.size() == 0) {return false;}for (int i = 0; i < firstTimes.size(); i++) {preTime = firstTimes.get(i);if (dfsJudgeTime(preTime, timeArray)) {result = true;break;}}return result;}/** * 深度优先遍历时间，判断是否有符合条件的时间间隔 *  * @param preTime * @param timeArray * @return */private boolean dfsJudgeTime(int preTime,ArrayList<ArrayList<Integer>> timeArray) {boolean result = false;ArrayList<ArrayList<Integer>> timeArrayClone = (ArrayList<ArrayList<Integer>>) timeArray.clone();ArrayList<Integer> firstItemItem = timeArrayClone.get(0);for (int i = 0; i < firstItemItem.size(); i++) {if (firstItemItem.get(i) - preTime >= min_gap&& firstItemItem.get(i) - preTime <= max_gap) {// 如果此2项间隔时间满足时间约束，则继续往下递归preTime = firstItemItem.get(i);timeArrayClone.remove(0);if (timeArrayClone.size() == 0) {return true;} else {result = dfsJudgeTime(preTime, timeArrayClone);if (result) {return true;}}}}return result;}/** * 初始化序列项到时间的序列图，为了后面的时间约束计算 */private void initItemNumToTimeMap() {Sequence seq;itemNum2Time = new ArrayList<>();HashMap<Integer, Integer> tempMap;ArrayList<HashMap<Integer, Integer>> tempMapList;for (int i = 0; i < totalSequences.size(); i++) {seq = totalSequences.get(i);tempMapList = new ArrayList<>();for (int j = 0; j < seq.getItemSetList().size(); j++) {ItemSet itemSet = seq.getItemSetList().get(j);tempMap = new HashMap<>();for (int itemNum : itemSet.getItems()) {tempMap.put(itemNum, j + 1);}tempMapList.add(tempMap);}itemNum2Time.add(tempMapList);}}/** * 进行GSP算法计算 */public void gspCalculate() {ArrayList<Sequence> oneSeq;ArrayList<Sequence> twoSeq;ArrayList<Sequence> candidateSeq;initItemNumToTimeMap();oneSeq = generateOneFrequencyItem();twoSeq = generateTwoFrequencyItem(oneSeq);candidateSeq = twoSeq;// 不断连接生产候选集，直到没有产生出侯选集for (;;) {candidateSeq = generateCandidateItem(candidateSeq);if (candidateSeq.size() == 0) {break;}}outputSeqence(totalFrequencySeqs);}/** * 输出序列列表信息 *  * @param outputSeqList *            待输出序列列表 */private void outputSeqence(ArrayList<Sequence> outputSeqList) {for (Sequence seq : outputSeqList) {System.out.print("<");for (ItemSet itemSet : seq.getItemSetList()) {System.out.print("(");for (int num : itemSet.getItems()) {System.out.print(num + ",");}System.out.print("), ");}System.out.println(">");}}}

调用类Client.java:

package DataMining_GSP;/** * GSP序列模式分析算法 * @author lyq * */public class Client {public static void main(String[] args){String filePath = "C:\\Users\\lyq\\Desktop\\icon\\testInput.txt";//最小支持度阈值int minSupportCount = 2;//时间最小间隔int min_gap = 1;//施加最大间隔int max_gap = 5;GSPTool tool = new GSPTool(filePath, minSupportCount, min_gap, max_gap);tool.gspCalculate();}}

算法的输出(挖掘出的所有频繁模式)：

<(1,), ><(2,), ><(3,), ><(4,), ><(5,), ><(1,), (3,), ><(1,), (4,), ><(1,), (5,), ><(2,), (3,), ><(2,), (4,), ><(3,), (4,), ><(3,), (5,), ><(4,), (5,), ><(1,), (3,), (4,), ><(1,), (3,), (5,), ><(2,), (3,), (4,), >

算法实现的难点

1、算法花费了几天的时间，难点首先在于对算法原理本身的理解，网上对于此算法的资料特别少，而且不同的人所表达的意思都有少许的不同，讲的也不是很详细，于是就通过阅读别人的代码理解GSP算法的原理，我的代码实现也是参考了参考资料的C语言的实现。

2、在实现时间约束的支持度计数统计的时候，调试了一段时间，做时间统计容易出错，因为层级实在太多容易搞晕。

3、还有1个是Sequence和ItemSet的拷贝时的引用问题，在产生新的序列时一定要深拷贝1个否则导致同一引用会把原数据给改掉的。

GSP算法和Apriori算法的比较

我是都实现过了GSP算法和Apriori算法的，后者是被称为关联规则挖掘算法，偏向于挖掘关联规则的，2个算法在连接的操作上有不一样的地方，还有在数据的构成方式上，Apriori的数据会简单一点，都是单项单项构成的，而且在做支持度统计的时候只需判断存在与否即可。不需要考虑时间约束。Apriori算法给定K项集，连接到K-1项集算法就停止了，而GSP算法是直到不能够产生候选集为止。

3 0