FP-Tree频繁模式树算法

来源：互联网发布：二次元动态桌面软件编辑：程序博客网时间：2024/05/16 12:34

参考资料：http://blog.csdn.net/sealyao/article/details/6460578
更多数据挖掘算法：https://github.com/linyiqun/DataMiningAlgorithm

介绍

FP-Tree算法全称是FrequentPattern Tree算法，就是频繁模式树算法，他与Apriori算法一样也是用来挖掘频繁项集的，不过不同的是，FP-Tree算法是Apriori算法的优化处理，他解决了Apriori算法在过程中会产生大量的候选集的问题，而FP-Tree算法则是发现频繁模式而不产生候选集。但是频繁模式挖掘出来后，产生关联规则的步骤还是和Apriori是一样的。

算法原理

FP树，FP树，那他当然是最终被构造成一个树的形状了。所以步骤如下：

1、创建根节点，用NULL标记。

2、统计所有的事务数据，统计事务中各个类型项的总支持度(在下面的例子中就是各个商品ID的总个数)

3、依次读取每条事务，比如T1， 1， 2， 5，因为按照总支持度计数数量降序排列，输入的数据顺序就是2， 1， 5，然后挂到根节点上。

4、依次读取后面的事务，并以同样的方式加入的FP树中，顺着根节点路径添加，并更新节点上的支持度计数。

最后就会形成这样的一棵树：

然后还要新建一个项头表，代表所有节点的类型和支持度计数。这个东西在后面会有大用处。如果你以为FP树的算法过程到这里就结束了，你就大错特错了，算法的终结过程为最后的FP树只包括但路径，就是树呈现直线形式，也就是节点都只有1个孩子或没有孩子，顺着一条线下来，没有其他的分支。这就算是一条挖掘出的频繁模式。所以上面的算法还要继续递归的构造FP树，递归构造FP树的过程：

1、这时我们从最下面的I5开始取出。把I5加入到后缀模式中。后缀模式到时会于频繁模式组合出现构成最终的频繁模式。

2、获取频繁模式基，<I2, Ii>，<I2, I1, I3>，计数为I5节点的count值，然后以这2条件模式基为输入的事务，继续构造一个新的FP树

绘图1

3、这就是我们要达到的FP树单路径的目标了，不过这里个要求，要把支持度计数不够的点排除，这里的I3:1就不符号，所以最后I5后缀模式下的<I2, I1>与I5的组合模式了，就为<I2, I5>, <I1, I5>,<I1, I2, I5>。

I5下的挖掘频繁模式是比较简单的，没有出现递归，看一下I3下的递归构造，这就不简单了，同样的操作，最后就会出现下面这幅图的样子：

发现还不是单条路径，继续递归构造，此时的后缀模式硬卧I3+I1,就是<I3, I1>，然后就来到了下面这幅图的情形了。

绘图2

后面的例子会有更详细的说明。

算法的实现

输入数据如下：

交易ID

商品ID列表

T100

I1，I2，I5

T200

I2，I4

T300

I2，I3

T400

I1，I2，I4

T500

I1，I3

T600

I2，I3

T700

I1，I3

T800

I1，I2，I3，I5

T900

I1，I2，I3

在文件中的形式就是：

[java] view plain copy
 print?
T1 1 2 5    
T2 2 4    
T3 2 3    
T4 1 2 4    
T5 1 3    
T6 2 3    
T7 1 3    
T8 1 2 3 5    
T9 1 2 3   

算法的树节点类：

[java] view plain copy
 print?
/** 
 * FP树节点 
 *  
 * @author lyq 
 *  
 */  
public class TreeNode implements Comparable<TreeNode>, Cloneable{  
    // 节点类别名称  
    private String name;  
    // 计数数量  
    private Integer count;  
    // 父亲节点  
    private TreeNode parentNode;  
    // 孩子节点，可以为多个  
    private ArrayList<TreeNode> childNodes;  
      
    public TreeNode(String name, int count){  
        this.name = name;  
        this.count = count;  
    }  
  
    public String getName() {  
        return name;  
    }  
  
    public void setName(String name) {  
        this.name = name;  
    }  
  
    public Integer getCount() {  
        return count;  
    }  
  
    public void setCount(Integer count) {  
        this.count = count;  
    }  
  
    public TreeNode getParentNode() {  
        return parentNode;  
    }  
  
    public void setParentNode(TreeNode parentNode) {  
        this.parentNode = parentNode;  
    }  
  
    public ArrayList<TreeNode> getChildNodes() {  
        return childNodes;  
    }  
  
    public void setChildNodes(ArrayList<TreeNode> childNodes) {  
        this.childNodes = childNodes;  
    }  
  
    @Override  
    public int compareTo(TreeNode o) {  
        // TODO Auto-generated method stub  
        return o.getCount().compareTo(this.getCount());  
    }  
  
    @Override  
    protected Object clone() throws CloneNotSupportedException {  
        // TODO Auto-generated method stub  
        //因为对象内部有引用，需要采用深拷贝  
        TreeNode node = (TreeNode)super.clone();   
        if(this.getParentNode() != null){  
            node.setParentNode((TreeNode) this.getParentNode().clone());  
        }  
          
        if(this.getChildNodes() != null){  
            node.setChildNodes((ArrayList<TreeNode>) this.getChildNodes().clone());  
        }  
          
        return node;  
    }  
      
}  

算法主要实现类：

[java] view plain copy
 print?
package DataMining_FPTree;  
  
import java.io.BufferedReader;  
import java.io.File;  
import java.io.FileReader;  
import java.io.IOException;  
import java.util.ArrayList;  
import java.util.Collections;  
import java.util.HashMap;  
import java.util.Map;  
  
/** 
 * FPTree算法工具类 
 *  
 * @author lyq 
 *  
 */  
public class FPTreeTool {  
    // 输入数据文件位置  
    private String filePath;  
    // 最小支持度阈值  
    private int minSupportCount;  
    // 所有事物ID记录  
    private ArrayList<String[]> totalGoodsID;  
    // 各个ID的统计数目映射表项，计数用于排序使用  
    private HashMap<String, Integer> itemCountMap;  
  
    public FPTreeTool(String filePath, int minSupportCount) {  
        this.filePath = filePath;  
        this.minSupportCount = minSupportCount;  
        readDataFile();  
    }  
  
    /** 
     * 从文件中读取数据 
     */  
    private void readDataFile() {  
        File file = new File(filePath);  
        ArrayList<String[]> dataArray = new ArrayList<String[]>();  
  
        try {  
            BufferedReader in = new BufferedReader(new FileReader(file));  
            String str;  
            String[] tempArray;  
            while ((str = in.readLine()) != null) {  
                tempArray = str.split(" ");  
                dataArray.add(tempArray);  
            }  
            in.close();  
        } catch (IOException e) {  
            e.getStackTrace();  
        }  
  
        String[] temp;  
        int count = 0;  
        itemCountMap = new HashMap<>();  
        totalGoodsID = new ArrayList<>();  
        for (String[] a : dataArray) {  
            temp = new String[a.length - 1];  
            System.arraycopy(a, 1, temp, 0, a.length - 1);  
            totalGoodsID.add(temp);  
            for (String s : temp) {  
                if (!itemCountMap.containsKey(s)) {  
                    count = 1;  
                } else {  
                    count = ((int) itemCountMap.get(s));  
                    // 支持度计数加1  
                    count++;  
                }  
                // 更新表项  
                itemCountMap.put(s, count);  
            }  
        }  
    }  
  
    /** 
     * 根据事物记录构造FP树 
     */  
    private void buildFPTree(ArrayList<String> suffixPattern,  
            ArrayList<ArrayList<TreeNode>> transctionList) {  
        // 设置一个空根节点  
        TreeNode rootNode = new TreeNode(null, 0);  
        int count = 0;  
        // 节点是否存在  
        boolean isExist = false;  
        ArrayList<TreeNode> childNodes;  
        ArrayList<TreeNode> pathList;  
        // 相同类型节点链表，用于构造的新的FP树  
        HashMap<String, ArrayList<TreeNode>> linkedNode = new HashMap<>();  
        HashMap<String, Integer> countNode = new HashMap<>();  
        // 根据事物记录，一步步构建FP树  
        for (ArrayList<TreeNode> array : transctionList) {  
            TreeNode searchedNode;  
            pathList = new ArrayList<>();  
            for (TreeNode node : array) {  
                pathList.add(node);  
                nodeCounted(node, countNode);  
                searchedNode = searchNode(rootNode, pathList);  
                childNodes = searchedNode.getChildNodes();  
  
                if (childNodes == null) {  
                    childNodes = new ArrayList<>();  
                    childNodes.add(node);  
                    searchedNode.setChildNodes(childNodes);  
                    node.setParentNode(searchedNode);  
                    nodeAddToLinkedList(node, linkedNode);  
                } else {  
                    isExist = false;  
                    for (TreeNode node2 : childNodes) {  
                        // 如果找到名称相同，则更新支持度计数  
                        if (node.getName().equals(node2.getName())) {  
                            count = node2.getCount() + node.getCount();  
                            node2.setCount(count);  
                            // 标识已找到节点位置  
                            isExist = true;  
                            break;  
                        }  
                    }  
  
                    if (!isExist) {  
                        // 如果没有找到，需添加子节点  
                        childNodes.add(node);  
                        node.setParentNode(searchedNode);  
                        nodeAddToLinkedList(node, linkedNode);  
                    }  
                }  
  
            }  
        }  
  
        // 如果FP树已经是单条路径，则输出此时的频繁模式  
        if (isSinglePath(rootNode)) {  
            printFrequentPattern(suffixPattern, rootNode);  
            System.out.println("-------");  
        } else {  
            ArrayList<ArrayList<TreeNode>> tList;  
            ArrayList<String> sPattern;  
            if (suffixPattern == null) {  
                sPattern = new ArrayList<>();  
            } else {  
                // 进行一个拷贝，避免互相引用的影响  
                sPattern = (ArrayList<String>) suffixPattern.clone();  
            }  
  
            // 利用节点链表构造新的事务  
            for (Map.Entry entry : countNode.entrySet()) {  
                // 添加到后缀模式中  
                sPattern.add((String) entry.getKey());  
                //获取到了条件模式机，作为新的事务  
                tList = getTransactionList((String) entry.getKey(), linkedNode);  
                  
                System.out.print("[后缀模式]：{");  
                for(String s: sPattern){  
                    System.out.print(s + ", ");  
                }  
                System.out.print("}, 此时的条件模式基：");  
                for(ArrayList<TreeNode> tnList: tList){  
                    System.out.print("{");  
                    for(TreeNode n: tnList){  
                        System.out.print(n.getName() + ", ");  
                    }  
                    System.out.print("}, ");  
                }  
                System.out.println();  
                // 递归构造FP树  
                buildFPTree(sPattern, tList);  
                // 再次移除此项，构造不同的后缀模式，防止对后面造成干扰  
                sPattern.remove((String) entry.getKey());  
            }  
        }  
    }  
  
    /** 
     * 将节点加入到同类型节点的链表中 
     *  
     * @param node 
     *            待加入节点 
     * @param linkedList 
     *            链表图 
     */  
    private void nodeAddToLinkedList(TreeNode node,  
            HashMap<String, ArrayList<TreeNode>> linkedList) {  
        String name = node.getName();  
        ArrayList<TreeNode> list;  
  
        if (linkedList.containsKey(name)) {  
            list = linkedList.get(name);  
            // 将node添加到此队列中  
            list.add(node);  
        } else {  
            list = new ArrayList<>();  
            list.add(node);  
            linkedList.put(name, list);  
        }  
    }  
  
    /** 
     * 根据链表构造出新的事务 
     *  
     * @param name 
     *            节点名称 
     * @param linkedList 
     *            链表 
     * @return 
     */  
    private ArrayList<ArrayList<TreeNode>> getTransactionList(String name,  
            HashMap<String, ArrayList<TreeNode>> linkedList) {  
        ArrayList<ArrayList<TreeNode>> tList = new ArrayList<>();  
        ArrayList<TreeNode> targetNode = linkedList.get(name);  
        ArrayList<TreeNode> singleTansaction;  
        TreeNode temp;  
  
        for (TreeNode node : targetNode) {  
            singleTansaction = new ArrayList<>();  
  
            temp = node;  
            while (temp.getParentNode().getName() != null) {  
                temp = temp.getParentNode();  
                singleTansaction.add(new TreeNode(temp.getName(), 1));  
            }  
  
            // 按照支持度计数得反转一下  
            Collections.reverse(singleTansaction);  
  
            for (TreeNode node2 : singleTansaction) {  
                // 支持度计数调成与模式后缀一样  
                node2.setCount(node.getCount());  
            }  
  
            if (singleTansaction.size() > 0) {  
                tList.add(singleTansaction);  
            }  
        }  
  
        return tList;  
    }  
  
    /** 
     * 节点计数 
     *  
     * @param node 
     *            待加入节点 
     * @param nodeCount 
     *            计数映射图 
     */  
    private void nodeCounted(TreeNode node, HashMap<String, Integer> nodeCount) {  
        int count = 0;  
        String name = node.getName();  
  
        if (nodeCount.containsKey(name)) {  
            count = nodeCount.get(name);  
            count++;  
        } else {  
            count = 1;  
        }  
  
        nodeCount.put(name, count);  
    }  
  
    /** 
     * 显示决策树 
     *  
     * @param node 
     *            待显示的节点 
     * @param blankNum 
     *            行空格符，用于显示树型结构 
     */  
    private void showFPTree(TreeNode node, int blankNum) {  
        System.out.println();  
        for (int i = 0; i < blankNum; i++) {  
            System.out.print("\t");  
        }  
        System.out.print("--");  
        System.out.print("--");  
  
        if (node.getChildNodes() == null) {  
            System.out.print("[");  
            System.out.print("I" + node.getName() + ":" + node.getCount());  
            System.out.print("]");  
        } else {  
            // 递归显示子节点  
            // System.out.print("【" + node.getName() + "】");  
            for (TreeNode childNode : node.getChildNodes()) {  
                showFPTree(childNode, 2 * blankNum);  
            }  
        }  
  
    }  
  
    /** 
     * 待插入节点的抵达位置节点，从根节点开始向下寻找待插入节点的位置 
     *  
     * @param root 
     * @param list 
     * @return 
     */  
    private TreeNode searchNode(TreeNode node, ArrayList<TreeNode> list) {  
        ArrayList<TreeNode> pathList = new ArrayList<>();  
        TreeNode tempNode = null;  
        TreeNode firstNode = list.get(0);  
        boolean isExist = false;  
        // 重新转一遍，避免出现同一引用  
        for (TreeNode node2 : list) {  
            pathList.add(node2);  
        }  
  
        // 如果没有孩子节点，则直接返回，在此节点下添加子节点  
        if (node.getChildNodes() == null) {  
            return node;  
        }  
  
        for (TreeNode n : node.getChildNodes()) {  
            if (n.getName().equals(firstNode.getName()) && list.size() == 1) {  
                tempNode = node;  
                isExist = true;  
                break;  
            } else if (n.getName().equals(firstNode.getName())) {  
                // 还没有找到最后的位置，继续找  
                pathList.remove(firstNode);  
                tempNode = searchNode(n, pathList);  
                return tempNode;  
            }  
        }  
  
        // 如果没有找到，则新添加到孩子节点中  
        if (!isExist) {  
            tempNode = node;  
        }  
  
        return tempNode;  
    }  
  
    /** 
     * 判断目前构造的FP树是否是单条路径的 
     *  
     * @param rootNode 
     *            当前FP树的根节点 
     * @return 
     */  
    private boolean isSinglePath(TreeNode rootNode) {  
        // 默认是单条路径  
        boolean isSinglePath = true;  
        ArrayList<TreeNode> childList;  
        TreeNode node;  
        node = rootNode;  
  
        while (node.getChildNodes() != null) {  
            childList = node.getChildNodes();  
            if (childList.size() == 1) {  
                node = childList.get(0);  
            } else {  
                isSinglePath = false;  
                break;  
            }  
        }  
  
        return isSinglePath;  
    }  
  
    /** 
     * 开始构建FP树 
     */  
    public void startBuildingTree() {  
        ArrayList<TreeNode> singleTransaction;  
        ArrayList<ArrayList<TreeNode>> transactionList = new ArrayList<>();  
        TreeNode tempNode;  
        int count = 0;  
  
        for (String[] idArray : totalGoodsID) {  
            singleTransaction = new ArrayList<>();  
            for (String id : idArray) {  
                count = itemCountMap.get(id);  
                tempNode = new TreeNode(id, count);  
                singleTransaction.add(tempNode);  
            }  
  
            // 根据支持度数的多少进行排序  
            Collections.sort(singleTransaction);  
            for (TreeNode node : singleTransaction) {  
                // 支持度计数重新归为1  
                node.setCount(1);  
            }  
            transactionList.add(singleTransaction);  
        }  
  
        buildFPTree(null, transactionList);  
    }  
  
    /** 
     * 输出此单条路径下的频繁模式 
     *  
     * @param suffixPattern 
     *            后缀模式 
     * @param rootNode 
     *            单条路径FP树根节点 
     */  
    private void printFrequentPattern(ArrayList<String> suffixPattern,  
            TreeNode rootNode) {  
        ArrayList<String> idArray = new ArrayList<>();  
        TreeNode temp;  
        temp = rootNode;  
        // 用于输出组合模式  
        int length = 0;  
        int num = 0;  
        int[] binaryArray;  
  
        while (temp.getChildNodes() != null) {  
            temp = temp.getChildNodes().get(0);  
  
            // 筛选支持度系数大于最小阈值的值  
            if (temp.getCount() >= minSupportCount) {  
                idArray.add(temp.getName());  
            }  
        }  
  
        length = idArray.size();  
        num = (int) Math.pow(2, length);  
        for (int i = 0; i < num; i++) {  
            binaryArray = new int[length];  
            numToBinaryArray(binaryArray, i);  
  
            // 如果后缀模式只有1个，不能输出自身  
            if (suffixPattern.size() == 1 && i == 0) {  
                continue;  
            }  
  
            System.out.print("频繁模式：{【后缀模式：");  
            // 先输出固有的后缀模式  
            if (suffixPattern.size() > 1  
                    || (suffixPattern.size() == 1 && idArray.size() > 0)) {  
                for (String s : suffixPattern) {  
                    System.out.print(s + ", ");  
                }  
            }  
            System.out.print("】");  
            // 输出路径上的组合模式  
            for (int j = 0; j < length; j++) {  
                if (binaryArray[j] == 1) {  
                    System.out.print(idArray.get(j) + ", ");  
                }  
            }  
            System.out.println("}");  
        }  
    }  
  
    /** 
     * 数字转为二进制形式 
     *  
     * @param binaryArray 
     *            转化后的二进制数组形式 
     * @param num 
     *            待转化数字 
     */  
    private void numToBinaryArray(int[] binaryArray, int num) {  
        int index = 0;  
        while (num != 0) {  
            binaryArray[index] = num % 2;  
            index++;  
            num /= 2;  
        }  
    }  
  
}  

算法调用测试类：

[java] view plain copy
 print?
/** 
 * FPTree频繁模式树算法 
 * @author lyq 
 * 
 */  
public class Client {  
    public static void main(String[] args){  
        String filePath = "C:\\Users\\lyq\\Desktop\\icon\\testInput.txt";  
        //最小支持度阈值  
        int minSupportCount = 2;  
          
        FPTreeTool tool = new FPTreeTool(filePath, minSupportCount);  
        tool.startBuildingTree();  
    }  
}  

输出的结果为：

[java] view plain copy
 print?
[后缀模式]：{3, }, 此时的条件模式基：{2, }, {1, }, {2, 1, },   
[后缀模式]：{3, 2, }, 此时的条件模式基：  
频繁模式：{【后缀模式：3, 2, 】}  
-------  
[后缀模式]：{3, 1, }, 此时的条件模式基：{2, },   
频繁模式：{【后缀模式：3, 1, 】}  
频繁模式：{【后缀模式：3, 1, 】2, }  
-------  
[后缀模式]：{2, }, 此时的条件模式基：  
-------  
[后缀模式]：{1, }, 此时的条件模式基：{2, },   
频繁模式：{【后缀模式：1, 】2, }  
-------  
[后缀模式]：{5, }, 此时的条件模式基：{2, 1, }, {2, 1, 3, },   
频繁模式：{【后缀模式：5, 】2, }  
频繁模式：{【后缀模式：5, 】1, }  
频繁模式：{【后缀模式：5, 】2, 1, }  
-------  
[后缀模式]：{4, }, 此时的条件模式基：{2, }, {2, 1, },   
频繁模式：{【后缀模式：4, 】2, }  
-------  

读者可以自己手动的构造一下，可以更深的理解这个过程，然后对照本人的代码做对比。

算法编码时的难点

1、在构造树的时候要重新构建一棵树的时候，要不能对原来的树做更改，在此期间用了老的树的对象，又造成了重复引用的问题了，于是果断又new了一个TreeNode，只把原树的name，和count值拿了过来，父子节点关系完全重新构造。

2、在事务生产树的过程中，把事务映射到TreeNode数组中，然后过程就是加Node节点或者更新Node节点的count值，过程简单许多，也许会让人很难理解，应该个人感觉这样比较方便，如果是死板的String[]字符串数组的形式，中间还要与TreeNode各种转化非常麻烦。

3、在计算条件模式基的时候，我是存在了HashMap<String, ArrayList<TreeNode>>map中，并并没有搞成链表的形式，直接在生成树的时候就全部统计好。

4、此处算法用了2处递归，一个地方是在添加树节点的时候，搜索要在哪个node上做添加的方法，searchNode(TreeNode node, ArrayList<TreeNode> list)，还有一个是整个的buildFPTree()算法，都不是能够一眼就能看明白的地方。希望大家能够理解我的用意。

FP-Tree算法的缺点

尽管FP-Tree算法在挖掘频繁模式的过程中相较Apriori算法里没有产生候选集了，比Apriori也快了一个数量级上了，但是整体上FP-Tree算法的时间，空间消耗开销上还是挺大的。

0 0