Huffman Encoding and Decoding: An Implementation

Source: Internet. Editor: 程序博客网. Time: 2024/05/04 21:59

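Huffman coding exploits the fact that characters occur with different frequencies: it represents each character with a bit string (a sequence of 0s and 1s) of varying length, giving shorter codes to more frequent characters, so that the total data size shrinks. Suppose we have a file of 100,000 characters drawn from only six distinct characters, "abcdef". The table below gives the frequency of each character in the file (in thousands) together with two different encodings.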

                              a    b    c    d    e     f
    Frequency (in thousands)  45   13   12   16   9     5
    Fixed-length codeword     000  001  010  011  100   101
    Variable-length codeword  0    101  100  111  1101  1100

In this example, the fixed-length code needs at least three bits per character, so the whole file takes 300,000 bits. The variable-length code takes:

(45 * 1 + 13 * 3 + 12 * 3 + 16 * 3 + 9 * 4 + 5 * 4) · 1,000 = 224,000 bits

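The second encoding saves about 25% of the space compared with the first. The variable-length code above is an example of a prefix code: no codeword is a prefix of any other codeword, so an encoded bit stream can be decoded unambiguously.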
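The arithmetic above can be checked with a small sketch. The class name `CodeLengthDemo` and its helpers are hypothetical, introduced only for illustration; the frequencies and codewords are the ones from the table.

```java
class CodeLengthDemo {
    // Data taken from the table: symbols a..f, frequencies in thousands.
    static final int[] FREQ_THOUSANDS = {45, 13, 12, 16, 9, 5};
    static final String[] FIXED = {"000", "001", "010", "011", "100", "101"};
    static final String[] VARIABLE = {"0", "101", "100", "111", "1101", "1100"};

    // Total size in bits: sum of (frequency * codeword length), scaled by 1,000.
    static long totalBits(int[] freqThousands, String[] code) {
        long total = 0;
        for (int i = 0; i < code.length; i++) {
            total += (long) freqThousands[i] * code[i].length();
        }
        return total * 1000;
    }

    // A code is prefix-free when no codeword is a prefix of another symbol's codeword.
    static boolean isPrefixFree(String[] code) {
        for (int i = 0; i < code.length; i++) {
            for (int j = 0; j < code.length; j++) {
                if (i != j && code[j].startsWith(code[i])) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long fixed = totalBits(FREQ_THOUSANDS, FIXED);
        long variable = totalBits(FREQ_THOUSANDS, VARIABLE);
        System.out.println("fixed-length bits:    " + fixed);
        System.out.println("variable-length bits: " + variable);
        System.out.println("saving (fraction):    " + (double) (fixed - variable) / fixed);
        System.out.println("variable code is prefix-free: " + isPrefixFree(VARIABLE));
    }
}
```

Both codes in the table are prefix-free; the fixed-length one is trivially so, because all its codewords have the same length and are distinct.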

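Huffman coding is the method proposed by Huffman for constructing an optimal prefix code. It repeatedly picks the two nodes x and y with the smallest weights, merges them into a new node z whose weight is the sum of theirs, puts z back in place of x and y, and then again picks the two smallest nodes. This repeats until only one node, the root, remains.

Illustration: [figure: step-by-step construction of the Huffman tree]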

The slides at www.cs.nyu.edu/~melamed/courses/102/lectures/huffman.ppt walk through the process more clearly.
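As a minimal sketch of the merging rule only (it tracks weights, not tree nodes), the following applies it to the example frequencies. The class `MergeOrderDemo` and the method `mergedWeights` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class MergeOrderDemo {
    // Repeatedly removes the two smallest weights, merges them,
    // and records the weight of each merged node in creation order.
    static List<Integer> mergedWeights(int[] weights) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        for (int w : weights) pq.add(w);

        List<Integer> merged = new ArrayList<>();
        while (pq.size() > 1) {
            int x = pq.poll();   // smallest remaining weight
            int y = pq.poll();   // second smallest
            int z = x + y;       // weight of the merged node
            merged.add(z);
            pq.add(z);           // z replaces x and y
        }
        return merged;
    }

    public static void main(String[] args) {
        // Frequencies (in thousands) of a, b, c, d, e, f from the table.
        int[] freq = {45, 13, 12, 16, 9, 5};
        System.out.println(mergedWeights(freq)); // prints [14, 25, 30, 55, 100]
    }
}
```

The merged weights sum to 224, which matches the 224,000 bits computed earlier: each character's frequency is counted once for every internal node above it, that is, once per bit of its codeword.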

As the figure above shows, implementing Huffman coding requires building a binary tree, so we first define a Node class.

    // Huffman trie node
    class Node {
        char ch;
        int freq;
        Node left, right;

        Node(char ch, int freq, Node left, Node right) {
            this.ch    = ch;
            this.freq  = freq;
            this.left  = left;
            this.right = right;
        }

        // is the node a leaf node?
        boolean isLeaf() {
            assert (left == null && right == null) || (left != null && right != null);
            return (left == null && right == null);
        }
    }

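With the Node class in place, we need the frequency of each character. Here we store the counts in an int array of size 256 (the constant R in the code, so only characters with code points below 256 are supported). We then create a Node for each character that occurs, together with its frequency, and put it into a priority queue ordered by frequency. From that queue we can build the binary tree.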

Once the tree is built, we can derive the codeword for every character by walking the tree.

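    import java.util.Comparator;
    import java.util.PriorityQueue;

    // alphabet size (extended ASCII); a field of the enclosing class
    private static final int R = 256;

    // Builds the Huffman tree for the input and returns the code table.
    // Assumes the input contains at least two distinct characters, all below R.
    public String[] compress(String in) {
        char[] input = in.toCharArray();

        // get the frequency counts
        int[] freq = new int[R];
        for (int i = 0; i < input.length; i++)
            freq[input[i]]++;

        // initialize priority queue, ordered by frequency
        PriorityQueue<Node> pq = new PriorityQueue<Node>(11, new Comparator<Node>() {
            public int compare(Node n1, Node n2) {
                // Integer.compare avoids the overflow that subtraction can cause
                return Integer.compare(n1.freq, n2.freq);
            }
        });

        // insert one leaf node per occurring character into the priority queue
        for (char i = 0; i < R; i++)
            if (freq[i] > 0)
                pq.add(new Node(i, freq[i], null, null));

        // merge the two smallest trees until a single tree remains
        while (pq.size() > 1) {
            Node left  = pq.poll();
            Node right = pq.poll();
            Node parent = new Node('\0', left.freq + right.freq, left, right);
            pq.add(parent);
        }
        Node root = pq.poll();

        // build code table
        String[] table = new String[R];
        buildCode(table, root, "");
        return table;
    }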

    // make a lookup table from symbols and their encodings
    public void buildCode(String[] st, Node x, String s) {
        if (!x.isLeaf()) {
            buildCode(st, x.left,  s + '0');
            buildCode(st, x.right, s + '1');
        }
        else {
            st[x.ch] = s;
        }
    }

Once we have the lookup table, we can encode the original string by replacing each character with its codeword.
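The article does not show the encoding step itself, so here is a minimal sketch of it. The names `EncodeDemo`, `encode`, and `exampleTable` are hypothetical, and the table is filled in by hand with the variable-length codewords from the example rather than produced by `compress`, whose output may differ in how ties and left/right children are assigned.

```java
class EncodeDemo {
    static final int R = 256; // alphabet size, matching the article's assumption

    // Concatenates the codeword of every character in the input.
    static String encode(String in, String[] table) {
        StringBuilder sb = new StringBuilder();
        for (char c : in.toCharArray()) {
            if (c >= table.length || table[c] == null) {
                throw new IllegalArgumentException("no codeword for character: " + c);
            }
            sb.append(table[c]);
        }
        return sb.toString();
    }

    // Hand-written table holding the variable-length code from the example.
    static String[] exampleTable() {
        String[] t = new String[R];
        t['a'] = "0";
        t['b'] = "101";
        t['c'] = "100";
        t['d'] = "111";
        t['e'] = "1101";
        t['f'] = "1100";
        return t;
    }

    public static void main(String[] args) {
        String[] table = exampleTable();
        System.out.println(encode("abc", table));  // prints 0101100
        System.out.println(encode("face", table)); // prints 110001001101
    }
}
```

A string of '0' and '1' characters keeps the sketch readable; a real compressor would pack the bits into bytes.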

So how do we decode the encoded string?

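To decode, we need the binary tree generated during encoding. Starting at the root, we read the encoding one bit at a time: on 0 we move to the left child, otherwise to the right child. When we reach a leaf, we output its character and return to the root.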

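    // Decodes recursively, consuming one bit per call.
    // Initial call: decode(root, root, encoding, sb).
    // Note: the recursion depth equals the number of bits, so very long
    // encodings can overflow the stack.
    public void decode(Node root, Node node, String encoding, StringBuilder sb) {
        if (encoding.equals("")) return;
        if (encoding.charAt(0) == '0') {
            node = node.left;
        } else {
            node = node.right;
        }
        if (node.isLeaf()) {
            sb.append(node.ch);
            node = root;   // start over from the root for the next character
        }
        decode(root, node, encoding.substring(1), sb);
    }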
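The same walk can also be written as a loop, which avoids one recursive call and one substring copy per bit. The following is a self-contained sketch, not the article's code: the class `HuffmanRoundTrip` and its methods `buildTree`, `encode`, and `decode` are hypothetical names, and it assumes an input with at least two distinct characters, all with code points below 256.

```java
import java.util.PriorityQueue;

class HuffmanRoundTrip {
    static final int R = 256; // assumed alphabet size

    static class Node {
        final char ch;
        final int freq;
        final Node left, right;
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Builds the Huffman tree by repeatedly merging the two lightest trees.
    static Node buildTree(String in) {
        int[] freq = new int[R];
        for (char c : in.toCharArray()) freq[c]++;
        PriorityQueue<Node> pq =
            new PriorityQueue<>((a, b) -> Integer.compare(a.freq, b.freq));
        for (char c = 0; c < R; c++)
            if (freq[c] > 0) pq.add(new Node(c, freq[c], null, null));
        while (pq.size() > 1) {
            Node left = pq.poll();
            Node right = pq.poll();
            pq.add(new Node('\0', left.freq + right.freq, left, right));
        }
        return pq.poll();
    }

    // Fills the table with the root-to-leaf path of every symbol.
    static void buildCode(String[] st, Node x, String s) {
        if (x.isLeaf()) { st[x.ch] = s; return; }
        buildCode(st, x.left, s + '0');
        buildCode(st, x.right, s + '1');
    }

    static String encode(String in, String[] table) {
        StringBuilder sb = new StringBuilder();
        for (char c : in.toCharArray()) sb.append(table[c]);
        return sb.toString();
    }

    // Iterative decoding: follow one edge per bit, emit at each leaf, restart at the root.
    static String decode(Node root, String bits) {
        StringBuilder sb = new StringBuilder();
        Node node = root;
        for (int i = 0; i < bits.length(); i++) {
            node = (bits.charAt(i) == '0') ? node.left : node.right;
            if (node.isLeaf()) {
                sb.append(node.ch);
                node = root;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "abracadabra";
        Node root = buildTree(text);
        String[] table = new String[R];
        buildCode(table, root, "");
        String bits = encode(text, table);
        System.out.println(bits.length() + " bits");
        System.out.println(decode(root, bits)); // prints abracadabra
    }
}
```

The decoder needs the same tree that produced the code table, so a real file format would have to store the tree (or the frequencies) alongside the encoded bits.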

References: http://algs4.cs.princeton.edu/55compression/Huffman.java.html
http://www.roading.org