Coursera Assignment: Huffman Coding Trees

Source: Internet | Published under: java web项目 | Editor: 程序博客网 | Time: 2024/06/06 14:15

https://class.coursera.org/progfun-003/assignment/view?assignment_id=15

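Preface (ramblings):
This assignment took me longer than the previous ones, about 6 hours in total. For one of the problems I peeked at another student's approach (his solution actually seemed to be wrong, but I did peek, and it gave me some inspiration).
For another problem I read a small hint from a teaching assistant.

After these few assignments I suddenly realized that grinding through problems on Coursera is more fun than playing games. My first submission scored 9.55, which annoyed me; I would not be satisfied with anything less than 10. Is that a kind of obsessive-compulsive behavior, or perfectionism?
Besides, I think the problems set by the instructor of this course are superb! Some of the solutions are really clever and classic, and the hints that come with the problems are just right and truly enlightening.
After many years, this brought back the feeling of doing problems in high school: you think a problem over and over, inspiration suddenly strikes, you write down just a few words, and it is correct. The assignments in this course are much the same. In functional programming, what matters is not how much code you write but how precise it is; often fewer than 10 lines are enough to implement a function with complex behavior.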

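Written on the evening of 2013.12.14

The introduction to Huffman coding trees is omitted here.

In the figure, the numbers are weights and the strings are the character strings of the corresponding nodes.

Note:
Note that a given encoding is only optimal if the character frequencies in the encoded text match the weights in the code tree.
In other words, a Huffman coding tree can be called optimal only when the frequencies with which the characters occur equal the weights in the tree.

One exercise in the assignment produces a non-optimal Huffman tree if you do not think it through carefully, and you lose points for that.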

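Encoding
Given a code tree, walk from the root down to a leaf, appending 0 for every step to the left and 1 for every step to the right until the leaf is reached. For example, in the code tree above the letter D is encoded as 1011.

Decoding
Decoding works the other way round: given a code tree and an encoded bit sequence, walk from the root to a leaf to obtain one character, then repeat. For example, 10001010 decodes to BAC.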

Problem:
Given classes:

abstract class CodeTree
case class Fork(left: CodeTree, right: CodeTree, chars: List[Char], weight: Int) extends CodeTree
case class Leaf(char: Char, weight: Int) extends CodeTree

Warm-up problems:

Write a function weight that returns the weight of a tree

def weight(tree: CodeTree): Int = tree match ...

chars which returns the list of characters defined in a given Huffman tree.

def chars(tree: CodeTree): List[Char] = tree match ...

def weight(tree: CodeTree): Int = tree match {
  case Fork(left, right, chars, wght) => weight(left) + weight(right)
  case Leaf(char, wght)               => wght
}

def chars(tree: CodeTree): List[Char] = tree match {
  case Fork(left, right, chs, wght) => chars(left) ::: chars(right)
  case Leaf(ch, weight)             => List(ch)
}

These two are not hard. After practicing recursion in the previous assignments they are easy to write: split into left and right, divide and conquer.
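To sanity-check the two warm-up functions, here is a minimal sketch that evaluates them on a small hand-built tree. The wrapper object WarmUpSketch and the value sample are names made up for this illustration; they are not part of the assignment.

```scala
object WarmUpSketch {
  sealed abstract class CodeTree
  case class Fork(left: CodeTree, right: CodeTree, chars: List[Char], weight: Int) extends CodeTree
  case class Leaf(char: Char, weight: Int) extends CodeTree

  // Divide and conquer: a fork's weight is the sum of the weights of its subtrees.
  def weight(tree: CodeTree): Int = tree match {
    case Fork(left, right, _, _) => weight(left) + weight(right)
    case Leaf(_, w)              => w
  }

  // A fork's characters are the left subtree's characters followed by the right's.
  def chars(tree: CodeTree): List[Char] = tree match {
    case Fork(left, right, _, _) => chars(left) ::: chars(right)
    case Leaf(c, _)              => List(c)
  }

  // Hypothetical sample tree: 'a' on the left, a fork of 'b' and 'c' on the right.
  val sample: CodeTree =
    Fork(Leaf('a', 2), Fork(Leaf('b', 1), Leaf('c', 1), List('b', 'c'), 2), List('a', 'b', 'c'), 4)
}
```

With this tree, WarmUpSketch.weight(sample) evaluates to 4 and WarmUpSketch.chars(sample) to List('a', 'b', 'c').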

Problem:
Build the Huffman coding tree
Given a text, it’s possible to calculate and build an optimal Huffman tree in the sense that the encoding of that text will be of the minimum possible length, meanwhile keeping all information (i.e., it is lossless).
That is, given a string, build an optimal Huffman coding tree so that the encoding of that string is as short as possible.

In the end we need to implement this function:
def createCodeTree(chars: List[Char]): CodeTree = ...

The implementation is split into several steps (that is, we first implement some helper functions).

Write a function times that counts how many times each character occurs in the string.
For example, times(List('a', 'b', 'a')) returns List(('a', 2), ('b', 1)).
It works like a HashMap.
def times(chars: List[Char]): List[(Char, Int)] = ...

My approach:
Use an accumulator as the return value. When adding a character, first check whether the accumulator already contains it; if so, drop that entry and put the tuple (key, value + 1) in its place. If not, append (key, 1).
Because the List and the tuples used here are immutable, an existing entry cannot be modified; it can only be removed and a new one added.
The code:

def times(chars: List[Char]): List[(Char, Int)] = {
  def hashMap(chs: List[Char], list: List[(Char, Int)]): List[(Char, Int)] = {
    if (chs.isEmpty) list
    else hashMap(chs.tail, addElement(chs.head, list))
  }
  def addElement(ch: Char, list: List[(Char, Int)]): List[(Char, Int)] = {
    if (list.isEmpty) (ch, 1) :: list
    // If found, take the head, build a new tuple, and put it at the front
    else if (list.head._1 == ch) (ch, list.head._2 + 1) :: list.tail
    // Not found yet: keep searching recursively
    else list.head :: addElement(ch, list.tail)
  }
  hashMap(chars, List[(Char, Int)]())
}
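The same accumulator idea can also be written as a fold. This is only a sketch of an alternative formulation, not the solution submitted above; the wrapper object TimesSketch is a made-up name.

```scala
object TimesSketch {
  // Fold over the characters, carrying the (char, count) list as the accumulator.
  // Entries keep the order in which each character first appeared.
  def times(chars: List[Char]): List[(Char, Int)] =
    chars.foldLeft(List.empty[(Char, Int)]) { (acc, ch) =>
      if (acc.exists(_._1 == ch))
        // Rebuild the list with the matching entry's count incremented.
        acc.map { case (c, n) => if (c == ch) (c, n + 1) else (c, n) }
      else
        // First occurrence: append a fresh entry with count 1.
        acc :+ ((ch, 1))
    }
}
```

For the example from the problem statement, TimesSketch.times(List('a', 'b', 'a')) gives List(('a', 2), ('b', 1)).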

Implement a function that returns the list of leaves built from freqs, sorted in ascending order of weight
def makeOrderedLeafList(freqs: List[(Char, Int)]): List[Leaf] = ...

This is just the insertion sort covered in the lectures.

def makeOrderedLeafList(freqs: List[(Char, Int)]): List[Leaf] = {
  def insertSort(pair: (Char, Int), list: List[Leaf]): List[Leaf] = {
    if (list.isEmpty) List(Leaf(pair._1, pair._2))
    else if (pair._2 <= list.head.weight) Leaf(pair._1, pair._2) :: list
    else list.head :: insertSort(pair, list.tail)
  }
  def loopInsert(list: List[(Char, Int)], retList: List[Leaf]): List[Leaf] = {
    if (list.isEmpty) retList
    else loopInsert(list.tail, insertSort(list.head, retList))
  }
  loopInsert(freqs, List[Leaf]())
}
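For comparison, the standard library can do the sorting. The sketch below swaps the hand-written insertion sort for List.sortBy; it is an alternative, not the approach used above, and LeafListSketch is a made-up wrapper name. One difference to be aware of: sortBy is stable, so leaves with equal weight keep their input order, whereas the insertion sort above places a new pair in front of existing leaves of the same weight.

```scala
object LeafListSketch {
  sealed abstract class CodeTree
  case class Leaf(char: Char, weight: Int) extends CodeTree

  // Sort the (char, frequency) pairs by frequency, then wrap each pair in a Leaf.
  def makeOrderedLeafList(freqs: List[(Char, Int)]): List[Leaf] =
    freqs.sortBy(_._2).map { case (c, w) => Leaf(c, w) }
}
```

For example, makeOrderedLeafList(List(('t', 2), ('e', 1), ('x', 3))) yields List(Leaf('e', 1), Leaf('t', 2), Leaf('x', 3)).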

Write a function singleton that checks whether trees contains only a single tree
In other words, whether trees has exactly one element

def singleton(trees: List[CodeTree]): Boolean = ...

def singleton(trees: List[CodeTree]): Boolean = {
  !trees.isEmpty && trees.tail.isEmpty
}

Write a function combine which (1) removes the two trees with the lowest weight from the list constructed in the previous step,
and (2) merges them by creating a new node of type Fork. Add this new tree to the list - which is now one element shorter - while preserving the order (by weight).

Implement combine: it takes the first two trees in trees, merges them into a new tree, and adds that tree to the list, removing the two old trees
A look at the test case makes it clear

val leaflist = List(Leaf('e', 1), Leaf('t', 2))
assert(combine(leaflist) === List(Fork(Leaf('e', 1), Leaf('t', 2), List('e', 't'), 3)))

I overlooked one sentence:
while preserving the order (by weight).
It means the list must still be sorted after combine. I did not do that at first and ended up debugging for more than an hour.

def combine(trees: List[CodeTree]): List[CodeTree] = ...

def combine(trees: List[CodeTree]): List[CodeTree] = {
  def insertSort(ins: CodeTree, list: List[CodeTree]): List[CodeTree] = {
    if (list.isEmpty) List(ins)
    else if (weight(ins) <= weight(list.head)) ins :: list
    else list.head :: insertSort(ins, list.tail)
  }
  if (trees.isEmpty || trees.tail.isEmpty) trees
  else {
    val l = trees.head
    val r = trees.tail.head
    // My earlier mistake: returning this directly, without writing and calling insertSort
    // Fork(l, r, chars(l) ::: chars(r), weight(l) + weight(r)) :: trees.tail.tail
    val fork = Fork(l, r, chars(l) ::: chars(r), weight(l) + weight(r)) //:: trees.tail.tail
    insertSort(fork, trees.tail.tail)
  }
}
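To see why the ordering matters, here is a small self-contained sketch with an input where the new fork must not stay at the head of the list. It uses pattern matching and List.span instead of the explicit insertSort helper, and reads the stored weight and chars fields directly; both are choices made for this illustration, and CombineSketch is a made-up wrapper name.

```scala
object CombineSketch {
  sealed abstract class CodeTree
  case class Fork(left: CodeTree, right: CodeTree, chars: List[Char], weight: Int) extends CodeTree
  case class Leaf(char: Char, weight: Int) extends CodeTree

  def weight(t: CodeTree): Int = t match {
    case Fork(_, _, _, w) => w
    case Leaf(_, w)       => w
  }

  def chars(t: CodeTree): List[Char] = t match {
    case Fork(_, _, cs, _) => cs
    case Leaf(c, _)        => List(c)
  }

  def combine(trees: List[CodeTree]): List[CodeTree] = trees match {
    case l :: r :: rest =>
      val fork = Fork(l, r, chars(l) ::: chars(r), weight(l) + weight(r))
      // Split the remaining trees into those strictly lighter than the fork and the others,
      // then put the fork between them so the list stays sorted by weight.
      val (lighter, heavier) = rest.span(t => weight(t) < weight(fork))
      lighter ::: fork :: heavier
    case _ => trees // zero or one tree: nothing to combine
  }
}
```

With the input List(Leaf('a', 1), Leaf('b', 1), Leaf('c', 1), Leaf('d', 5)), the fork of 'a' and 'b' has weight 2, so it lands after Leaf('c', 1) and before Leaf('d', 5). Simply prepending the fork would leave the list unsorted.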

Write a function until which calls the two functions defined above until this list contains only a single tree. This tree is the optimal coding tree. The function until can be used in the following way:
until(singleton, combine)(trees)
where the argument trees is of the type List[CodeTree].

This one looks fancy: we have to implement until, and the problem does not give an explicit signature, only
def until(xxx => ???, yyy => ??? )(zzz :???): List[CodeTree] = ???

It boils down to writing a function that is eventually called like this:
until(singleton, combine)(trees)
After thinking about it, the idea is roughly: keep calling combine(trees) until trees is a singleton.

def until(siglFunc: List[CodeTree] => Boolean,
          combFunc: List[CodeTree] => List[CodeTree])(trees: List[CodeTree]): List[CodeTree] = {
  if (siglFunc(trees)) trees
  else until(siglFunc, combFunc)(combFunc(trees))
}
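Stripped of the tree types, until is just "apply a step function until a predicate holds". The sketch below generalizes it over an arbitrary type A to make that shape visible. The type parameter and the names done, step, and state are made up for this illustration; the assignment's version is fixed to List[CodeTree].

```scala
object UntilSketch {
  import scala.annotation.tailrec

  // Keep applying `step` to `state` until `done(state)` is true, then return the state.
  // The recursive call is in tail position, so it compiles to a loop.
  @tailrec
  def until[A](done: A => Boolean, step: A => A)(state: A): A =
    if (done(state)) state
    else until(done, step)(step(state))
}
```

For instance, UntilSketch.until[Int](_ >= 100, _ * 2)(3) doubles 3 through 6, 12, 24, 48, 96 and stops at 192. Note that the loop never ends if the predicate can never become true, which for the tree version would be the case for an empty list.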

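Finally, implement createCodeTree, which builds the code tree

def createCodeTree(chars: List[Char]): CodeTree = {
  until(singleton, combine)(makeOrderedLeafList(times(chars))).head
}

With all of the functions above in place, the function that builds the Huffman coding tree is done
(so far we only have the tree; encoding and decoding are not implemented yet)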

Decoding:
type Bit = Int
Implement the function decode: the input is the encoded bit sequence bits, and the output is the decoded list of characters
The idea is roughly:
1. If we are not at a leaf yet, look at the head of the bit sequence: 0 means go left, 1 means go right
2. If we are at a leaf, output one character; if the bit sequence is not exhausted, continue from the root (recursing on the sequence itself rather than its tail, because the current head has not been consumed yet)
def decode(tree: CodeTree, bits: List[Bit]): List[Char] = ...

def decode(tree: CodeTree, bits: List[Bit]): List[Char] = {
  def help(t: CodeTree, b: List[Bit], ret: List[Char]): List[Char] = {
    t match {
      case Fork(left, right, chs, wght) =>
        if (b.head == 0) help(left, b.tail, ret) else help(right, b.tail, ret)
      case Leaf(ch, weight) => {
        // here can't use help(tree, b.tail, ret ::: List(ch))
        if (b.isEmpty) ret ::: List(ch) else help(tree, b, ret ::: List(ch))
      }
    }
  }
  help(tree, bits, List())
}
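The two steps above can be tried out on a tiny tree. The sketch below is a pattern-matching variant of the same idea; unlike the version above it does not call head on an empty list when the bits run out in the middle of a code, it simply drops the incomplete code. That behavior, the wrapper object DecodeSketch, and the sample tree are all assumptions made for this illustration. It also assumes the root is a Fork: with a single-leaf tree and a non-empty bit list no bit is ever consumed, so the loop would not terminate.

```scala
object DecodeSketch {
  sealed abstract class CodeTree
  case class Fork(left: CodeTree, right: CodeTree, chars: List[Char], weight: Int) extends CodeTree
  case class Leaf(char: Char, weight: Int) extends CodeTree
  type Bit = Int

  // Assumes `tree` is a Fork at the root (see the note in the lead-in).
  def decode(tree: CodeTree, bits: List[Bit]): List[Char] = {
    @annotation.tailrec
    def loop(node: CodeTree, rest: List[Bit], acc: List[Char]): List[Char] = node match {
      case Leaf(c, _) =>
        // A leaf consumes no bit: emit the character and restart from the root.
        if (rest.isEmpty) (c :: acc).reverse else loop(tree, rest, c :: acc)
      case Fork(l, r, _, _) => rest match {
        case 0 :: tail => loop(l, tail, acc) // 0 means go left
        case _ :: tail => loop(r, tail, acc) // anything else is treated as 1: go right
        case Nil       => acc.reverse        // bits ended inside a code: drop the partial code
      }
    }
    loop(tree, bits, Nil)
  }

  // Hypothetical sample tree with codes a = 0, b = 10, c = 11.
  val sample: CodeTree =
    Fork(Leaf('a', 2), Fork(Leaf('b', 1), Leaf('c', 1), List('b', 'c'), 2), List('a', 'b', 'c'), 4)
}
```

Decoding List(1, 0, 0, 1, 1) with this tree reads 10, then 0, then 11, and gives List('b', 'a', 'c').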

Encoding:
This section deals with the Huffman encoding of a sequence of characters into a sequence of bits.

Define a function encode that, for a given code tree, takes a list of characters and returns the encoded bit sequence

Your implementation must traverse the coding tree for each character, a task that should be done using a helper function.
You have to traverse the whole code tree for every character (very inefficient, but it is a solution that works, and the assignment asks us to try it first)
def encode(tree: CodeTree)(text: List[Char]): List[Bit] = ...

def encode(tree: CodeTree)(text: List[Char]): List[Bit] = {
  def encdChar(t: CodeTree, c: Char, ret: List[Bit]): List[Bit] = {
    t match {
      // not ret::encdChar(left, c, ret ::: List(0)) ::: encdChar(right, c, ret ::: List(1))
      case Fork(left, right, chs, wght) =>
        encdChar(left, c, ret ::: List(0)) ::: encdChar(right, c, ret ::: List(1))
      case Leaf(ch, weight) => if (c == ch) ret else List()
    }
  }
  def encd(t: CodeTree, x: List[Char], ret: List[Bit]): List[Bit] = {
    if (x.isEmpty) ret
    else encdChar(t, x.head, ret) ::: encd(t, x.tail, ret)
  }
  encd(tree, text, List[Bit]())
}

A rather ugly function: for every character it walks both the left and the right side of the tree to find the matching code
(very error-prone)
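One way to avoid exploring both branches is to use the chars list stored in each Fork to decide which side contains the character. The following is a sketch of that alternative, not the solution above. It assumes every character of the text occurs in the tree (an unknown character would silently be sent down the right branch), and EncodeSketch and the sample tree are made-up names.

```scala
object EncodeSketch {
  sealed abstract class CodeTree
  case class Fork(left: CodeTree, right: CodeTree, chars: List[Char], weight: Int) extends CodeTree
  case class Leaf(char: Char, weight: Int) extends CodeTree
  type Bit = Int

  def chars(t: CodeTree): List[Char] = t match {
    case Fork(_, _, cs, _) => cs
    case Leaf(c, _)        => List(c)
  }

  def encode(tree: CodeTree)(text: List[Char]): List[Bit] = {
    // Walk a single path from the root to the leaf holding `c`.
    def encodeChar(node: CodeTree, c: Char): List[Bit] = node match {
      case Leaf(_, _) => Nil
      case Fork(l, r, _, _) =>
        if (chars(l).contains(c)) 0 :: encodeChar(l, c)
        else 1 :: encodeChar(r, c) // assumption: c is somewhere in the right subtree
    }
    text.flatMap(c => encodeChar(tree, c))
  }

  // Hypothetical sample tree with codes a = 0, b = 10, c = 11.
  val sample: CodeTree =
    Fork(Leaf('a', 2), Fork(Leaf('b', 1), Leaf('c', 1), List('b', 'c'), 2), List('a', 'b', 'c'), 4)
}
```

With the sample tree, encode(sample)(List('b', 'a', 'c')) gives List(1, 0, 0, 1, 1).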

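Then the assignment adds a twist: we can actually write a better encode function, named quickEncode for now
def quickEncode(tree: CodeTree)(text: List[Char]): List[Bit] = ...

To implement it we first define a type:
type CodeTable = List[(Char, List[Bit])]
It works like a HashMap: given a character, it yields the bit sequence

The encoding step uses a function codeBits, which takes a table and a character and returns the bit sequence
def codeBits(table: CodeTable)(char: Char): List[Bit] = ...

The CodeTable is created by the function convert, which traverses the whole code tree and generates the table
def convert(t: CodeTree): CodeTable = ...

convert in turn is implemented with mergeCodeTables
def mergeCodeTables(a: CodeTable, b: CodeTable): CodeTable = ...

We are then asked to implement all of the functions mentioned above

def quickEncode(tree: CodeTree)(text: List[Char]): List[Bit] = {
  if (text.isEmpty) List()
  else codeBits(convert(tree))(text.head) ::: quickEncode(tree)(text.tail)
}

This one is simple: codeBits looks up the code of the first char, then we recurse on the rest of the text

def codeBits(table: CodeTable)(char: Char): List[Bit] = {
  if (table.head._1 == char) table.head._2
  else codeBits(table.tail)(char)
}

Looking up a character in the table is simple too: return it if found, otherwise recurse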
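A possible refinement, offered only as a sketch: quickEncode as written calls convert(tree) again for every character, and codeBits scans the list linearly. If the table is built once, it can be turned into a Map and reused for the whole text. The function name encodeWithTable and the wrapper object TableSketch are made up for this illustration, and the function takes a ready-made table instead of a tree so that the block stays self-contained.

```scala
object TableSketch {
  type Bit = Int
  type CodeTable = List[(Char, List[Bit])]

  // Build the lookup structure once, then encode every character with it.
  // A character missing from the table makes the Map lookup throw NoSuchElementException.
  def encodeWithTable(table: CodeTable)(text: List[Char]): List[Bit] = {
    val lookup: Map[Char, List[Bit]] = table.toMap
    text.flatMap(c => lookup(c))
  }

  // Hypothetical table with codes a = 0, b = 10, c = 11.
  val sample: CodeTable = List(('a', List(0)), ('b', List(1, 0)), ('c', List(1, 1)))
}
```

With the sample table, encodeWithTable(sample)(List('b', 'a', 'c')) gives List(1, 0, 0, 1, 1).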

def convert(tree: CodeTree): CodeTable = {
  //      def encdChar(t: CodeTree, bits: List[Bit], ret: CodeTable): CodeTable = {
  //        t match {
  //          case Fork(left, right, chs, wght) => encdChar(left, bits ::: List(0), ret) ::: encdChar(right, bits ::: List(1), ret)
  //          case Leaf(ch, weight)             => (ch, bits) :: ret
  //        }
  //      }
  //    encdChar(tree, List[Bit](), List[(Char, List[Bit])]())
  tree match {
    case Fork(left, right, chs, wght) => mergeCodeTables(convert(left), convert(right))
    case Leaf(ch, weight)             => List((ch, List[Bit]()))
  }
}

Now comes the interesting function!
At first I did not use mergeCodeTables to implement convert; I traversed the tree as before (the commented-out code). Only after finishing did I see that mergeCodeTables was supposed to be used
On the forum, other students had asked similar questions
The teaching assistant's answer was to think carefully about what mergeCodeTables does, and what it means for leaf nodes

So I started thinking about how to implement mergeCodeTables, because without knowing that there is no way to write convert with it
After about 15 minutes of thinking I noticed a pattern

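For any two nodes being merged, the following rule holds:
  X
 / \
X0 X1

That is, suppose nodes M and N are merged into K; then M's code is K's code followed by 0, and N's code is K's code followed by 1

Now think about it the other way round. Suppose M and N are leaves, so each of their CodeTables holds a single entry whose bit list is still empty. When they are merged,
mergeCodeTables(M,N)
has to return a CodeTable containing ((M,0),(N,1))
Going one step further, if node S is then merged with this new node (the merge of M and N), mergeCodeTables(S,((M,0),(N,1)))
the resulting table is ((S,0),(M,10),(N,11))

The rule: on every merge, prepend 0 to the bits of the entries coming from the left and 1 to the bits of the entries coming from the right

So I wrote:

def mergeCodeTables(a: CodeTable, b: CodeTable): CodeTable = {
  // for each item in a, insert 0 in front of the item
  // for each item in b, insert 1 in front of the item
  def help(t: CodeTable, bit: Bit): CodeTable = {
    if (t.isEmpty) t
    else (t.head._1, bit :: t.head._2) :: help(t.tail, bit)
  }
  help(a, 0) ::: help(b, 1)
}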
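The S, M, N walk-through above can be replayed directly. This sketch writes the prepend step with map instead of explicit recursion and then performs the two merges from the text; the wrapper object MergeSketch and the helper name prefix are made up for the illustration.

```scala
object MergeSketch {
  type Bit = Int
  type CodeTable = List[(Char, List[Bit])]

  def mergeCodeTables(a: CodeTable, b: CodeTable): CodeTable = {
    // Prepend `bit` to the code of every entry in the table.
    def prefix(bit: Bit)(t: CodeTable): CodeTable =
      t.map { case (c, bits) => (c, bit :: bits) }
    prefix(0)(a) ::: prefix(1)(b)
  }

  // Leaf tables: one entry each, with an empty code.
  val m: CodeTable = List(('M', Nil))
  val n: CodeTable = List(('N', Nil))
  val s: CodeTable = List(('S', Nil))

  // First merge: M becomes 0, N becomes 1.
  val mn: CodeTable = mergeCodeTables(m, n)
  // Second merge: S goes left (0); M and N go right, so a 1 is prepended to each code.
  val smn: CodeTable = mergeCodeTables(s, mn)
}
```

Here mn is List(('M', List(0)), ('N', List(1))) and smn is List(('S', List(0)), ('M', List(1, 0)), ('N', List(1, 1))), matching ((S,0),(M,10),(N,11)) from the reasoning above.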

(I cannot help admiring the problem setter's thinking. Brilliant, simply brilliant!!!)

0 0