[Lab 4] Lossless Data Compression: Encoding and Decoding



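I. Overview

This lab asks us to master the data structures behind a Huffman encoder and decoder and the way they are implemented, and then, on top of a working implementation, to analyze how efficiently different files are compressed.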

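II. Basic Principles Involved

1. Summary of the principles

The principles involved in this lab are: how Huffman coding works, the binary tree and its node data structure, and the analysis of coding efficiency.

2. How Huffman coding works

Huffman coding makes full use of the probability distribution of the source and is an optimal symbol-by-symbol coding method. The procedure is: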

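(1) First, sort the q source symbols in descending order of probability;

(2) Assign the code symbols 0 and 1 to the two least probable source symbols, and merge these two symbols into one new symbol whose probability is the sum of the two, which gives a reduced source of q-1 symbols;

(3) Repeat this process until the source has been reduced to a single symbol with probability 1;

(4) Starting from the last node, trace back along the coding path to read off the codeword of each source symbol.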
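The four steps above can be sketched in C as follows. This is a minimal illustration and not the code used in this lab: the function name huffman_lengths, the limit MAX_SYMS and the array-based tree are my own assumptions. It only computes code lengths (the depth of each leaf), and it finds the two smallest weights by linear search instead of sorting.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SYMS 256 /* hypothetical limit for this sketch */

/* Compute Huffman code lengths for n symbols with the given counts.
 * All counts are assumed to be greater than zero. */
void huffman_lengths(const unsigned long *count, size_t n, unsigned *len)
{
    unsigned long weight[2 * MAX_SYMS];
    int parent[2 * MAX_SYMS];
    int alive[2 * MAX_SYMS];
    size_t total = n;
    size_t i;

    assert(n >= 1 && n <= MAX_SYMS);

    /* Every symbol starts as a root of its own one-node tree. */
    for (i = 0; i < n; ++i) {
        weight[i] = count[i];
        parent[i] = -1;
        alive[i] = 1;
    }

    /* Steps 2 and 3: merge the two lightest roots, n - 1 times. */
    for (size_t merges = 0; merges + 1 < n; ++merges) {
        int m1 = -1, m2 = -1; /* lightest and second lightest root */
        for (i = 0; i < total; ++i) {
            if (!alive[i])
                continue;
            if (m1 < 0 || weight[i] < weight[m1]) {
                m2 = m1;
                m1 = (int)i;
            } else if (m2 < 0 || weight[i] < weight[m2]) {
                m2 = (int)i;
            }
        }
        weight[total] = weight[m1] + weight[m2];
        parent[total] = -1;
        alive[total] = 1;
        alive[m1] = alive[m2] = 0;
        parent[m1] = parent[m2] = (int)total;
        ++total;
    }

    /* Step 4: the code length of a symbol is the depth of its leaf. */
    for (i = 0; i < n; ++i) {
        unsigned depth = 0;
        for (int p = parent[i]; p >= 0; p = parent[p])
            ++depth;
        len[i] = depth;
    }
}
```

For the counts {4, 2, 2, 1, 1}, the sum of count times length is 22, that is, an average of 2.2 bits per symbol. The individual lengths may differ depending on how ties are broken, but this sum does not.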

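3. Binary tree traversal

To implement Huffman coding, the program uses a binary tree to build the code tree. Unlike the arrays we usually work with, a binary tree is not a purely linear structure: its nodes have no natural predecessor or successor. Under a given rule, such as a traversal, we can nevertheless impose a linear order on the nodes.

There are several ways to traverse a binary tree, for example pre-order, in-order and post-order traversal. They differ only in when the root is visited relative to its subtrees, yet that alone is enough to produce quite different visiting orders. Beyond these three there are other, more elaborate traversals as well.

The Huffman coding program in this lab uses pre-order traversal.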
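To make the difference between the three orders concrete, here is a small sketch in C. The tnode type and the function names are made up for this illustration and do not appear in the lab program; each function appends the labels it visits to a string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node used only for this illustration. */
typedef struct tnode {
    char label;
    struct tnode *left, *right;
} tnode;

/* Append one label to the NUL-terminated output string. */
static void emit(char *out, char c)
{
    size_t n = 0;
    assert(out != NULL);
    while (out[n] != '\0')
        ++n;
    out[n] = c;
    out[n + 1] = '\0';
}

/* Root first, then left subtree, then right subtree. */
void preorder(const tnode *t, char *out)
{
    if (t == NULL) return;
    emit(out, t->label);
    preorder(t->left, out);
    preorder(t->right, out);
}

/* Left subtree, then root, then right subtree. */
void inorder(const tnode *t, char *out)
{
    if (t == NULL) return;
    inorder(t->left, out);
    emit(out, t->label);
    inorder(t->right, out);
}

/* Left subtree, then right subtree, then root. */
void postorder(const tnode *t, char *out)
{
    if (t == NULL) return;
    postorder(t->left, out);
    postorder(t->right, out);
    emit(out, t->label);
}
```

For a root A with left child B and right child C, the three functions visit the nodes in the orders A B C, B A C and B C A respectively.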

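4. Analysis of coding efficiency

We analyze coding efficiency from two angles. The first is theoretical: compute the source entropy and the average code length and compare the two. The second is to compare file sizes directly. The second approach is more meaningful in practice than a purely theoretical analysis, because in a real application the code table is stored in the output as well and therefore also affects the file size.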
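For reference, the quantities used in the theoretical analysis can be written out as below. The symbols p_i (probability of symbol i), l_i (length of its codeword), H, \bar{L} and \eta are my own notation, and the five-symbol source in the example is made up rather than taken from this lab.

```latex
H = -\sum_{i=1}^{q} p_i \log_2 p_i, \qquad
\bar{L} = \sum_{i=1}^{q} p_i \, l_i, \qquad
\eta = \frac{H}{\bar{L}}

% Example: p = (0.4, 0.2, 0.2, 0.1, 0.1) with Huffman code lengths l = (2, 2, 2, 3, 3)
H \approx 2.122 \ \text{bits/symbol}, \qquad
\bar{L} = 2.2 \ \text{bits/symbol}, \qquad
\eta \approx 0.965
```

The closer \eta is to 1, the closer the code is to the entropy bound.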

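III. Procedure

First, get the Huffman coding program to build and run. Once that works, add a feature so that the program writes to a txt text file the probability of each symbol, the length of its codeword, and the codeword itself. With that in place, Huffman-encode files of different formats and analyze them in terms of their statistics and compression efficiency.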

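IV. Key Code and Analysis

1. Overview of the key code:

This part first looks at the important structs in the program, then focuses on how the Huffman encoder is implemented, and finally briefly describes the feature I added.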

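2. Key structs:

(1) huffman_node

This struct is the node type of the Huffman code tree.

isLeaf distinguishes leaf nodes from non-leaf nodes (1 for a leaf, 0 for a non-leaf), which makes the later traversal much easier.

count stores the number of occurrences of the source symbols in the set that the node represents.

*parent points to the node's parent; *zero and *one point to the left child and the right child respectively (by convention the left child comes first, but the two are otherwise equivalent).

If the node is a leaf, symbol records its source symbol.

The part worth noting is isLeaf together with the union. A leaf has no children, so the union's storage is used by symbol; a non-leaf has no source symbol of its own, so the union's storage is used by *zero and *one. isLeaf tells the code which member of the union is valid. This design keeps the difference between leaf and non-leaf nodes while still treating both as the same node type, and it saves memory as well.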

typedef struct huffman_node_tag
{
    unsigned char isLeaf;
    unsigned long count;
    struct huffman_node_tag *parent;

    union
    {
        struct
        {
            struct huffman_node_tag *zero, *one;
        };
        unsigned char symbol;
    };
} huffman_node;

(2) huffman_code

This struct holds one entry of the Huffman code table.

numbits stores the length of the codeword in bits.

*bits stores the bits of the codeword.

typedef struct huffman_code_tag
{
    /* The length of this code in bits. */
    unsigned long numbits;

    /* The bits that make up this code. The first
       bit is at position 0 in bits[0]. The second
       bit is at position 1 in bits[0]. The eighth
       bit is at position 7 in bits[0]. The ninth
       bit is at position 0 in bits[1]. */
    unsigned char *bits;
} huffman_code;
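The bit layout described in that comment can be illustrated with two small helpers. The program calls a get_bit() function whose source is not listed in this report, so the versions below are my own sketch under the assumption that bit i sits at position i % 8 of bits[i / 8]; the names sketch_get_bit and sketch_set_bit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return bit i of a packed code: bit i is stored at
 * position (i % 8) of byte bits[i / 8]. */
unsigned char sketch_get_bit(const unsigned char *bits, unsigned long i)
{
    assert(bits != NULL);
    return (unsigned char)((bits[i / 8] >> (i % 8)) & 1u);
}

/* Set bit i of a packed code to 1 using the same layout. */
void sketch_set_bit(unsigned char *bits, unsigned long i)
{
    assert(bits != NULL);
    bits[i / 8] |= (unsigned char)(1u << (i % 8));
}
```

With this layout, setting bits 0, 7 and 8 in a zeroed two-byte buffer gives bits[0] == 0x81 and bits[1] == 0x01.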

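3. The Huffman encoder

(1) huffman_encode_file()

This function is the entry point of the Huffman encoder.

It first scans the file once with get_symbol_frequencies() to count how often each source symbol occurs, which is the preparation needed for building the code tree.

Then calculate_huffman_codes() derives the code table by following the four steps of Huffman coding described earlier.

The rest follows naturally: the file is scanned a second time and the encoded output is produced from the code table.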

/*
 * huffman_encode_file huffman encodes in to out.
 */
int
huffman_encode_file(FILE *in, FILE *out)
{
    SymbolFrequencies sf;
    SymbolEncoder *se;
    //modify by Wu Ruocheng
    //date:2017.4.24
    SymbolInformation si;
    huffman_node *root = NULL;
    int rc;
    unsigned int symbol_count;

    /* Get the frequency of each symbol in the input file. */
    symbol_count = get_symbol_frequencies(&sf, in);

    //modify by Wu Ruocheng
    //date:2017.4.24
    //get the frequency for SymbolInformation
    init_Wu_SymInfo(&si);
    get_Wu_symbolInfo_frequencies(&sf, &si, symbol_count);

    /* Build an optimal table from the symbolCount. */
    se = calculate_huffman_codes(&sf);
    root = sf[0];

    //modify by Wu Ruocheng
    //date:2017.4.24
    //get the code for SymbolInformation
    get_Wu_symbloInfo_code(se, &si);
    print_Wu_symbloInfo(&si);

    /* Scan the file again and, using the table
       previously built, encode it into the output file. */
    rewind(in);
    rc = write_code_table(out, se, symbol_count);
    if(rc == 0)
        rc = do_file_encode(in, out, se);

    /* Free the Huffman tree. */
    free_huffman_tree(root);
    free_encoder(se);
    return rc;
}

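(2) get_symbol_frequencies()

This function first calls init_frequencies() to initialize SymbolFrequencies *pSF and then scans the file. For each source symbol seen for the first time it creates a leaf node with new_leaf_node(), and it counts every occurrence by incrementing count in the corresponding leaf node.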

static unsigned int
get_symbol_frequencies(SymbolFrequencies *pSF, FILE *in)
{
    int c;
    unsigned int total_count = 0;

    /* Set all frequencies to 0. */
    init_frequencies(pSF);

    /* Count the frequency of each symbol in the input file. */
    while((c = fgetc(in)) != EOF)
    {
        unsigned char uc = c;
        if(!(*pSF)[uc])
            (*pSF)[uc] = new_leaf_node(uc);
        ++(*pSF)[uc]->count;
        ++total_count;
    }

    return total_count;
}

As the code below shows, in a newly created leaf node isLeaf is set to 1 and the source symbol is stored in symbol.

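/* The return type was missing in the listing; it is restored here. */
static huffman_node*
new_leaf_node(unsigned char symbol)
{
    huffman_node *p = (huffman_node*)malloc(sizeof(huffman_node));
    p->isLeaf = 1;
    p->symbol = symbol;
    p->count = 0;
    p->parent = 0;
    return p;
}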

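(3) calculate_huffman_codes()

This function is the core of the whole Huffman encoder.

First, qsort() sorts the whole *pSF array by count, and the number of distinct source symbols n is counted. This corresponds to step 1 of the Huffman procedure (sort the q source symbols by probability), except that the code sorts in ascending order, so that the two least frequent nodes sit at the front of the array and NULL entries go to the end.

Next, new_nonleaf_node(m1->count+m2->count,m1,m2) merges the two nodes with the smallest count into a new non-leaf node. The new node's *zero points to the smaller of the two and *one to the larger. The address of the new node is stored back into the *pSF array (the addresses of the two old nodes are no longer in *pSF, but the nodes have not been freed and remain reachable through the new node). The size of *pSF thus drops from n to n-1. This corresponds to step 2 of the Huffman procedure: assign the code symbols 0 and 1 to the two least probable source symbols and merge them into one new symbol, which gives a reduced source of q-1 symbols.

When the size has dropped to 1, *pSF holds only the address of the last node, which is the root of the code tree. This corresponds to step 3: repeat the process until the source has been reduced to a single symbol with probability 1.

Finally, build_symbol_encoder() performs a pre-order traversal from the root, recursing down to every leaf, and at each leaf calls new_code() to generate the codeword of that source symbol. This corresponds to step 4: starting from the last node, trace back along the coding path to read off the codeword of each source symbol.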

/*
 * calculate_huffman_codes turns pSF into an array
 * with a single entry that is the root of the
 * huffman tree. The return value is a SymbolEncoder,
 * which is an array of huffman codes index by symbol value.
 */
static SymbolEncoder*
calculate_huffman_codes(SymbolFrequencies * pSF)
{
    unsigned int i = 0;
    unsigned int n = 0;
    huffman_node *m1 = NULL, *m2 = NULL;
    SymbolEncoder *pSE = NULL;

    /* Sort the symbol frequency array by ascending frequency. */
    qsort((*pSF), MAX_SYMBOLS, sizeof((*pSF)[0]), SFComp);

    /* Get the number of symbols. */
    for(n = 0; n < MAX_SYMBOLS && (*pSF)[n]; ++n)
        ;

    /*
     * Construct a Huffman tree. This code is based
     * on the algorithm given in Managing Gigabytes
     * by Ian Witten et al, 2nd edition, page 34.
     * Note that this implementation uses a simple
     * count instead of probability.
     *
     * The loop condition is written as i + 1 < n rather
     * than i < n - 1: n is unsigned, so n - 1 would wrap
     * around for an empty input (n == 0).
     */
    for(i = 0; i + 1 < n; ++i)
    {
        /* Set m1 and m2 to the two subsets of least probability. */
        m1 = (*pSF)[0];
        m2 = (*pSF)[1];

        /* Replace m1 and m2 with a set {m1, m2} whose probability
         * is the sum of that of m1 and m2. */
        (*pSF)[0] = m1->parent = m2->parent =
            new_nonleaf_node(m1->count + m2->count, m1, m2);
        (*pSF)[1] = NULL;

        /* Put newSet into the correct count position in pSF. */
        qsort((*pSF), n, sizeof((*pSF)[0]), SFComp);
    }

    /* Build the SymbolEncoder array from the tree. */
    pSE = (SymbolEncoder*)malloc(sizeof(SymbolEncoder));
    memset(pSE, 0, sizeof(SymbolEncoder));
    build_symbol_encoder((*pSF)[0], pSE);
    return pSE;
}

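Worth mentioning here is the qsort() sorting function provided by stdlib.h. By passing a comparison function as an argument, qsort() can sort an array of any user-defined type in the order that the comparator defines, which is very convenient. The sort above, for example, uses SFComp as its comparator.

The programmer only has to write a comparator for the data type in question and call qsort(), which makes it a very handy library function.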

/*
 * When used by qsort, SFComp sorts the array so that
 * the symbol with the lowest frequency is first. Any
 * NULL entries will be sorted to the end of the list.
 */
static int
SFComp(const void *p1, const void *p2)
{
    const huffman_node *hn1 = *(const huffman_node**)p1;
    const huffman_node *hn2 = *(const huffman_node**)p2;

    /* Sort all NULLs to the end. */
    if(hn1 == NULL && hn2 == NULL)
        return 0;
    if(hn1 == NULL)
        return 1;
    if(hn2 == NULL)
        return -1;

    if(hn1->count > hn2->count)
        return 1;
    else if(hn1->count < hn2->count)
        return -1;

    return 0;
}

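With *pSF sorted, the program builds the code tree by repeatedly calling new_nonleaf_node(m1->count+m2->count,m1,m2). The main difference from new_leaf_node() above is that a non-leaf node does not use symbol; instead it uses zero and one, the two pointers to its children. This is where the union pays off.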

static huffman_node*
new_nonleaf_node(unsigned long count, huffman_node *zero, huffman_node *one)
{
    huffman_node *p = (huffman_node*)malloc(sizeof(huffman_node));
    p->isLeaf = 0;
    p->count = count;
    p->zero = zero;
    p->one = one;
    p->parent = 0;
    return p;
}

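After n-1 iterations only (*pSF)[0] remains in *pSF, and that node is the root of the binary tree. From here the codewords of all leaves can be obtained by following the paths between the root and the leaves.

The program calls build_symbol_encoder() recursively to walk the whole code tree and generate the codeword of every leaf. It handles the current node first and then visits its left child and its right child, which is a typical pre-order traversal.

When the traversal reaches a leaf, it calls new_code() to generate the codeword. In a binary tree the path between the root and a given leaf is unique, and the codeword is defined by reading that path from the root down. Even so, the approach used here is easier to understand and implement: start at the leaf, walk up to the root collecting one bit per step, and then reverse the bit order. The reason is that every non-root node has exactly one parent, whereas a non-leaf node may have two children, so walking down from the root requires choosing a branch at every step while walking up does not.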

/*
 * build_symbol_encoder builds a SymbolEncoder by walking
 * down to the leaves of the Huffman tree and then,
 * for each leaf, determines its code.
 */
static void
build_symbol_encoder(huffman_node *subtree, SymbolEncoder *pSF)
{
    if(subtree == NULL)
        return;

    if(subtree->isLeaf)
        (*pSF)[subtree->symbol] = new_code(subtree);
    else
    {
        build_symbol_encoder(subtree->zero, pSF);
        build_symbol_encoder(subtree->one, pSF);
    }
}

/*
 * new_code builds a huffman_code from a leaf in
 * a Huffman tree.
 */
static huffman_code*
new_code(const huffman_node* leaf)
{
    /* Build the huffman code by walking up to
     * the root node and then reversing the bits,
     * since the Huffman code is calculated by
     * walking down the tree. */
    unsigned long numbits = 0;
    unsigned char* bits = NULL;
    huffman_code *p;

    while(leaf && leaf->parent)
    {
        huffman_node *parent = leaf->parent;
        unsigned char cur_bit = (unsigned char)(numbits % 8);
        unsigned long cur_byte = numbits / 8;

        /* If we need another byte to hold the code,
           then allocate it. */
        if(cur_bit == 0)
        {
            size_t newSize = cur_byte + 1;
            /* Cast matches the type of bits (unsigned char*). */
            bits = (unsigned char*)realloc(bits, newSize);
            bits[newSize - 1] = 0; /* Initialize the new byte. */
        }

        /* If a one must be added then or it in. If a zero
         * must be added then do nothing, since the byte
         * was initialized to zero. */
        if(leaf == parent->one)
            bits[cur_byte] |= 1 << cur_bit;

        ++numbits;
        leaf = parent;
    }

    if(bits)
        reverse_bits(bits, numbits);

    p = (huffman_code*)malloc(sizeof(huffman_code));
    p->numbits = numbits;
    p->bits = bits;
    return p;
}
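new_code() relies on reverse_bits(), whose source is not listed in this report. The sketch below shows one way such a reversal could work. It assumes the bit layout of huffman_code (bit i at position i % 8 of bits[i / 8]), and the names bit_at, put_bit and reverse_bits_sketch are my own, not the library's.

```c
#include <assert.h>
#include <stddef.h>

/* Read bit i of a packed bit array. */
static int bit_at(const unsigned char *bits, unsigned long i)
{
    return (bits[i / 8] >> (i % 8)) & 1;
}

/* Write bit i of a packed bit array (v is 0 or 1). */
static void put_bit(unsigned char *bits, unsigned long i, int v)
{
    if (v)
        bits[i / 8] |= (unsigned char)(1u << (i % 8));
    else
        bits[i / 8] &= (unsigned char)~(1u << (i % 8));
}

/* Reverse the order of the first numbits bits in place by
 * swapping bit pairs from both ends towards the middle. */
void reverse_bits_sketch(unsigned char *bits, unsigned long numbits)
{
    assert(bits != NULL || numbits == 0);
    if (numbits < 2)
        return;
    for (unsigned long lo = 0, hi = numbits - 1; lo < hi; ++lo, --hi) {
        int a = bit_at(bits, lo);
        int b = bit_at(bits, hi);
        put_bit(bits, lo, b);
        put_bit(bits, hi, a);
    }
}
```

For example, a 3-bit code collected leaf-to-root as 1, 0, 0 (byte 0x01) becomes 0, 0, 1 (byte 0x04) after the reversal, which is the root-to-leaf order the decoder needs.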

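4. The code I added

The lab requires us to output not only each source symbol, its codeword and the codeword length, but also the probability of each symbol. I had two ideas for obtaining the probabilities: option one is to traverse all the leaves of the code tree, and option two is to read the counts from *pSF before the code tree is built. Option one is more interesting and I would certainly learn more from implementing it, but option two is simpler, so to finish the lab sooner I chose option two.

That does not mean option one is infeasible. After the codewords have been generated (that is, after calculate_huffman_codes() has returned), *pSF holds only the address of the root, but the leaves are still reachable. We could collect the number of occurrences of each source symbol with a pre-order traversal similar to build_symbol_encoder(), without even having to define a new data type.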
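Option one could look roughly like the sketch below. I did not implement it in this lab, so this is only an outline: sketch_node is a simplified stand-in for huffman_node (it uses plain members instead of the union so that the snippet compiles on its own), and collect_leaf_counts and SKETCH_MAX_SYMBOLS are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_SYMBOLS 256 /* one slot per byte value */

/* Simplified stand-in for huffman_node, for illustration only. */
typedef struct sketch_node {
    unsigned char isLeaf;
    unsigned long count;
    struct sketch_node *zero, *one; /* used only when isLeaf == 0 */
    unsigned char symbol;           /* used only when isLeaf == 1 */
} sketch_node;

/* Walk the tree from the root (node first, then the zero child,
 * then the one child) and copy each leaf's count into
 * counts[symbol]. Slots of unused symbols are left untouched. */
void collect_leaf_counts(const sketch_node *t,
                         unsigned long counts[SKETCH_MAX_SYMBOLS])
{
    assert(counts != NULL);
    if (t == NULL)
        return;
    if (t->isLeaf) {
        counts[t->symbol] = t->count;
        return;
    }
    collect_leaf_counts(t->zero, counts);
    collect_leaf_counts(t->one, counts);
}
```

Since the count of the root equals the total number of symbols in the file, the probability of a symbol would then be counts[symbol] divided by the root's count.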

To implement option two, I created a new data structure and added several functions that operate on it.

(1) The huffman_info struct

//modify by Wu Ruocheng
//date:2017.4.24
typedef struct huffman_info_tag
{
    double frequency;
    unsigned long numbits;
    unsigned char *bit;
} huffman_info;

//modify by Wu Ruocheng
//date:2017.4.24
typedef huffman_info* SymbolInformation[MAX_SYMBOLS];

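(2) Interface

The interface functions, in order, initialize the structure, copy count from *sf, copy the code length numbits and the codeword bits from *se, and finally write each source symbol, its number of occurrences, its code length and its codeword to a txt file.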

//modify by Wu Ruocheng
//date:2017.4.24
//initialize SymbolInformation
int init_Wu_SymInfo(SymbolInformation *si)
{
    printf("OK_init\n");
    for (int i = 0; i < MAX_SYMBOLS; i++)
    {
        (*si)[i] = (huffman_info*)malloc(sizeof(huffman_info));
        (*si)[i]->bit = 0;
        (*si)[i]->frequency = 0;
        (*si)[i]->numbits = 0;
    }
    return 0;
}

//copy the count of every symbol that occurs in the file
//note: total_count is not used, so frequency holds the raw count;
//dividing by total_count would turn it into a probability
int get_Wu_symbolInfo_frequencies(SymbolFrequencies* sf, SymbolInformation* si, unsigned int total_count)
{
    printf("OK_get_freq\n");
    for (int i = 0; i < MAX_SYMBOLS; i++)
    {
        if ((*sf)[i])
        {
            (*si)[i]->frequency = (*sf)[i]->count;
        }
    }
    return 0;
}

//copy the code length and the codeword of every encoded symbol
int get_Wu_symbloInfo_code(SymbolEncoder* se, SymbolInformation* si)
{
    printf("OK_get_code\n");
    for (int i = 0; i < MAX_SYMBOLS; i++)
    {
        if ((*se)[i])
        {
            (*si)[i]->numbits = (*se)[i]->numbits;
            (*si)[i]->bit = (*se)[i]->bits;
        }
    }
    return 0;
}

//write symbol, code length, count and codeword to SymbolInfo.txt
int print_Wu_symbloInfo(SymbolInformation *si)
{
    printf("OK_get_print\n");
    FILE* outputfile = NULL;
    outputfile = fopen("SymbolInfo.txt", "w");
    if (outputfile)
    {
        printf("can open the SymbolInfo.txt\n");
    }
    else
    {
        printf("cannot open the SymbolInfo.txt\n");
        return -1; /* was a bare "return;" in a function returning int */
    }
    fprintf(outputfile, "Symbol\tLength\tFrequencies\tCode\t\n");
    for (int i = 0; i < MAX_SYMBOLS; i++)
    {
        fprintf(outputfile, "%d\t", i);
        fprintf(outputfile, "%d\t", (int)(*si)[i]->numbits);
        fprintf(outputfile, "%f\t", (*si)[i]->frequency);
        for (int j = 0; j < (int)(*si)[i]->numbits; j++)
        {
            fprintf(outputfile, "%d\t", get_bit((*si)[i]->bit, j));
        }
        fprintf(outputfile, "\n");
    }
    fclose(outputfile);
    return 0;
}

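V. Results

1. Test

To check that the added code produces the intended output, I fed the earlier test image down.yuv into the program, which produced a chart of the occurrence count of each source symbol. The image is a 256x256 YUV 4:2:0 picture with 98304 samples in total (256 x 256 x 1.5 = 98304 bytes). Strikingly, the symbol value 16 occurs as many as 10825 times. This is because the YUV image has been processed to meet the requirements of color television systems, which reserve the lowest 16 levels as a guard band.

After Huffman coding, the average code length for this image is 7.28125 bits per symbol.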
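As a rough reading of this figure, the arithmetic below assumes that every input symbol is one 8-bit byte, that the reported value is the count-weighted average code length, and that the stored code table is ignored. The resulting byte count is derived from the numbers above, not measured from an output file.

```latex
\frac{\bar{L}}{8} = \frac{7.28125}{8} \approx 0.9102, \qquad
98304 \times \frac{7.28125}{8} \approx 89472 \ \text{bytes}, \qquad
\frac{8}{7.28125} \approx 1.0987
```

Under these assumptions the encoded payload would be about 9% smaller than the original file, before the code table is added.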

The test shows that the added code meets the requirements of the lab.