HDU 1053 Entropy

Source: Internet. Editor: 程序博客网. Time: 2024/05/09 23:02

Entropy

Time Limit: 2000/1000 ms (Java/Other)  Memory Limit: 65536/32768 K (Java/Other)

Total Submission(s): 3  Accepted Submission(s): 2

Problem Description

An entropy encoder is a data encoding method that achieves lossless data compression by encoding a message with “wasted” or “extra” information removed. In other words, entropy encoding removes information that was not necessary in the first place to accurately encode the message. A high degree of entropy implies a message with a great deal of wasted information; English text encoded in ASCII is an example of a message type that has very high entropy. Already compressed messages, such as JPEG graphics or ZIP archives, have very little entropy and do not benefit from further attempts at entropy encoding.

English text encoded in ASCII has a high degree of entropy because all characters are encoded using the same number of bits, eight. It is a known fact that the letters E, L, N, R, S and T occur at a considerably higher frequency than do most other letters in English text. If a way could be found to encode just these letters with four bits, then the new encoding would be smaller, would contain all the original information, and would have less entropy. ASCII uses a fixed number of bits for a reason, however: it’s easy, since one is always dealing with a fixed number of bits to represent each possible glyph or character. How would an encoding scheme that used four bits for the above letters be able to distinguish between the four-bit codes and eight-bit codes? This seemingly difficult problem is solved using what is known as a “prefix-free variable-length” encoding.

In such an encoding, any number of bits can be used to represent any glyph, and glyphs not present in the message are simply not encoded. However, in order to be able to recover the information, no bit pattern that encodes a glyph is allowed to be the prefix of any other encoding bit pattern. This allows the encoded bitstream to be read bit by bit, and whenever a set of bits is encountered that represents a glyph, that glyph can be decoded. If the prefix-free constraint was not enforced, then such a decoding would be impossible.
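
The prefix-free constraint can be checked mechanically. The following is a minimal sketch, not part of the original problem or solution; the function name `isPrefixFree` is hypothetical, and code words are assumed to be given as strings of '0' and '1'.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns true if no code word is a prefix of another.
// Two identical code words also count as a violation.
bool isPrefixFree(const std::vector<std::string>& codes) {
    for (std::size_t i = 0; i < codes.size(); ++i) {
        for (std::size_t j = 0; j < codes.size(); ++j) {
            if (i == j) continue;
            // codes[i] is a prefix of codes[j] if codes[j] starts with codes[i].
            if (codes[j].compare(0, codes[i].size(), codes[i]) == 0) return false;
        }
    }
    return true;
}
```

For instance, the set "0", "10", "110", "111" passes the check, while "0", "01", "11" fails because "0" is a prefix of "01".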

Consider the text “AAAAABCD”. Using ASCII, encoding this would require 64 bits. If, instead, we encode “A” with the bit pattern “00”, “B” with “01”, “C” with “10”, and “D” with “11” then we can encode this text in only 16 bits; the resulting bit pattern would be “0000000000011011”. This is still a fixed-length encoding, however; we’re using two bits per glyph instead of eight. Since the glyph “A” occurs with greater frequency, could we do better by encoding it with fewer bits? In fact we can, but in order to maintain a prefix-free encoding, some of the other bit patterns will become longer than two bits. An optimal encoding is to encode “A” with “0”, “B” with “10”, “C” with “110”, and “D” with “111”. (This is clearly not the only optimal encoding, as it is obvious that the encodings for B, C and D could be interchanged freely for any given encoding without increasing the size of the final encoded message.) Using this encoding, the message encodes in only 13 bits to “0000010110111”, a compression ratio of 4.9 to 1 (that is, each bit in the final encoded message represents as much information as did 4.9 bits in the original encoding). Read through this bit pattern from left to right and you’ll see that the prefix-free encoding makes it simple to decode this into the original text even though the codes have varying bit lengths.
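
The left-to-right decoding described above can be sketched in a few lines. This is an illustration only, assuming the bitstream is given as a string of '0' and '1' characters and the code table as a map from bit pattern to glyph; `decodePrefixFree` is a hypothetical name.

```cpp
#include <map>
#include <string>

// Hypothetical helper: decodes `bits` using a prefix-free code table.
std::string decodePrefixFree(const std::string& bits,
                             const std::map<std::string, char>& table) {
    std::string out, current;
    for (char bit : bits) {
        current += bit;
        std::map<std::string, char>::const_iterator it = table.find(current);
        if (it != table.end()) {  // the bits read so far form a complete code
            out += it->second;
            current.clear();
        }
    }
    return out;  // trailing bits that match no code are ignored in this sketch
}
```

With the table "0" → A, "10" → B, "110" → C, "111" → D, the pattern "0000010110111" decodes back to "AAAAABCD". Because no code is a prefix of another, the first match found while reading is always the right one.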

As a second example, consider the text “THE CAT IN THE HAT”. In this text, the letter “T” and the space character both occur with the highest frequency, so they will clearly have the shortest encoding bit patterns in an optimal encoding. The letters “C”, “I” and “N” only occur once, however, so they will have the longest codes.

There are many possible sets of prefix-free variable-length bit patterns that would yield the optimal encoding, that is, that would allow the text to be encoded in the fewest number of bits. One such optimal encoding is to encode spaces with “00”, “A” with “100”, “C” with “1110”, “E” with “1111”, “H” with “110”, “I” with “1010”, “N” with “1011” and “T” with “01”. The optimal encoding therefore requires only 51 bits compared to the 144 that would be necessary to encode the message with 8-bit ASCII encoding, a compression ratio of 2.8 to 1.
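
The 51-bit figure can be verified by adding up the code length of every character in the message. The sketch below is an illustration, not part of the original text; `encodedBits` is a hypothetical name, and the table is assumed to contain a code for every character that occurs.

```cpp
#include <map>
#include <string>

// Hypothetical helper: total number of bits needed to encode `text`
// when each character c is replaced by the bit pattern codes.at(c).
int encodedBits(const std::string& text,
                const std::map<char, std::string>& codes) {
    int bits = 0;
    for (char c : text) {
        bits += static_cast<int>(codes.at(c).size());  // throws if c has no code
    }
    return bits;
}
```

For “THE CAT IN THE HAT” the counts are T: 4, space: 4, H: 3, A: 2, E: 2, and C, I, N: 1 each, so the total is 4·2 + 4·2 + 3·3 + 2·3 + 2·4 + 3·4 = 51 bits, against 18·8 = 144 bits in ASCII.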

Input

The input file will contain a list of text strings, one per line. The text strings will consist only of uppercase alphanumeric characters and underscores (which are used in place of spaces). The end of the input will be signalled by a line containing only the word “END” as the text string. This line should not be processed.

Output

For each text string in the input, output the length in bits of the 8-bit ASCII encoding, the length in bits of an optimal prefix-free variable-length encoding, and the compression ratio accurate to one decimal point.

Sample Input

AAAAABCD
THE_CAT_IN_THE_HAT
END

Sample Output

64 13 4.9
144 51 2.8

Source

Greater New York 2000

==================================================================================================

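Count how often each character occurs, build a tree from those counts, then search the tree: the level at which a leaf sits is the number of bits in that character's code.

Suppose n nodes ki (i = 1..n) are given, with weights wi (i = 1..n). An optimal binary tree that has these n nodes as its leaves is constructed as follows:

First, turn the n given nodes into a set of n binary trees F = {T1, T2, ..., Tn}, where each tree Ti consists only of a root node ki with weight wi and has empty left and right subtrees. Then perform the following two steps:

(1) Select from F the two trees whose roots have the smallest weights, make them the left and right subtrees of a new binary tree, and set the weight of the new root to the sum of the weights of its two subtrees' roots;

(2) Remove these two trees from F and add the newly built tree to F.

Repeat (1) and (2) until F contains only one binary tree. That tree is the optimal binary tree.

This method of constructing an optimal binary tree is called the Huffman algorithm.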

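#include <algorithm>
#include <cstdio>
#include <cstring>
#include <queue>
using namespace std;

// One node of the Huffman tree.
struct tree {
    char ch;     // symbol stored in a leaf
    int count;   // weight: occurrences of the symbol, or sum of both children
    int deep;    // depth in the tree, i.e. the code length of a leaf
    tree *left, *right;
    tree() { left = right = NULL, deep = count = 0, ch = '?'; }
    // Reversed comparison, so priority_queue pops the smallest count first.
    friend bool operator<(const tree &a, const tree &b) { return a.count > b.count; }
};

struct kind { char ch; int count; } letter[201];  // distinct symbols and counts
int length;  // number of distinct symbols
int sum;     // encoded length of the message in bits
priority_queue<tree> PriorQueue;

void Huffman() {
    sum = 0;
    tree *a, *b, c, node, root;
    queue<tree> q;
    for (int i = 0; i < length; i++) {
        node.count = letter[i].count;
        node.ch = letter[i].ch;
        PriorQueue.push(node);
    }
    // Merge the two lightest trees until one remains (children are never freed).
    while (PriorQueue.size() != 1) {
        a = new tree; *a = PriorQueue.top(), PriorQueue.pop();
        b = new tree; *b = PriorQueue.top(), PriorQueue.pop();
        c.count = a->count + b->count;
        c.left = a, c.right = b;
        PriorQueue.push(c);
    }
    root = PriorQueue.top(), PriorQueue.pop(), root.deep = 0;
    // Breadth-first search: a leaf at depth d adds d * count bits.
    q.push(root);
    while (!q.empty()) {
        node = q.front(), q.pop();
        if (node.left) { node.left->deep = node.deep + 1; q.push(*node.left); }
        if (node.right) { node.right->deep = node.deep + 1; q.push(*node.right); }
        if (!node.left && !node.right) sum += node.deep * node.count;
    }
}

int main() {
    char str[1005];
    int i, len, count;  // int rather than char, so long strings do not overflow
    // Compare against 1 so that end of input also ends the loop.
    while (scanf("%1003s", str) == 1 && strcmp(str, "END") != 0) {
        len = (int)strlen(str);
        str[len] = '!';  // sentinel that differs from every input character
        sort(str, str + len);
        // Count runs of equal characters in the sorted string.
        for (length = 0, count = 1, i = 1; i <= len; i++) {
            if (str[i] != str[i - 1]) {
                letter[length].ch = str[i - 1];
                letter[length++].count = count;
                count = 1;
            } else
                count++;
        }
        if (length == 1)  // one distinct symbol: one bit per character
            printf("%d %d 8.0\n", 8 * len, len);
        else {
            Huffman();
            printf("%d %d %.1f\n", len * 8, sum, len * 8 * 1.0 / sum);
        }
    }
    return 0;
}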
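
The solution above builds the tree explicitly and then measures the depth of each leaf. If only the total bit count is needed, the tree can be skipped: every merge pushes all leaves beneath it one level deeper, so the encoded length equals the sum of the weights of all merged nodes. The sketch below is an alternative under that observation, not the author's method; `huffmanBits` is a hypothetical name.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Hypothetical helper: bit length of an optimal prefix-free encoding of `text`.
int huffmanBits(const std::string& text) {
    int freq[256] = {0};
    for (unsigned char c : text) ++freq[c];

    // Min-heap holding the root weights of the current forest.
    std::priority_queue<int, std::vector<int>, std::greater<int> > heap;
    for (int f : freq) {
        if (f > 0) heap.push(f);
    }

    // A single distinct symbol still needs one bit per character.
    if (heap.size() == 1) return heap.top();

    int total = 0;
    while (heap.size() > 1) {
        int a = heap.top(); heap.pop();
        int b = heap.top(); heap.pop();
        total += a + b;  // each leaf under this merge gains one bit
        heap.push(a + b);
    }
    return total;
}
```

For the sample input, `huffmanBits("AAAAABCD")` gives 13 and `huffmanBits("THE_CAT_IN_THE_HAT")` gives 51, matching the sample output; the ratio is then 8 times the string length divided by that value.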