Text Compression 1


Green hand
A text-compression model predicts (or counts) the probabilities of the characters: the model supplies a probability distribution over characters, and the decoder applies the same distribution to decode. Below we implement a first character-level model.
Equation (*1): $\mathrm{Entropy} = \sum_i -P[i]\,\log_2 P[i]$
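For concreteness, a minimal sketch of equation (*1) in Python (my illustration, not part of the original post): the entropy is the ideal average code length, and -log2 P[i] is the ideal code length for character i.

import math

def entropy(p):
    '''Entropy = sum over i of -P[i] * log2(P[i]).'''
    return sum(-x * math.log(x, 2) for x in p if x > 0)

# a toy distribution: 'a' is twice as likely as 'b' or 'c'
p = [0.5, 0.25, 0.25]
print entropy(p)                        # 1.5 bits per character on average
print [-math.log(x, 2) for x in p]      # ideal code lengths: [1.0, 2.0, 2.0]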

  1. Semi-static modeling
    On a first pass over the text, we compute the probability P[i] of each character i, then use equation (*1) to determine the length of each character's code.

  2. Adaptive modeling
    We start from a smooth (uniform) distribution over characters, then recompute the probabilities from just the text received so far. For example, in a 1000-character passage, if we are encoding or decoding at the 400th character and 'u' has occurred 20 times in the 400 characters read so far, we set P['u'] = 20.0/400. In this way, encoding and decoding share the same distribution. To avoid the zero-frequency problem, every character is initialized as having already appeared once.
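A minimal sketch of this update rule (my own illustration): every count starts at 1, and both sides apply the same update after each character, so encoder and decoder always agree.

class AdaptiveModel(object):
    '''Zero-order, character-level adaptive model over 'a'..'z'.'''

    def __init__(self):
        # every character "appears once" up front: no zero frequencies
        self.count = dict((chr(ord('a') + k), 1) for k in range(26))
        self.total = 26

    def prob(self, c):
        return float(self.count[c]) / self.total

    def update(self, c):
        # run on BOTH the encoder and the decoder side, in the same order
        self.count[c] += 1
        self.total += 1

m = AdaptiveModel()
for c in 'banana':
    print c, m.prob(c)    # probability of c before this occurrence
    m.update(c)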

  3. Canonical Huffman modeling
    For comparison, with an ordinary Huffman tree, decoding an alphabet of n symbols requires a tree with n leaves and n-1 internal nodes, where each internal node holds two pointers; in all the tree costs about 4n words of memory. In practice, decoding a vocabulary of one million words can take up to 16 MB of memory just for the tree.

With a canonical Huffman code, by contrast, we need only about n + 100 words of memory: the symbol table of size n plus a few small per-length arrays.

A canonical Huffman code is a special case of a Huffman code.
First, we provide the principles and some parameters:
Principles:
(*1). codewords of the same length are consecutive integers, e.g. 3D, 4D, 5D;
(*2). the first code of length i can be calculated from the last code of length i-1 using equation (*2);
(*3). the first code of the minimum length is 0D.
Parameters:
firstcode[i]: the first code of length i, computed with equation (*2); it is an actual binary codeword;
numl[i]: the total number of codes of length i;
index[i]: the index in the dictionary of the first code of length i.
Equation (*2): firstcode[i] = 2 * (last_code[i-1] + 1), firstcode[min_len] = 0
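A short sketch of equation (*2) in code (my illustration), using the code-length counts of the example that follows (1 code of length 3, 8 of length 4, 12 of length 5):

numl = {3: 1, 4: 8, 5: 12}      # number of codes of each length
min_len, max_len = 3, 5

firstcode = {min_len: 0}        # principle (*3)
index = {min_len: 0}
for i in range(min_len + 1, max_len + 1):
    last_code = firstcode[i - 1] + numl[i - 1] - 1   # last code of length i-1
    firstcode[i] = 2 * (last_code + 1)               # equation (*2)
    index[i] = index[i - 1] + numl[i - 1]

print firstcode   # {3: 0, 4: 2, 5: 20}, i.e. 000b, 0010b, 10100b
print index       # {3: 0, 4: 1, 5: 9}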

Second, construct the code words:
e.g.
the words 'a'~'u' have code lengths 3 for 'a', 4 for 'b'..'i', and 5 for 'j'..'u'. By Principle-3, 'a' gets the code 000b. By Principle-2, the first length-4 code is 2*(0+1) = 2, so 'b' is 0010b; by Principle-1, 'c' is 0011b, and so on.
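Continuing the sketch (again my own illustration): number the codewords consecutively within each length, starting from firstcode[i]:

lengths = [3] + [4] * 8 + [5] * 12     # 'a':3, 'b'..'i':4, 'j'..'u':5
firstcode = {3: 0, 4: 2, 5: 20}        # from equation (*2)

next_code = dict(firstcode)            # next unassigned code per length
for ch, l in zip('abcdefghijklmnopqrstu', lengths):
    print ch, format(next_code[l], '0%db' % l)
    next_code[l] += 1
# a 000, b 0010, c 0011, ..., i 1001, j 10100, ..., u 11111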

Finally, the decoding algorithm:
The numeric value of the first j bits of a length-i codeword (j < i) is greater than the value of any length-j codeword. So we read bits one at a time: as soon as the accumulated value falls into the range of some length i, we know the actual length of the pending codeword, and the offset between its value and firstcode[i] locates the symbol in the dictionary. For example, reading 0010: after 3 bits the value 001b = 1 exceeds the last length-3 code (000b), so the code must be longer; after 4 bits the value 0010b = 2 falls between firstcode[4] = 2 and the last length-4 code 1001b = 9, so the length is 4 and the symbol is table[index[4] + 2 - 2] = 'b'.

Python code:

Only a character-level model over the 26 English letters, as an example; the encoder for the canonical Huffman model is not implemented.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def lines(text):
    '''
    Yield the characters of `text` one by one, with a final newline
    as a sentinel so the last word gets flushed.
    '''
    for i in text:
        yield i
    yield '\n'

def blocks(text):
    '''
    Group consecutive non-whitespace characters into words and yield them.
    '''
    b = []
    for i in lines(text):
        if i.strip():
            b.append(i)
        elif b:
            yield ''.join(b)
            b = []

def word_index():
    '''
    Build a sorted, lower-cased vocabulary of the purely-English words in
    the file. Bytes are decoded from 'utf-8' to unicode (errors ignored)
    so the character-range comparison below is well defined; case is
    ignored ('A' counts as 'a').
    '''
    vocabulary = []
    total = 0
    with open('./casual/te.txt') as f:
        for i in blocks(f.read()):
            total += 1
            if i.lower() in vocabulary:
                continue
            word = unicode(i, 'utf-8', errors='ignore')
            # keep the word only if every character is an English letter
            if all(u'A' <= c <= u'Z' or u'a' <= c <= u'z' for c in word):
                vocabulary.append(i.lower())
    vocabulary.sort()
    print vocabulary
    print total

def semiStaticModeling():
    '''
    Semi-static model for Huffman codes: read through the whole passage
    once, count how often each of the 26 letters occurs, then print the
    probability of each letter.
    '''
    total = 0
    chars = [0] * 26
    with open('./casual/te.txt') as f:
        for i in lines(f.read()):
            c = i.lower()
            if 'a' <= c <= 'z':
                chars[ord(c) - ord('a')] += 1
                total += 1
    for n in chars:
        print float(n) / total
    return chars

def adaModel(filename):
    '''
    Adaptive modeling: a zero-order, character-level model over the 26
    English letters only. To avoid the zero-frequency problem, every
    count starts at 1 (as if each letter had already appeared once).
    Attributes:
        filedes: how many letters have been read so far
    '''
    count = dict((chr(ord('a') + k), 1) for k in range(26))
    filedes = 26                  # the 26 virtual initial occurrences
    with open(filename) as f:
        for i in lines(f.read()):
            c = i.lower()
            if 'a' <= c <= 'z':
                count[c] += 1
                filedes += 1
                if c == 'a':
                    # running probability of 'a', shared by encoder and decoder
                    print filedes, float(count['a']) / filedes

def canonicalM(bits, firstcode, numl, index, table):
    '''
    DECODE one codeword with the canonical Huffman model.
    Attributes:
        bits: the incoming codeword as a sequence of 0/1 bits
        firstcode[i]: the first code of length i, from equation (*2)
        numl[i]: total number of codes of length i
        index[i]: index in `table` of the first length-i code
        table: the symbols, sorted by code length, then by code value
    (*2): firstcode[i] = 2*(last_code[i-1]+1), firstcode[min_len] = 0
    e.g.  --http://blog.csdn.net/goncely/article/details/616589
        firstcode[3:5] = 000b, 0010b, 10100b
        numl[3:5]      = 1 (a), 8 (b~i), 12 (j~u)
        index[3:5]     = 0, 1, 9
    '''
    code = 0
    l = 0
    for b in bits:
        code = (code << 1) | b    # accumulate the next bit
        l += 1
        # a valid length-l code lies in [firstcode[l], firstcode[l] + numl[l])
        if l in firstcode and firstcode[l] <= code < firstcode[l] + numl[l]:
            print table[index[l] + code - firstcode[l]]
            return
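A quick check of the decoder above, using the example tables from its docstring (the test values are my own):

firstcode = {3: 0, 4: 2, 5: 20}
numl = {3: 1, 4: 8, 5: 12}
index = {3: 0, 4: 1, 5: 9}
table = [chr(ord('a') + k) for k in range(21)]    # 'a'..'u'

canonicalM([0, 0, 0], firstcode, numl, index, table)        # prints: a
canonicalM([0, 0, 1, 0], firstcode, numl, index, table)     # prints: b
canonicalM([1, 0, 1, 0, 0], firstcode, numl, index, table)  # prints: j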

Reference: http://blog.csdn.net/goncely/article/details/616589
