Text Compression 1
Source: Internet  Published by: 数据字典怎么画  Editor: 程序博客网  Time: 2024/05/21 17:31
Green hand
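A text compression model predicts (or counts) the probability of each character. The model supplies this probability distribution to the encoder, and the decoder then applies the same distribution to decode. Below we implement a first, character-level model.
Equation(*1): Entropy = Sum(-P[i] * log2(P[i])), where P[i] is the probability of character i; a character with probability P[i] ideally gets a code of -log2(P[i]) bits.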
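As a rough illustration of the entropy formula, the sketch below counts the characters of a short string and evaluates the sum directly. The function name entropy_bits and the sample string are made up for this example, and base-2 logarithms are assumed so that the result is in bits per character.

```python
import math
from collections import Counter


def entropy_bits(text):
    """Return the zero-order entropy of text in bits per character."""
    counts = Counter(text)
    total = len(text)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # hypothetical sample with P['a'] = 0.25, P['b'] = 0.5, P['c'] = 0.25
    sample = "aabbbbcc"
    print(entropy_bits(sample))  # 1.5
```

Here 'b' would ideally get a 1-bit code and 'a' and 'c' 2-bit codes, which averages to the 1.5 bits printed.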
Semi-static modeling
In a first pass over the text, we calculate the probability of each character (i.e. i–P[i]), then we use equation(*1) to set the length of each character's code.
Adaptive modeling
We start with a flat (uniform) PDF over the characters, then estimate the probability of each character only from the text received so far. E.g. for a 1000-character passage, when encoding or decoding has reached the 400th character and the character 'u' has been seen 20 times in these 400 characters, we set P['u']=20.0/400. Because the estimate depends only on text that has already been processed, the encoder and the decoder share the same PDF model. To avoid the 'zero-frequency' issue, we initialize every character with a count of 1; with the 26 letters of the example this turns the estimate above into 21.0/426.
Canonical Huffman modeling
Take an ordinary (explicit-tree) Huffman model first. Decoding an alphabet of n symbols requires a Huffman tree with n-1 inner nodes and n leaves; each inner node holds 2 pointers and each leaf holds its symbol. In total we need roughly 4n words of memory for n symbols, so in practice an alphabet of 1M symbols (e.g. a word-level model) can take up to 16MB.
By comparison, canonical Huffman decoding needs only about n+100 words of memory.
A canonical Huffman code is one particular member of the set of Huffman codes: the code lengths are the same, only the code words are assigned in a fixed order.
First, we give the principles and some parameters:
Principles:
(*1). code words of the same length are consecutive integers, e.g. 3D, 4D, 5D (the suffix D marks a decimal value, b a binary one)
(*2). the first code of length i can be calculated from the last code of length i-1 using Equation(*2)
(*3). the first code of the minimal length is 0D
Parameters:
firstcode[i]: the first (smallest) code of length i, calculated with Equation(*2); it is an integer that is written out as i binary digits;
numl[i]: the total number of codes of length i;
index[i]: the index in the dictionary of the first code of length i.
Equation(*2): firstcode[i] = 2*(last_code[i-1]+1), firstcode[min_len] = 0
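The three parameters can be derived from the code lengths alone. The following is a minimal sketch under the conventions of this post (shorter codes get smaller values); the function name build_tables and the choice to store each table as a list indexed by code length are assumptions of this example, not something taken from the referenced article.

```python
def build_tables(lengths):
    """lengths: the code length of every symbol.

    Returns (firstcode, numl, index), each a list indexed by code length.
    """
    min_len, max_len = min(lengths), max(lengths)

    # numl[l]: how many codes have length l
    numl = [0] * (max_len + 1)
    for l in lengths:
        numl[l] += 1

    firstcode = [0] * (max_len + 1)   # firstcode[min_len] stays 0
    index = [0] * (max_len + 1)
    for l in range(min_len + 1, max_len + 1):
        last_code = firstcode[l - 1] + numl[l - 1] - 1
        firstcode[l] = 2 * (last_code + 1)        # the recurrence above
        index[l] = index[l - 1] + numl[l - 1]
    return firstcode, numl, index


if __name__ == "__main__":
    # 'a' has length 3, 'b'~'i' length 4, 'j'~'u' length 5
    lengths = [3] + [4] * 8 + [5] * 12
    firstcode, numl, index = build_tables(lengths)
    for l in (3, 4, 5):
        print(l, format(firstcode[l], "0{}b".format(l)), numl[l], index[l])
```

For these lengths the loop prints 000, 0010 and 10100 as the first codes of length 3, 4 and 5.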
Second, construct the code words:
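e.g.
Take the characters 'a'~'u' with code lengths 'a'-3, ('b':'i')-4, ('j':'u')-5. With Principle-3 'a' gets the code '000b'. With Principle-2 'b' gets 2*(0+1) = '0010b', and with Principle-1 'c' gets '0011b', and so on up to 'i' with '1001b'. Principle-2 then gives 'j' the code 2*(9+1) = '10100b'.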
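The worked example can be reproduced with a few lines of code. This sketch assumes the symbols are already sorted by code length; assign_codes is a hypothetical helper name, and the codes are returned as bit strings only for readability.

```python
def assign_codes(symbols, lengths):
    """Give each symbol a canonical code, as a string of '0'/'1'.

    symbols and lengths are parallel sequences sorted by code length.
    """
    codes = {}
    code = 0                      # the first code of minimal length is 0
    prev_len = lengths[0]
    for sym, length in zip(symbols, lengths):
        # stepping to a longer length doubles the value once per extra bit
        code <<= length - prev_len
        codes[sym] = format(code, "0{}b".format(length))
        code += 1                 # codes of the same length are consecutive
        prev_len = length
    return codes


if __name__ == "__main__":
    symbols = "abcdefghijklmnopqrstu"
    lengths = [3] + [4] * 8 + [5] * 12
    codes = assign_codes(symbols, lengths)
    print(codes["a"], codes["b"], codes["c"], codes["j"])  # 000 0010 0011 10100
```

Incrementing before shifting is what yields 2*(last_code+1) whenever the length grows by one bit.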
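Finally, the decoding algorithm:
It relies on this property: for i > j, the value of the first j bits of a code of length i is greater than the value of any code of length j.
So we first find the actual length i of the next pending code by reading bits until the value no longer exceeds the last code of the current length; the offset between the code and firstcode[i] then gives us the location in the dictionary.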
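A minimal decoding sketch for the 'a'~'u' example is given below. It hard-codes the tables for that example and emits a character as soon as the value read so far falls inside the range of codes of the current length. The function name decode and the bit-string input format are assumptions made for illustration.

```python
# Tables for the 'a'~'u' example, indexed by code length (only 3..5 are used).
FIRSTCODE = [0, 0, 0, 0, 2, 20]
NUML = [0, 0, 0, 1, 8, 12]
INDEX = [0, 0, 0, 0, 1, 9]
TABLE = "abcdefghijklmnopqrstu"


def decode(bits):
    """Decode a string of '0'/'1' characters into text."""
    out = []
    value = 0
    length = 0
    for bit in bits:
        value = 2 * value + int(bit)
        length += 1
        offset = value - FIRSTCODE[length]
        if 0 <= offset < NUML[length]:
            # value is a complete code of this length
            out.append(TABLE[INDEX[length] + offset])
            value = 0
            length = 0
    return "".join(out)


if __name__ == "__main__":
    # 000 | 0010 | 0011 | 10100
    print(decode("0000010001110100"))  # abcj
```

No tree is needed: apart from the character table, the decoder only keeps three small arrays with one entry per code length.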
Python code:
#!/usr/bin/env python
'''
Character-level example restricted to the 26 English letters.
Canonical Huffman encoding is not implemented; only decoding is shown.
'''


def lines(text):
    '''Yield the characters of text one by one, then a final newline.'''
    for ch in text:
        yield ch
    yield '\n'


def blocks(text):
    '''Yield the whitespace-separated words of text.'''
    b = []
    for ch in lines(text):
        if ch.strip():
            b.append(ch)
        elif b:
            yield ''.join(b)
            b = []


def is_english_letter(ch):
    return 'a' <= ch <= 'z' or 'A' <= ch <= 'Z'


def word_index(path='./casual/te.txt'):
    '''
    Build a sorted vocabulary of the words made of English letters only.
    Undecodable bytes are ignored and case is folded ('A' equals 'a').
    '''
    vocabulary = set()
    with open(path, encoding='utf-8', errors='ignore') as f:
        for word in blocks(f.read()):
            # throw the word away if any character is not an English letter
            if all(is_english_letter(ch) for ch in word):
                vocabulary.add(word.lower())
    vocabulary = sorted(vocabulary)
    print(vocabulary)
    print(len(vocabulary))
    return vocabulary


def semiStaticModeling(path='./casual/te.txt'):
    '''
    Semi-static model for Huffman codes: read through the whole passage
    first, then compute the probability of each of the 26 letters.
    '''
    chars = [0] * 26
    total = 0
    with open(path, encoding='utf-8', errors='ignore') as f:
        for ch in f.read().lower():
            if 'a' <= ch <= 'z':
                chars[ord(ch) - ord('a')] += 1
                total += 1
    for count in chars:
        print(count / total if total else 0.0)
    return chars


def adaModel(path):
    '''
    Adaptive, zero-order, character-level model.
    To avoid the 'zero-frequency' problem each of the 26 letters starts
    with a count of 1. Only the probability of 'a' is reported as an example.
    filedes: the position (number of letters read) in the passage
    '''
    filedes = 0
    counts = [1] * 26
    pa = 1.0 / 26
    with open(path, encoding='utf-8', errors='ignore') as f:
        for ch in f.read().lower():
            if 'a' <= ch <= 'z':
                counts[ord(ch) - ord('a')] += 1
                filedes += 1
                pa = counts[0] / (filedes + 26)
    print(filedes, pa)
    return counts


def canonicalM(code, firstcode, numl, index, table):
    '''
    DECODE one code word with the canonical Huffman method.
    Attributes:
        code: string of '0'/'1' characters starting at the next code word
        firstcode[i]: the first code of length i, calculated with (*2)
        numl[i]: total number of codes of length i
        index[i]: the index in table of the first code of length i
        table: the characters, stored in code order
    (*2): firstcode[i]=2*(last_code[i-1]+1), firstcode[min_len]=0
    e.g. --http://blog.csdn.net/goncely/article/details/616589
        firstcode[3:5] = 000b, 0010b, 10100b
        numl[3:5] = 1(a), 8(b~i), 12(j~u)
        index[3:5] = 0, 1, 9
    All three lists are indexed by code length; unused lengths hold 0.
    '''
    bits = iter(code)
    l = 1
    v = int(next(bits))
    # keep reading bits while v lies beyond the last code of length l
    while v - firstcode[l] >= numl[l]:
        v = 2 * v + int(next(bits))
        l += 1
    # l is now the right length
    symbol = table[index[l] + v - firstcode[l]]
    print(symbol)
    return symbol
Reference: http://blog.csdn.net/goncely/article/details/616589