LZW Data Compression
Source: Internet. Editor: 程序博客网. Time: 2024/06/11 05:15
Reposted:
Overview
If you were to look at almost any data file on a computer, character by character, you would notice many recurring patterns. LZW is a data compression method that takes advantage of this repetition. The original version of the method was created by Lempel and Ziv in 1978 (LZ78) and was further refined by Welch in 1984, hence the LZW acronym. Like any adaptive (dynamic) compression method, the idea is to (1) start with an initial model, (2) read data piece by piece, and (3) update the model and encode the data as you go along.
LZW is a "dictionary"-based compression algorithm. This means that instead of tabulating character counts and building trees (as in Huffman encoding), LZW encodes data by referencing a dictionary. Thus, to encode a substring, only a single code number, corresponding to that substring's index in the dictionary, needs to be written to the output file. Although LZW is often explained in the context of compressing text files, it can be used on any type of file. However, it generally performs best on files with repeated substrings, such as text files.
Compression
LZW starts out with a dictionary of 256 characters (in the case of 8 bits) and uses those as the "standard" character set. It then reads data 8 bits at a time (e.g., 't', 'r', etc.) and encodes the data as the number that represents its index in the dictionary. Every time it comes across a new substring (say, "tr"), it adds it to the dictionary; every time it comes across a substring it has already seen, it reads in one more character and concatenates it with the current string to get a new substring. The next time LZW revisits a substring, it will be encoded using a single number. Usually a maximum number of entries (say, 4096) is defined for the dictionary, so that the process doesn't run away with memory. Thus, the codes that take the place of the substrings in this example are 12 bits long (2^12 = 4096). The codes must be longer in bits than the characters (12 vs. 8 bits), but since many frequently occurring substrings will be replaced by a single code, compression is achieved in the long run.
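To make the 12-bit code width concrete, here is a minimal Python sketch of one way such codes could be written to a byte stream. The function name `pack_12bit`, the big-endian bit order, and the zero padding of a trailing half byte are all assumptions made for illustration; the text above does not specify an output layout.

```python
def pack_12bit(codes):
    """Pack 12-bit codes into bytes, two codes per three bytes (assumed layout)."""
    out = bytearray()
    for i in range(0, len(codes), 2):
        first = codes[i]
        has_second = i + 1 < len(codes)
        # When the number of codes is odd, pad the last half byte with zeros.
        second = codes[i + 1] if has_second else 0
        if not (0 <= first < 4096 and 0 <= second < 4096):
            raise ValueError("code does not fit in 12 bits")
        out.append(first >> 4)                             # top 8 bits of first
        out.append(((first & 0x0F) << 4) | (second >> 8))  # low 4 + top 4 bits
        if has_second:
            out.append(second & 0xFF)                      # low 8 bits of second
    return bytes(out)
```

With this layout every pair of codes costs three bytes, so a code is only a saving when it stands for a substring of two or more characters.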
Here's what the compression loop might look like in pseudocode:
    string s;
    char ch;
    ...
    s = empty string;
    while (there is still data to be read)
    {
        ch = read a character;
        if (dictionary contains s+ch)
        {
            s = s+ch;
        }
        else
        {
            encode s to output file;
            add s+ch to dictionary;
            s = ch;
        }
    }
    encode s to output file;
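The loop above translates fairly directly into Python. The following is a minimal sketch rather than a reference implementation: the name `lzw_compress`, the choice to return a list of integer codes instead of writing to a file, and the policy of simply not adding entries once the dictionary is full are all assumptions.

```python
def lzw_compress(data, max_entries=4096):
    """Return the list of LZW codes for a bytes object."""
    # Initial dictionary: every single-byte string maps to its own value.
    dictionary = {bytes([i]): i for i in range(256)}
    codes = []
    s = b""
    for byte in data:
        candidate = s + bytes([byte])
        if candidate in dictionary:
            # Known substring: keep extending it.
            s = candidate
        else:
            codes.append(dictionary[s])
            # Assumed policy: stop growing the dictionary once it is full.
            if len(dictionary) < max_entries:
                dictionary[candidate] = len(dictionary)
            s = bytes([byte])
    if s:
        # Flush the final substring.
        codes.append(dictionary[s])
    return codes
```

For example, `lzw_compress(b"abababab")` returns `[97, 98, 256, 258, 98]`: the first new entries, "ab" and "ba", receive indexes 256 and 257, and "aba" receives 258.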
Now, suppose the input stream we wish to compress is "banana_bandana", and that we are using only this initial dictionary:
    Index  Entry
    0      a
    1      b
    2      d
    3      n
    4      _ (space)
The encoding steps would proceed like this:
Notice that after the last character, "a", is read, the final substring, "ana", must be output.
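The individual steps for this example can be regenerated with a short Python sketch. The helper `lzw_trace` and its row layout (encoded substring, code written, new dictionary entry, new index) are hypothetical names chosen for illustration; the initial dictionary is the five-entry one given above.

```python
def lzw_trace(text, alphabet):
    """Encode text and return (codes, rows), one row per code written."""
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    codes, rows = [], []
    s = ""
    for ch in text:
        if s + ch in dictionary:
            s += ch
        else:
            codes.append(dictionary[s])
            dictionary[s + ch] = len(dictionary)
            rows.append((s, dictionary[s], s + ch, dictionary[s + ch]))
            s = ch
    if s:
        # The final substring is written without creating a new entry.
        codes.append(dictionary[s])
        rows.append((s, dictionary[s], None, None))
    return codes, rows


codes, rows = lzw_trace("banana_bandana", "abdn_")
for encoded, code, new_entry, new_index in rows:
    print(encoded, code, new_entry, new_index)
```

Under these assumptions the codes come out as 1, 0, 3, 6, 0, 4, 5, 3, 2, 8, and the last row is the flush of "ana" (index 8) with no new dictionary entry.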
Uncompression
The uncompression process for LZW is also straightforward. In addition, it has an advantage over static compression methods: no dictionary or other overhead information has to be supplied to the decoding algorithm, because a dictionary identical to the one created during compression is reconstructed during decoding. Both encoding and decoding programs must start with the same initial dictionary, in this case all 256 possible 8-bit character values.
Here's how it works. The LZW decoder first reads in an index (an integer), looks up the index in the dictionary, and outputs the substring associated with it. The first character of this substring is concatenated to the current working string, and the result is added to the dictionary (mirroring how substrings were added during compression). The decoded string then becomes the current working string (i.e., the substring for the current index is remembered), and the process repeats.
Again, here's what it might look like:
    string entry;
    char ch;
    int prevcode, currcode;
    ...
    prevcode = read in a code;
    decode/output prevcode;
    while (there is still data to read)
    {
        currcode = read in a code;
        entry = translation of currcode from dictionary;
        output entry;
        ch = first char of entry;
        add ((translation of prevcode)+ch) to dictionary;
        prevcode = currcode;
    }
There is one case in which this algorithm fails: when the encoded stream calls for an index that has not yet been entered in the dictionary (e.g., index 31 is requested while entry 31 is still being constructed). An example from Sayood helps illustrate the point. Suppose you had the string abababab..... and an initial dictionary of just a and b with indexes 0 and 1, respectively. The encoding process begins:
So, the encoded output starts out 0,1,2,4,... . When we start trying to decode, a problem arises (in the table below, keep in mind that the Current String is just the substring that was decoded/translated in the last pass of the loop. Also, the New Dictionary Entry is created by concatenating the Current String with the first character of the new Dictionary Translation):
(In other words: the New Dictionary Entry is formed by concatenating the Current String with the first character of the Dictionary Translation.)
As you can see, the decoder comes across an index of 4 while the entry that belongs there is still being constructed. To understand why this happens, take a look at the encoding table. Immediately after "aba" (with an index of 4) is entered into the dictionary, the next substring that is encoded is an "aba" (i.e., the very next code written to the encoded output file is a 4). Thus, this special case can occur only if the substring begins and ends with the same character ("aba" is of the form <char><string><char>). So, to deal with this exception, you simply take the substring you have so far, "ab", and concatenate its first character to itself, "ab"+"a" = "aba", instead of following the normal procedure. Therefore the pseudocode provided above must be altered a bit in order to handle all cases.
(Personal understanding: what the passage above is trying to say is that there is an exceptional case. Take a substring of the form "aba", i.e., one whose first and last characters are the same. If a substring of this form has just entered the encode table and the very next substring to be encoded happens to be "aba" again, the situation above arises: the decoder knows code = 4 but cannot find a substring for 4 in the table. To handle it, you simply take the substring you have so far, "ab", and concatenate its first character to itself, "ab"+"a" = "aba". Concretely: whenever you cannot find the substring for a code, you can assume it is this "aba" case and take the substring to be the previous substring plus the first character of the previous substring.)
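The decoding steps for this example can be followed with a small Python sketch of the loop as given so far, without any special handling. The helper `naive_decode_trace` is a hypothetical name; it stops and reports the first code that has no dictionary entry yet.

```python
def naive_decode_trace(codes, alphabet):
    """Follow the decoding loop without the special case.

    Returns (output, failed_code); failed_code is None if every code was found.
    """
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    prev = dictionary[codes[0]]
    output = [prev]
    for code in codes[1:]:
        if code not in dictionary:
            # The requested entry has not been added yet.
            return output, code
        entry = dictionary[code]
        output.append(entry)
        # New entry = previous string + first character of the current one.
        dictionary[len(dictionary)] = prev + entry[0]
        prev = entry
    return output, None


output, failed_code = naive_decode_trace([0, 1, 2, 4], "ab")
print(output, failed_code)
```

For the codes 0, 1, 2, 4 the sketch decodes "a", "b", and "ab", adding "ab" at index 2 and "ba" at index 3, and then stops at code 4.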
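The fix described above can be sketched in Python as follows. The name `lzw_decompress` and the small-alphabet interface are assumptions for illustration. The check `code == len(dictionary)` is also an added safeguard not taken from the text: it applies the special case only to the entry currently being built and rejects any other unknown code.

```python
def lzw_decompress(codes, alphabet):
    """Decode a list of LZW codes given the initial alphabet."""
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    if not codes:
        return ""
    prev = dictionary[codes[0]]
    pieces = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # Special case: the code refers to the entry being built right now,
            # so it must be the previous string plus its own first character.
            entry = prev + prev[0]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        pieces.append(entry)
        dictionary[len(dictionary)] = prev + entry[0]
        prev = entry
    return "".join(pieces)
```

With the two-entry dictionary, `lzw_decompress([0, 1, 2, 4], "ab")` returns `"abababa"`; with the five-entry dictionary from the compression example, the codes 1, 0, 3, 6, 0, 4, 5, 3, 2, 8 decode back to `"banana_bandana"`.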
Reference code:
    public override void Decompress(BinaryReader reader, BinaryWriter writer)
    {
        // Start from the 256 single-character entries.
        List<string> list = new List<string>();
        for (int i = 0; i < 256; i++)
        {
            list.Add(((char)i).ToString());
        }

        byte firstByte = (byte)ReadCode(reader);
        string match = ((char)firstByte).ToString();
        writer.Write(firstByte);

        int lastPercent = 0;
        while (reader.BaseStream.Position < reader.BaseStream.Length)
        {
            lastPercent = RaiseEvent(reader, lastPercent);
            int nextCode = ReadCode(reader);
            string nextMatch = null;
            // list.Count changes as entries are added
            if (nextCode < list.Count)
            {
                nextMatch = list[nextCode];
            }
            else
            {
                // Special case: the code is not in the dictionary yet
                nextMatch = match + match[0];
            }

            foreach (char c in nextMatch)
            {
                writer.Write((byte)c);
            }

            list.Add(match + nextMatch[0]);
            match = nextMatch;
        }
        RaiseFinishEvent();
    }