
1. Make only one pass through the data for each character position, rather than two as at present?

2. Add the ability to handle integer or floating-point data rather than character data?

3. Add the ability to handle variable-length strings as keys?

(You can find suggested approaches to these problems in Chapter artopt.htm.)

Footnotes

1. 250 blocks of 1000 records, at 16 bytes of overhead per block, yield an overhead of 250*16, or 4000 bytes.

2. A description of this sort can be found in Donald Knuth's book The Art of Computer Programming, vol. 3 (Reading, Massachusetts: Addison-Wesley, 1968). Every programmer should be aware of this book, since it describes a great number of generally applicable algorithms. As I write this, a new edition of this classic work on algorithms is about to be published.

3. However, distribution counting is not suitable for use where the data will be kept in virtual memory. See my article "Galloping Algorithms," in Windows Tech Journal, 2 (February 1993), 40-43, for details on this limitation.

4. As is standard in C and C++ library implementations, this version of qsort requires the user to supply the address of a function that will compare the items to be sorted. While this additional overhead biases the comparison slightly against qsort, this small disadvantage is not of the same order of magnitude as the difference in inherent efficiency of the two algorithms.
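As a brief illustration of that calling convention, here is a minimal sketch of the standard qsort with a user-supplied comparison function; the data and the comparator are illustrative only:

#include <stdio.h>
#include <stdlib.h>

/* Comparison function whose address is passed to qsort; like strcmp,
   it must return a negative, zero, or positive result. */
static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);   /* avoids overflow, unlike x - y */
}

int main(void)
{
    int data[] = { 42, 7, 19, 3, 25 };
    size_t n = sizeof data / sizeof data[0];

    qsort(data, n, sizeof data[0], compare_ints);

    for (size_t i = 0; i < n; i++)
        printf("%d ", data[i]);
    putchar('\n');              /* prints 3 7 19 25 42 */
    return 0;
}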

5. Of course, we could just as well have used the keys that start with any other character.

6. There are some situations where this is not strictly true. For example, suppose we want to read a large fraction of the records in a file in physical order, and the records are only a few hundred bytes or less. In that case, it is almost certainly faster to read them all with a big buffer and skip the ones we aren't interested in. The gain from reading large chunks of data at once is likely to outweigh the time lost in reading some unwanted records.
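A minimal sketch of that chunked-read strategy; the file name, record size, chunk size, and want_record() selector are all assumptions for illustration, not details from the book:

#include <stdio.h>
#include <stdlib.h>

#define RECORD_SIZE       256    /* assumed size of one record */
#define RECORDS_PER_CHUNK 4096   /* read about 1 MB at a time  */

/* Hypothetical selector for the records we actually want. */
static int want_record(long n) { return n % 3 == 0; }

int main(void)
{
    FILE *f = fopen("records.dat", "rb");   /* hypothetical file */
    char *buf = malloc((size_t)RECORD_SIZE * RECORDS_PER_CHUNK);
    long rec_no = 0;
    size_t got;

    if (f == NULL || buf == NULL)
        return 1;
    /* Read big chunks sequentially and skip unwanted records in
       memory, rather than seeking to each wanted record. */
    while ((got = fread(buf, RECORD_SIZE, RECORDS_PER_CHUNK, f)) > 0) {
        for (size_t i = 0; i < got; i++, rec_no++) {
            if (want_record(rec_no)) {
                /* process buf + i * RECORD_SIZE here */
            }
        }
    }
    free(buf);
    fclose(f);
    return 0;
}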

7. Please note that there is a capacity constraint in this program relating to the total number of ZIP entries that can be processed. See Figure mail.00a for details on this constraint.

Cn U Rd Ths (Qkly)? A Data Compression Utility


Introduction

In this chapter we will examine the Huffman coding and arithmetic coding methods of data compression and develop an implementation of the latter algorithm. The arithmetic coding algorithm allows a tradeoff between memory consumption and compression ratio; our emphasis will be on minimum memory consumption, on the assumption that the result would eventually be used as embedded code within a larger program.

Algorithms Discussed

Huffman Coding, Arithmetic Coding, Lookup Tables

Huffman Coding

Huffman coding is widely recognized as the most efficient method of encoding characters for data compression. This algorithm is a way of encoding different characters in different numbers of bits, with the most common characters encoded in the fewest bits. For example, suppose we have a message made up of the letters 'A', 'B', and 'C', which can occur in any combination. Figure huffman.freq shows the relative frequency of each of these letters and the Huffman code assigned to each one.

Huffman code table (Figure huffman.freq)

+--------+-----------+---------+--------------+
| Letter | Frequency | Huffman | Fixed-length |
|        |           |  Code   |     Code     |
+--------+-----------+---------+--------------+
|   A    |    1/4    |   00    |      00      |
|   B    |    1/4    |   01    |      01      |
|   C    |    1/2    |    1    |      10      |
+--------+-----------+---------+--------------+

The codes are determined by the frequencies: as mentioned above, the letter with the greatest frequency, 'C', has the shortest code, of one bit. The other two letters, 'A' and 'B', have longer codes. On the other hand, the simplest code to represent any of the three characters would use two bits for each character. How would the length of an encoded message be affected by using the Huffman code rather than the fixed-length one?


Let's encode the message "CABCCABC" using both codes. The results are shown in Figure huffman.fixed.

Huffman vs. fixed-length coding (Figure huffman.fixed)

+--------+---------+--------------+
| Letter | Huffman | Fixed-length |
|        |  Code   |     Code     |
+--------+---------+--------------+
|   C    |    1    |      10      |
|   A    |   00    |      00      |
|   B    |   01    |      01      |
|   C    |    1    |      10      |
|   C    |    1    |      10      |
|   A    |   00    |      00      |
|   B    |   01    |      01      |
|   C    |    1    |      10      |
+--------+---------+--------------+

Total bits used: 12 (Huffman) vs. 16 (fixed-length)

Here we have saved one-fourth of the bits required to encode this message; often, the compression can be much greater. Since we ordinarily use an eight-bit ASCII code to represent characters, if one of those characters (such as carriage return or line feed) accounts for a large fraction of the characters in a file, giving it a short code of two or three bits can reduce the size of the file noticeably.
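To make the comparison concrete, here is a minimal C sketch that applies the codes from Figure huffman.freq to the same message and counts the bits. The huffman_code helper and the printed summary are illustrative only, not the book's implementation.

#include <stdio.h>
#include <string.h>

/* Huffman codes from Figure huffman.freq: A -> 00, B -> 01, C -> 1 */
static const char *huffman_code(char c)
{
    switch (c) {
    case 'A': return "00";
    case 'B': return "01";
    case 'C': return "1";
    default:  return "";   /* characters outside the alphabet */
    }
}

int main(void)
{
    const char *msg = "CABCCABC";
    int total_bits = 0;

    for (int i = 0; msg[i] != '\0'; i++) {
        const char *code = huffman_code(msg[i]);
        fputs(code, stdout);              /* emit the code as text */
        total_bits += (int)strlen(code);
    }
    /* Prints 100011100011, then 12 bits vs. 16 for fixed-length. */
    printf("\n%d bits (fixed-length: %d)\n",
           total_bits, 2 * (int)strlen(msg));
    return 0;
}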

Let's see how arithmetic coding1 would encode the first three characters of the same message, "CAB".

One-character messages (Figure aritha1)

+---------+------+------+-----------------------+----------+---------+--------+
| Message | Freq | Cum  |        Codes          | Previous | Current | Output |
|         |      | Freq |                       |  Output  | Output  | So Far |
+---------+------+------+-----------------------+----------+---------+--------+
|   A     |  16  |  16  | 000000( 0)-001111(15) |   None   |   00    |   00   |
|   B     |  16  |  32  | 010000(16)-011111(31) |   None   |   01    |   01   |
| * C     |  32  |  64  | 100000(32)-111111(63) |   None   |   1     |   1    |
+---------+------+------+-----------------------+----------+---------+--------+

Figure aritha1 is the first of several figures which contain the information needed to determine how arithmetic coding would encode messages of up to three characters from an alphabet consisting of the letters 'A', 'B', and 'C', with frequencies of 1/4, 1/4, and 1/2, respectively. The frequency of a message composed of three characters chosen independently is the product of the frequencies of those characters. Since the lowest common denominator of these three fractions is 1/4, the frequency of any three-character message will be a multiple of (1/4)^3, or 1/64. For example, the frequency of the message "CAB" will be (1/2)*(1/4)*(1/4), or 1/32 (= 2/64). For this reason, we will express all of the frequency values in terms of 1/64ths.
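Since every frequency is a multiple of 1/64, the message frequencies in these figures can be computed with integer arithmetic alone. A minimal sketch, assuming the frequencies above (the freq64 helper and the output format are illustrative, not from the book):

#include <stdio.h>

/* Character frequencies in 64ths, assumed from the text:
   A = 16/64 (1/4), B = 16/64 (1/4), C = 32/64 (1/2). */
static int freq64(char c) { return c == 'C' ? 32 : 16; }

int main(void)
{
    const char *msg = "CAB";
    long num = 1;

    /* Each factor is f/64, so three factors give (f1*f2*f3)/64^3;
       expressed in 64ths, that is f1*f2*f3 / 64^2. */
    for (int i = 0; msg[i] != '\0'; i++)
        num *= freq64(msg[i]);
    printf("frequency of \"%s\" = %ld/64\n", msg, num / (64L * 64L));
    return 0;
}

For "CAB" this prints 2/64, matching the calculation above.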

Thus, the "Freq." column signifies the expected frequency of occurrence of each message, in units of 1/64; the "Cum Freq." column accumulates the values in the first column; the "Codes" column indicates the range of codes that can represent each message2; the "Previous Output" column shows the bits that have been output before the current character was encoded; the "Current Output" column indicates what output we can produce at this point in the encoding process; and the "Output

So Far" column shows the cumulative output for that message, starting with the first character encoded

As the table indicates, since the first character happens to be a 'C', we can output "1", because all possible messages starting with 'C' have codes starting with a "1". Let's continue with Figure aritha2 to see the encoding for a two-character message.

Two-character messages (Figure aritha2)

+---------+------+------+-----------------------+----------+---------+--------+
| Message | Freq | Cum  |        Codes          | Previous | Current | Output |
|         |      | Freq |                       |  Output  | Output  | So Far |
+---------+------+------+-----------------------+----------+---------+--------+
|   AA    |  4   |  4   | 000000(00)-000011(03) |    00    |   00    |  0000  |
|   AB    |  4   |  8   | 000100(04)-000111(07) |    00    |   01    |  0001  |
|   AC    |  8   |  16  | 001000(08)-001111(15) |    00    |   1     |  001   |
|   BA    |  4   |  20  | 010000(16)-010011(19) |    01    |   00    |  0100  |
|   BB    |  4   |  24  | 010100(20)-010111(23) |    01    |   01    |  0101  |
|   BC    |  8   |  32  | 011000(24)-011111(31) |    01    |   1     |  011   |
| * CA    |  8   |  40  | 100000(32)-100111(39) |    1     |   00    |  100   |
|   CB    |  8   |  48  | 101000(40)-101111(47) |    1     |   01    |  101   |
|   CC    |  16  |  64  | 110000(48)-111111(63) |    1     |   1     |  11    |
+---------+------+------+-----------------------+----------+---------+--------+

After encoding the first two characters of our message, "CA", our cumulative output is "100", since the range of codes for messages starting with "CA" is from "100000" to "100111"; all these codes start with "100". The whole three-character message is encoded as shown in Figure aritha3.


We have generated exactly the same output from the same input as we did with Huffman coding. So far, this seems to be an exercise in futility; is arithmetic coding just another name for Huffman coding?

These two algorithms provide the same compression efficiency only when the frequencies of the characters to be encoded happen to be representable as integral powers of 1/2, as was the case in our examples so far; however, consider the frequency table shown in Figure huffman.poor.

Three-character messages (Figure aritha3)

+---------+------+------+-----------------------+----------+---------+--------+
| Message | Freq | Cum  |        Codes          | Previous | Current | Output |
|         |      | Freq |                       |  Output  | Output  | So Far |
+---------+------+------+-----------------------+----------+---------+--------+
|   AAA   |  1   |  1   | 000000(00)-000000(00) |   0000   |   00    | 000000 |
|   AAB   |  1   |  2   | 000001(01)-000001(01) |   0000   |   01    | 000001 |
|   AAC   |  2   |  4   | 000010(02)-000011(03) |   0000   |   1     | 00001  |
|   ABA   |  1   |  5   | 000100(04)-000100(04) |   0001   |   00    | 000100 |
|   ABB   |  1   |  6   | 000101(05)-000101(05) |   0001   |   01    | 000101 |
|   ABC   |  2   |  8   | 000110(06)-000111(07) |   0001   |   1     | 00011  |
|   ACA   |  2   |  10  | 001000(08)-001001(09) |   001    |   00    | 00100  |
|   ACB   |  2   |  12  | 001010(10)-001011(11) |   001    |   01    | 00101  |
|   ACC   |  4   |  16  | 001100(12)-001111(15) |   001    |   1     | 0011   |
|   BAA   |  1   |  17  | 010000(16)-010000(16) |   0100   |   00    | 010000 |
|   BAB   |  1   |  18  | 010001(17)-010001(17) |   0100   |   01    | 010001 |
|   BAC   |  2   |  20  | 010010(18)-010011(19) |   0100   |   1     | 01001  |
|   BBA   |  1   |  21  | 010100(20)-010100(20) |   0101   |   00    | 010100 |
|   BBB   |  1   |  22  | 010101(21)-010101(21) |   0101   |   01    | 010101 |
|   BBC   |  2   |  24  | 010110(22)-010111(23) |   0101   |   1     | 01011  |
|   BCA   |  2   |  26  | 011000(24)-011001(25) |   011    |   00    | 01100  |
|   BCB   |  2   |  28  | 011010(26)-011011(27) |   011    |   01    | 01101  |
|   BCC   |  4   |  32  | 011100(28)-011111(31) |   011    |   1     | 0111   |
|   CAA   |  2   |  34  | 100000(32)-100001(33) |   100    |   00    | 10000  |
| * CAB   |  2   |  36  | 100010(34)-100011(35) |   100    |   01    | 10001  |
|   CAC   |  4   |  40  | 100100(36)-100111(39) |   100    |   1     | 1001   |
|   CBA   |  2   |  42  | 101000(40)-101001(41) |   101    |   00    | 10100  |
|   CBB   |  2   |  44  | 101010(42)-101011(43) |   101    |   01    | 10101  |
|   CBC   |  4   |  48  | 101100(44)-101111(47) |   101    |   1     | 1011   |
|   CCA   |  4   |  52  | 110000(48)-110011(51) |   11     |   00    | 1100   |
|   CCB   |  4   |  56  | 110100(52)-110111(55) |   11     |   01    | 1101   |
|   CCC   |  8   |  64  | 111000(56)-111111(63) |   11     |   1     | 111    |
+---------+------+------+-----------------------+----------+---------+--------+
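The entries in these figures can be reproduced mechanically: keep a current code range, narrow it by each character's share of the 64-unit frequency scale, and emit whatever leading bits the low and high ends of the range agree on. The following minimal C sketch encodes "CAB" this way and prints "10001", matching the starred row above. It works without renormalization only because the whole message fits in six bits; a practical coder must renormalize as it goes. All names are illustrative.

#include <stdio.h>

/* Cumulative frequency bounds in 64ths, assumed from Figure aritha1:
   'A' occupies [0,16), 'B' [16,32), 'C' [32,64). */
static const int cum_low[]  = {  0, 16, 32 };
static const int cum_high[] = { 16, 32, 64 };

int main(void)
{
    const char *msg = "CAB";
    int low = 0, high = 64;            /* current code range [low, high) */

    for (int i = 0; msg[i] != '\0'; i++) {
        int sym = msg[i] - 'A';
        int width = high - low;
        high = low + width * cum_high[sym] / 64;   /* uses the old low */
        low  = low + width * cum_low[sym]  / 64;
    }
    /* Emit the common leading bits of the 6-bit codes for low and
       high-1; these are the bits every code in the range shares. */
    for (int bit = 5; bit >= 0; bit--) {
        int b_low  = (low >> bit) & 1;
        int b_high = ((high - 1) >> bit) & 1;
        if (b_low != b_high)
            break;
        putchar('0' + b_low);
    }
    putchar('\n');                     /* prints 10001 for "CAB" */
    return 0;
}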
