Huffman encoding is therefore only effective for long files where the savings in the message is enough to offset the cost, or in situations where the coding trie can be precomputed and u
Trang 1the message: this means that we need to save the tree along with the message
in order to decode it Fortunately, this does not present any real difficulty
It is actually necessary only to store the code array, because the radix search trie which results from inserting the entries from that array into an initially empty tree is the decoding tree
Thus, the storage savings quoted above is not entirely accurate, because the message can’t be decoded without the trie and we must take into account the cost of storing the trie (i.e., the code array) along with the message Huffman encoding is therefore only effective for long files where the savings in the message is enough to offset the cost, or in situations where the coding trie can be precomputed and used for a large number of messages For example, a trie based on the frequencies of occurrence of letters in the English language could be used for text documents For that matter, a trie based on the frequency of occurrence of characters in Pascal programs could be used for encoding programs (for example, “;” is likely to be near the top of such a trie) A Huffman encoding algorithm saves about 23% when run on the text for this chapter
As before, for truly random files, even this clever encoding scheme won’t work because each character will occur approximately the same number of times, which will lead to a fully balanced coding tree and an equal number of bits per letter in the code
I
Trang 21
2.
3.
4.
5.
6.
7.
8.
9.
10.
Exercises
Implement compression and expansion procedures for the run-length en-coding method for a fixed alphabet described in the text, using Q as the escape character
Could “QQ” occur somewhere in a file compressed using the method described in the text? Could “QQ&” occur?
Implement compression and expansion procedures for the binary file en-coding method described in the text
The letter “q” given in the text can be processed as a sequence of five-bit characters Discuss the pros and cons of doing so in order to use a character-based run-length encoding method
Draw a Huffman coding tree for the string “ABRACADABRA.” How many bits does the encoded message require?
What is the Huffman code for a binary file? Give an example showing the maximum number of bits that could be used in a Huffman code for a N-character ternary (three-valued) file
Suppose that the frequencies of the occurrence of all the characters to be encoded are different Is the Huffman encoding tree unique?
Huffman coding could be extended in a straightforward way t,o encode
in two-bit characters (using 4-way trees) What would be the main advantage and the main disadvantage of doing so?
What would be the result of breaking up a Huffman-encoded string into five-bit characters and Huffman encoding that string?
Implement a procedure to decode a Huffman-encoded string, given the code and len arrays.
Trang 3In the previous chapter we looked at methods for encoding strings of characters to save space Of course, there is another very important reason to encode strings of characters: to keep them secret
Cryptology, the study of systems for secret communications, consists of two competing fields of study: cryptography, the design of secret
communica-tions systems, and cryptanalysis, the study of ways to compromise secret com-munications systems The main application of cryptology has been in military and diplomatic communications systems, but other significant applications are becoming apparent Two principal examples are computer file systems (where each user would prefer to keep his files private) and “electronic funds transfer” systems (where very large amounts of money are involved) A com-puter user wants to keep his comcom-puter files just as private as papers in his file cabinet, and a bank wants electronic funds transfer to be just as secure
as funds transfer by armored car
Except for military applications, we assume that cryptographers are “good guys” and cryptanalysts are “bad guys”: our goal is to protect our computer files and our bank accounts from criminals If this point of view seems some-what unfriendly, it must be noted (without being over-philosophical) that by using cryptography one is assuming the existence of unfriendliness! Of course, even “good guys” must know something about cryptanalysis, since the very best way to be sure that a system is secure is to try to compromise it yourself (Also, there are several documented instances of wars being brought to an end, and many lives saved, through successes in cryptanalysis.)
Cryptology has many close connections with computer science and al-gorithms, especially the arithmetic and string-processing algorithms that we have studied Indeed, the art (science?) of cryptology has an intimate relation-ship with computers and computer science that is only beginning to be fully understood Like algorithms, cryptosystems have been around far longer
295
Trang 4296 CHAPTER 23
than computers Secrecy system design and algorithm design have a common heritage, and the same people are attracted to both
It is not entirely clear which branch of cryptology has been affected most
by the availability of computers The cryptographer now has available a much more powerful encryption machine than before, but this also gives him more room to make a mistake The cryptanalyst has much more powerful tools for breaking codes, but the codes to be broken are more complicated than ever before Cryptanalysis can place an incredible strain on computational resources; not only was it among the first applications areas for computers, but it still remains a principal applications area for modern supercomputers More recently, the widespread use of computers has led to the emergence
of a variety of important new applications for cryptology, as mentioned above New cryptographic methods have recently been developed appropriate for such applications, and these have led to the discovery of a fundamental relationship between cryptology and an important area of theoretical computer science that we’ll examine briefly in Chapter 40
In this chapter, we’ll examine some of the basic characteristics of cryp-tographic algorithms because of the importance of cryptography in modern computer systems and because of close relationships with many of the algo-rithms we have studied We’ll refrain from delving into detailed implementa-tions: cryptography is certainly a field that should be left to experts While it’s not difficult to “keep people honest” by encrypting things with a simple cryptographic algorithm, it is dangerous to rely upon a method implemented
by a non-expert
Rules of the Game
All the elements that go into providing a means for secure communications between two individuals together are called a cryptosystem The canonical
structure of a typical cryptosystem is diagramed below:
“attack
at dawn”
The sender (S) wishes to send a message (called the plaintezt) to the
receiver (R) To do so, he transforms the plaintext into a secret form suitable
Trang 5for transmission (called the ciphertext) using a cryptographic algorithm (the encryption method) and some key (K) parameters To read the message, the receiver must have a matching cryptographic algorithm (the decryption
method) and the same key parameters, which he can use to transform the
ciphertext back into the plaintext, the message It is usually assumed that the ciphertext is sent over insecure communications lines and is available to the cryptanalyst (A) It also is usually assumed that the encryption and decryption methods are known to the cryptanalyst: his aim is to recover the plaintext from the ciphertext without knowing the key parameters Note that the whole system depends on some separate prior method of communication between the sender and receiver to agree on the key parameters As a rule, the more key parameters, the more secure the cryptosystem is but the more inconvenient it is to use This situation is akin to that for more conventional security systems: a combination safe is more secure with more numbers on the combination lock, but it is harder to remember the combination The parallel with conventional systems also serves as a reminder that any security system
is only as secure as the trustworthiness of the people that have the key
It is important to remember that economic questions play a central role
in cryptosystems There is an economic motivation to build simple encryption and decryption devices (since many may need to be provided and complicated devices cost more) Also, there is an economic motivation to reduce the amount of key information that must be distributed (since a very secure and expensive method of communications must be used) Balanced against the cost
of implementing cryptographic algorithms and distributing key information
is the amount of money the cryptanalyst would be willing to pay to break the system For most applications, it is the cryptographer’s aim to develop a low-cost system with the property that it would cost the cryptanalyst much more to read messages than he would be willing to pay For a few applications,
a “provably secure” cryptosystem may be required: one for which it can be ensured that the cryptanalyst can never read messages no matter what he is willing to spend (The very high stakes in some applications of cryptology naturally imply that very large amounts of money are used for cryptanalysis.)
In algorithm design, we try to keep track of costs to help us choose the best algorithms; in cryptology, costs play a central role in the design process
Simple Methods
Among the simplest (and among the oldest) methods for encryption is the Caesar cipher: if a letter in the plaintext is the Nth letter in the alphabet,
replace it by the (N + K)th letter in the alphabet, where K is some fixed integer (Caesar used K = 3) For example, the table below shows how a message is encrypted using this method with K = 1:
Trang 6298 CHAPTER 23
Plaintext: A T T A C K A T D A W N Ciphertext: BUUBDLABUAEB X 0 This method is weak because the cryptanalyst has only to guess the value
of K: by trying each of the 26 choices, he can be sure that he will read the
message
A far better method is to use a general table to define the substitution
to be made: for each letter in the plaintext, the table tells which letter to put
in the ciphertext For example, if the table gives the correspondence
ABCDEFGHI JKLMNOPQRSTUVWXYZ THE QUICKBROWNFXJMPDVRLAZYG
then the message is encrypted as follows:
Plaintext: A T T A C K A T D A W N Ciphertext: HWH OTHVTQHAF This is much more powerful than the simple Caesar cipher because the crypt-analyst would have to try many more (about 27! > 1028) tables to be sure
of reading the message However, “simple substitution” ciphers like this are easy to break because of letter frequencies inherent in the language For ex-ample, since E is the most frequent letter in English text, the cryptanalyst could get good start on reading the message by looking for the most frequent letter in the ciphertext and assuming that it is to be replaced by E While this might not be the right choice, he certainly is better off than if he had to try all 26 letters He can do even better by looking at two-letter combinations (“digrams”): certain digrams (such as QJ) never occur in English text while others (such as ER) are very common By examining frequencies of letters and combinations of letters, a cryptanalyst can very easily break a simple substitution cipher
One way to make this type of attack more difficult is to use more than one table A simple example of this is an extension of the Caesar cipher called the Vigenere cipher: a small repeated key is used to determine the value of K
for each letter At each step, the key letter index is added to the plaintext letter index to determine the ciphertext letter index Our sample plaintext, with the key ABC, is encrypted as follows:
Plaintext: A T T A C K A T D A W N Ciphertext: BVWBENACWAFDX P
Trang 7For example, the last letter of the ciphertext is P, the 16th letter of the alphabet, because the corresponding plaintext letter is N (the 14th letter), and the corresponding key letter is B (the 2nd letter)
The Vigenere cipher can obviously be made more complicated by using different general tables for each letter of the plaintext (rather than simple offsets) Also, it is obvious that the longer the key, the better In fact, if the key is as long as the plaintext, we have the V&am cipher, more commonly
called the one-time pad This is the only provably secure cryptosystem known,
and it is reportedly used for the Washington-Moscow hotline and other vital applications Since each key letter is used only once, the cryptanalyst can
do no better than try every possible key letter for every message position,
an obviously hopeless situation since this is as difficult as trying all possible messages However, using each key letter only once obviously leads to a severe key distribution problem, and the one-time pad is only useful for relatively short messages which are to be sent infrequently
If the message and key are encoded in binary, a more common scheme for position-by-position encryption is to use the “exclusive-or” function: to encrypt the plaintext, “exclusive-or” it (bit by bit) with the key An attractive feature of this method is that decryption is the same operation as encryption: the ciphertext is the exclusive-or of the plaintext and the key, but doing another exclusive-or of the ciphertext and the key returns the plaintext Notice that the exclusive-or of the ciphertext and the plaintext is the key This seems surprising at first, but actually many cryptographic systems have the property that the cryptanalyst can discover the key if he knows the plaintext
Encryption/Decryption Machines
Many cryptographic applications (for example, voice systems for military communications) involve the transmission of large amounts of data, and this makes the one-time pad infeasible What is needed is an approximation to the one-time pad in which a large amount of “pseudo-key” can be generated from a small amount of true key to be distributed
The usual setup in such situations is as follows: an encryption machine
is fed some cryptovariables (true key) by the sender, which it uses to generate
a long stream of key bits (pseudo-key) The exclusive-or of these bits and the plaintext forms the ciphertext The receiver, having a similar machine and the same cryptovariables, uses them to generate the same key stream to exclusive-or against the ciphertext and to retrieve the plaintext
Key generation in this context is obviously very much like random number generation, and our random number generation methods are appropriate for key generation (the cryptovariables are the initial seeds of the random number
Trang 8300 CHAPTER 23
generator) In fact, the linear feedback shift registers that we discussed in Chapter 3 were first developed for use in encryption/decryption machines such as described here However, key generators have to be somewhat more complicated than random number generators, because there are easy ways to attack simple linear feedback shift registers The problem is that it might be easy for the cryptanalyst to get some plaintext (for example, silence in a voice system), and therefore some key If the cryptanalyst can get enough key that
he has the entire contents of the shift register, then he can get all the key from that point on
Cryptographers have several ways to avoid such problems One way is to make the feedback function itself a cryptovariable It is usually assumed that the cryptanalyst knows everything about the structure of the machine (maybe
he stole one) except the cryptovariables, but if some of the cryptovariables are used to “configure” the machine, he may have difficulty finding their values Another method commonly used to confuse the cryptanalyst is the product
cipher, where two different machines are combined to produce a complicated
key stream (or to drive each other) Another method is nonlinear substitution; here the translation between plaintext and ciphertext is done in large chunks, not bit-by-bit The general problem with such complex methods is that they can be too complicated for even the cryptographer to understand and that there always is the possibility that things may degenerate badly for some choices of the cryptovariables
Public-Key Cryptosystems
In commercial applications such as electronic funds transfer and (real) com-puter mail, the key distribution problem is even more onerous than in the traditional applications of cryptography The prospect of providing long keys (which must be changed often) to every citizen, while still maintain-ing both security and cost-effectiveness, certainly inhibits the development of such systems Methods have recently been developed, however, which promise
to eliminate the key distribution problem completely Such systems, called
public-key cryptosystems, are likely to come into widespread use in the near
future One of the most prominent of these systems is based on some of the arithmetic algorithms that we have been studying, so we will take a close look
at how it works
The idea in public-key cryptosystems is to use a “phone book” of encryp-tion keys Everyone’s encrypencryp-tion key (denoted by P) is public knowledge: a person’s key could be listed, for example, next to his number in the telephone book Everyone also has a secret key used for decryption; this secret key (denoted by S) is not known to anyone else To transmit a message M, the sender looks up the receiver’s public key, uses it to encrypt the message, and then transmits the message We’ll denote the encrypted message (ciphertext)
Trang 9by C=P(M) The receiver uses his private decryption key to decrypt and read the message For this system to work we must have at least the following properties:
(4 S(P(M))=M for every message M.
(ii) All (S,P) pairs are distinct
(iii) Deriving S from P is as hard as reading M
(iv) Both S and P are easy to compute
The first of these is a fundamental cryptographic property, the second two provide the security, and the fourth makes the system feasible for use This general scheme was outlined by W Diffie and M Hellman in 1976, but they had no method which satisfied all of these properties Such a method was discovered soon afterwards by R Rivest, A Shamir, and L Adleman Their scheme, which has come to be known as the RSA public-key cryptosystem, is based on arithmetic algorithms performed on very large
integers The encryption key P is the integer pair (N,p) and the decryption key S is the integer pair (N,s), where s is kept secret These numbers are intended to be very large (typically, N might be 200 digits and p and s might
be 100 digits) The encryption and decryption methods are then simple: first the message is broken up into numbers less than N (for example, by taking
lg N bits at a time from the binary string corresponding to the character encoding of the message) Then these numbers are independently raised to a power modulo N: to encrypt a (piece of a) message M, compute C = P(M) = MPmod N, and to decrypt a ciphertext C, compute M = S(C) = C”mod
N This computation can be quickly and easily performed by modifying the elementary exponentiation algorithm that we studied in Chapter 4 to take the remainder when divided by N after each multiplication (No more than
2 log N such operations are required for each piece of the message, so the tot,al number of operations (on 100 digit numbers!) required is linear in the number
of bits in the message.)
Property (iv) above is therefore satisfied, and property (ii) can be easily enforced We still must make sure that the cryptovariables N, p, and s can be chosen so as to satisfy properties (i) and (iii) To be convinced of these requires
an exposition of number theory which is beyond the scope of this book, but
we can outline the main ideas First, it is necessary to generate three large (approximately loo-digit) “random” prime numbers: the largest will be s and we’ll call the other two x and y Then N is chosen to be the product of x and y, and p is chosen so that ps mod (x - l)(y - 1) = 1 It is possible to prove that, with N, p, and s chosen in this way, we have Mps mod N = M for all
messages M.
More specifically, each large prime can be generated by generating a large random number, then testing successive numbers starting at that point until
Trang 10302 CHAPTER 23
a prime is found One simple method performs a calculation on a random number that, with probability l/2, will “prove” that the number to be tested
is not prime (A number which is not prime will survive 20 applications of this test less than one time out of a million, 30 applications less than 1 time out of a billion.) The last step is to compute p: it turns out that a variant of Euclid’s algorithm (see Chapter 1) is just what is needed
Furthermore, s seems to be difficult to compute from knowledge of p (and N), though no one has been able to prove that to be the case Apparently, finding p from s requires knowledge of x and y, and apparently it is necessary
to factor N to calculate x and y But factoring N is thought to be very difficult: the best factoring algorithms known would take millions of years to factor a 200-digit number, using current technology
An attractive feature of the RSA system is that the complicated com-putations involving N, p, and s are performed only once for each user who subscribes to the system, which the much more frequent operations of encryp-tion and decrypencryp-tion involve only breaking up the message and applying the simple exponentiation procedure This computational simplicity, combined with all the convenience features provided by public-key cryptosystems, make this system quite attractive for secure communications, especially on computer systems and networks
The RSA method has its drawbacks: the exponentiation procedure is ac-tually expensive by cryptographic standards, and, worse, there is the linger-ing possibility that it might be possible to read messages encrypted uslinger-ing the method This is true with many cryptosystems: a cryptographic method must withstand serious cryptanalytic attacks before it can be used with confidence Several other methods have been suggested for implementing public-key cryptosystems Some of the most interesting are linked to an important class
of problems which are generally thought to be very hard (though this is not known for sure), which we’ll discuss in Chapter 40 These cryptosystems have the interesting property that a successful attack could provide insight on how to solve some well-known difficult unsolved problems (as with factoring for the RSA method) This link between cryptology and fundamental topics
in computer science research, along with the potential for widespread use of public-key cryptography, have made this an active area of current research
r l