
21 Parsing

Several fundamental algorithms have been developed to recognize legal computer programs and to decompose their structure into a form suitable for further processing. This operation, called parsing, has application beyond computer science, since it is directly related to the study of the structure of language in general. For example, parsing plays an important role in systems which try to "understand" natural (human) languages and in systems for translating from one language to another. One particular case of interest is translating from a "high-level" computer language like Pascal (suitable for human use) to a "low-level" assembly or machine language (suitable for machine execution). A program for doing such a translation is called a compiler.

Two general approaches are used for parsing. Top-down methods look for a legal program by first looking for parts of a legal program, then looking for parts of parts, etc. until the pieces are small enough to match the input directly. Bottom-up methods put pieces of the input together in a structured way, making bigger and bigger pieces until a legal program is constructed.

In general, top-down methods are recursive, bottom-up methods are iterative; top-down methods are thought to be easier to implement, bottom-up methods are thought to be more efficient.

A full treatment of the issues involved in parser and compiler construction would clearly be beyond the scope of this book. However, by building a simple "compiler" to complete the pattern-matching algorithm of the previous chapter, we will be able to consider some of the fundamental concepts involved. First we'll construct a top-down parser for a simple language for describing regular expressions. Then we'll modify the parser to make a program which translates regular expressions into pattern-matching machines for use by the match procedure of the previous chapter.

Our intent in this chapter is to give some feeling for the basic principles of parsing and compiling while at the same time developing a useful pattern-matching algorithm. Certainly we cannot treat the issues involved at the level of depth that they deserve. The reader should be warned that subtle difficulties are likely to arise in applying the same approach to similar problems, and advised that compiler construction is a quite well-developed field with a variety of advanced methods available for serious applications.

Context-Free Grammars

Before we can write a program to determine whether a program written in a given language is legal, we need a description of exactly what constitutes a legal program. This description is called a grammar: to appreciate the terminology, think of the language as English and read "sentence" for "program" in the previous sentence (except for the first occurrence!). Programming languages are often described by a particular type of grammar called a context-free grammar. For example, the context-free grammar which defines the set of all legal regular expressions (as described in the previous chapter) is given below.

    (expression) ::= (term) | (term) + (expression)
    (term) ::= (factor) | (factor)(term)
    (factor) ::= ((expression)) | v | (factor)*

This grammar describes regular expressions like those that we used in the last chapter, such as (1+01)*(0+1) or (A*B+AC)D. Each line in the grammar is called a production or replacement rule. The productions consist of terminal symbols (, ), + and * which are the symbols used in the language being described ("v," a special symbol, stands for any letter or digit); nonterminal symbols (expression), (term), and (factor) which are internal to the grammar; and metasymbols ::= and | which are used to describe the meaning of the productions. The ::= symbol, which may be read "is a," defines the left-hand side of the production in terms of the right-hand side; and the | symbol, which may be read as "or," indicates alternative choices. The various productions, though expressed in this concise symbolic notation, correspond in a simple way to an intuitive description of the grammar. For example, the second production in the example grammar might be read "a (term) is a (factor) or a (factor) followed by a (term)." One nonterminal symbol, in this case (expression), is distinguished in the sense that a string of terminal symbols is in the language described by the grammar if and only if there is some way to use the productions to derive that string from the distinguished nonterminal by replacing (in any number of steps) a nonterminal symbol by any of the "or" clauses on the right-hand side of a production for that nonterminal symbol.
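For example, the string v+v (a letter, a plus sign, and another letter) can be derived from the distinguished nonterminal as follows, replacing the leftmost nonterminal at each step:

    (expression) -> (term) + (expression) -> (factor) + (expression)
      -> v + (expression) -> v + (term) -> v + (factor) -> v + v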


One natural way to describe the result of this derivation process is called a parse tree: a diagram of the complete grammatical structure of the string being parsed. For example, the following parse tree shows that the string (A*B+AC)D is in the language described by the above grammar.

The circled internal nodes labeled E, F, and T represent (expression), (factor), and (term), respectively. Parse trees like this are sometimes used for English, to break down a "sentence" into "subject," "verb," "object," etc.

The main function of a parser is to accept strings which can be so derived and reject those that cannot, by attempting to construct a parse tree for any given string. That is, the parser can recognize whether a string is in the language described by the grammar by determining whether or not there exists a parse tree for the string. Top-down parsers do so by building the tree starting with the distinguished nonterminal at the top, working down towards the string to be recognized at the bottom; bottom-up parsers do this by starting with the string at the bottom, working backwards up towards the distinguished nonterminal at the top.

As we'll see, if the strings being recognized also have meanings implying further processing, then the parser can convert them into an internal representation which can facilitate such processing.

Another example of a context-free grammar may be found in the appendix of the Pascal User Manual and Report: it describes legal Pascal programs. The principles considered in this section for recognizing and using legal expressions apply directly to the complex job of compiling and executing Pascal programs. For example, the following grammar describes a very small subset of Pascal: arithmetic expressions involving addition and multiplication.

    (expression) ::= (term) | (term) + (expression)
    (term) ::= (factor) | (factor) * (term)
    (factor) ::= ((expression)) | v

Again, v is a special symbol which stands for any letter, but in this grammar the letters are likely to represent variables with numeric values. Examples of legal strings for this grammar are A+(B*C) and (A+B*C)*D*(A+(B+C)).

As we have defined things, some strings are perfectly legal both as arithmetic expressions and as regular expressions. For example, A*(B+C) might mean "add B to C and multiply the result by A" or "take any number of A's followed by either B or C." This points out the obvious fact that checking whether a string is legally formed is one thing, but understanding what it means is quite another. We'll return to this issue after we've seen how to parse a string to check whether or not it is described by some grammar. Each regular expression is itself an example of a context-free grammar: any language which can be described by a regular expression can also be described by a context-free grammar. The converse is not true: for example, the concept of "balancing" parentheses can't be captured with regular expressions. Other types of grammars can describe languages which can't be described by context-free grammars. For example, context-sensitive grammars are the same as those above except that the left-hand sides of productions need not be single nonterminals. The differences between classes of languages and a hierarchy of grammars for describing them have been very carefully worked out and form a beautiful theory which lies at the heart of computer science.

Top-Down Parsing

One parsing method uses recursion to recognize strings from the language described exactly as specified by the grammar. Put simply, the grammar is such a complete specification of the language that it can be turned directly into a program!

Each production corresponds to a procedure with the name of the nonterminal on the left-hand side. Nonterminals on the right-hand side of a production correspond to (possibly recursive) procedure calls; terminals correspond to scanning the input string. For example, the following procedure is part of a top-down parser for our regular expression grammar:
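(A minimal sketch consistent with the grammar and the discussion below, assuming the global array p and index j described next:)

    procedure expression;
    begin
    term;                            { an (expression) begins with a (term) }
    if p[j]='+' then
      begin j:=j+1; expression end   { and may continue with + (expression) }
    end;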


An array p contains the regular expression being parsed, with an index j pointing to the character currently being examined. To parse a given regular expression, we put it in p[1..M] (with a sentinel character in p[M+1] which is not used in the grammar), set j to 1, and call expression. If this results in j being set to M+1, then the regular expression is in the language described by the grammar. Otherwise, we'll see below how various error conditions are handled.

The first thing that expression does is call term, which has a slightly more complicated implementation:
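(Again a sketch; letter, a test for whether a character is one of the input letters represented by v, is an assumed helper:)

    procedure term;
    begin
    factor;
    { lookahead: another (term) can follow only if a ( or a letter is next }
    if (p[j]='(') or letter(p[j]) then term
    end;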

A direct translation from the grammar would simply have term call factor and then term. This obviously won't work because it leaves no way to exit from term: this program would go into an infinite recursive loop if called. (Such loops have particularly unpleasant effects in many systems.) The implementation above gets around this by first checking the input to decide whether term should be called. The first thing that term does is call factor, which is the only one of the procedures that could detect a mismatch in the input. From the grammar, we know that when factor is called, the current input character must be either a "(" or an input letter (represented by v). This process of checking the next character (without incrementing j) to decide what to do is called lookahead. For some grammars, this is not necessary; for others even more lookahead is required.

Now, the implementation of factor follows directly from the grammar. If the input character being scanned is not a "(" or an input letter, a procedure error is called to handle the error condition:
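(A sketch along the same lines, with the same assumed letter test:)

    procedure factor;
    begin
    if p[j]='(' then
      begin
      j:=j+1; expression;
      if p[j]=')' then j:=j+1 else error
      end
    else if letter(p[j]) then j:=j+1
    else error;
    if p[j]='*' then j:=j+1
    end;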


Another error condition occurs when a “)” is missing.

These procedures are obviously recursive; in fact they are so intertwined that they can't be compiled in Pascal without using the forward construct to get around the rule that a procedure can't be used without first being declared.

The parse tree for a given string gives the recursive call structure during parsing. The reader may wish to refer to the tree above and trace through the operation of the above three procedures when p contains (A*B+AC)D and expression is called with j=1. This makes the origin of the "top-down" name obvious. Such parsers are also often called recursive descent parsers because they move down the parse tree recursively.

The top-down approach won't work for all possible context-free grammars. For example, if we had the production (expression) ::= v | (expression) + (term), then the procedure for (expression) would begin with a recursive call to itself before examining any input, followed by

    if p[j]<>'+' then error
      else begin j:=j+1; term end


Such a procedure would loop forever. In term, we used lookahead to avoid such a loop; in this case the proper way to get around the problem is to switch the grammar to say (term) + (expression). The occurrence of a nonterminal as the first thing on the right-hand side of a replacement rule for itself is called left recursion. Actually, the problem is more subtle, because the left recursion can arise indirectly: for example, if we were to have the productions (expression) ::= (term) and (term) ::= v | (expression) + (term). Recursive descent parsers won't work for such grammars: they have to be transformed to equivalent grammars without left recursion, or some other parsing method has to be used. In general, there is an intimate and very widely studied connection between parsers and the grammars they recognize. The choice of a parsing technique is often dictated by the characteristics of the grammar to be parsed.

The recursive calls at the ends of expression and term are tail calls which can be replaced by simple loops; these loops can be incorporated together and combined with factor to produce a single procedure with one true recursive call (the call to expression within factor).

This view leads directly to a quite simple way to check whether regular expressions are legal. Once all the procedure calls are removed, we see that each terminal symbol is simply scanned as it is encountered. The only real processing done is to check whether there is a right parenthesis to match each left parenthesis and whether each "+" is followed by either a letter or a "(". That is, checking whether a regular expression is legal is essentially equivalent to checking for balanced parentheses. This can be simply implemented by keeping a counter, initialized to 0, which is incremented when a left parenthesis is encountered, decremented when a right parenthesis is encountered. If the counter is zero when the end of the expression is reached, and each "+" of the expression is followed by either a letter or a "(", then the expression was legal.
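A minimal sketch of this check (the function name legal and the letter test are assumptions; the sentinel in p[M+1] keeps the lookahead at p[i+1] in bounds):

    function legal: boolean;
    var i, c: integer; ok: boolean;
    begin
    c:=0; ok:=true;
    for i:=1 to M do
      begin
      if p[i]='(' then c:=c+1;
      if p[i]=')' then c:=c-1;
      if c<0 then ok:=false;               { a ) with no matching ( }
      if (p[i]='+') and not ((p[i+1]='(') or letter(p[i+1])) then
        ok:=false                          { + must be followed by ( or a letter }
      end;
    legal:=ok and (c=0)
    end;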

Of course, there is more to parsing than simply checking whether the input string is legal: the main goal is to build the parse tree (even if in an implicit way, as in the top-down parser) for other processing. It turns out to be possible to do this with programs with the same essential structure as the parenthesis checker described in the previous paragraph. One type of parser which works in this way is the so-called shift-reduce parser. The idea is to maintain a pushdown stack which holds terminal and nonterminal symbols. Each step in the parse is either a shift step, in which the next input character is simply pushed onto the stack, or a reduce step, in which the top characters on the stack are matched to the right-hand side of some production in the grammar and "reduced to" (replaced by) the nonterminal on the left side of that production. Eventually all the input characters get shifted onto the stack, and eventually the stack gets reduced to a single nonterminal symbol.

The main difficulty in building a shift-reduce parser is deciding when to shift and when to reduce. This can be a complicated decision, depending on the grammar. Various types of shift-reduce parsers have been studied in great detail, an extensive literature has been developed on them, and they are quite often preferred over recursive descent parsers because they tend to be slightly more efficient and significantly more flexible. Certainly we don't have space here to do justice to this field, and we'll forgo even the details of an implementation for our example.

Compilers

A compiler may be thought of as a program which translates from one language to another. For example, a Pascal compiler translates programs from the Pascal language into the machine language of some particular computer. We'll illustrate one way that this might be done by continuing with our regular-expression pattern-matching example, where we wish to translate from the language of regular expressions to a "language" for pattern-matching machines: the ch, next1, and next2 arrays of the match program of the previous chapter.

Essentially, the translation process is "one-to-one": for each character in the pattern (with the exception of parentheses) we want to produce a state for the pattern-matching machine (an entry in each of the arrays). The trick is to keep track of the information necessary to fill in the next1 and next2 arrays. To do so, we'll convert each of the procedures in our recursive descent parser into functions which create pattern-matching machines. Each function will add new states as necessary onto the end of the ch, next1, and next2 arrays, and return the index of the initial state of the machine created (the final state will always be the last entry in the arrays).

For example, the function given below for the (expression) production creates the "or" states for the pattern-matching machine.
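(A sketch, using the setstate procedure and global state index described just below; note that the bare name expression on the right-hand side inside the body is a recursive call:)

    function expression: integer;
    var t1, t2: integer;
    begin
    t1:=term; expression:=t1;
    if p[j]='+' then
      begin
      j:=j+1; state:=state+1;
      t2:=state; expression:=t2;
      state:=state+1;
      { the recursive call is made first, so that the no-op state }
      { t2-1 can then be pointed at the final state of the result }
      setstate(t2, ' ', expression, t1);
      setstate(t2-1, ' ', state, state)
      end
    end;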


This function uses a procedure setstate which simply sets the ch, next1, and next2 array entries indexed by the first argument to the values given in the second, third, and fourth arguments, respectively. The index state keeps track of the "current" state in the machine being built. Each time a new state is created, state is simply incremented. Thus, the state indices for the machine corresponding to a particular procedure call range between the value of state on entry and the value of state on exit. The final state index is the value of state on exit. (We don't actually "create" the final state by incrementing state before exiting, since this makes it easy to "merge" the final state with later initial states, as we'll see below.)

With this convention, it is easy to check (beware of the recursive call!) that the above program implements the rule for composing two machines with the "or" operation as diagrammed in the previous chapter. First the machine for the first part of the expression is built (recursively), then two new null states are added and the second part of the expression built. The first null state (with index t2-1) is the final state of the machine for the first part of the expression, which is made into a "no-op" state to skip to the final state for the machine for the second part of the expression, as required. The second null state (with index t2) is the initial state, so its index is the return value for expression and its next1 and next2 entries are made to point to the initial states of the two expressions. Note carefully that these are constructed in the opposite order than one might expect, because the value of state for the no-op state is not known until the recursive call to expression has been made. The function for (term) first builds the machine for a (factor) then, if necessary, merges the final state of that machine with the initial state of the machine for another (term). This is easier done than said, since state is the final state index of the call to factor. A call to term without incrementing state does the trick:
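(A sketch; the initial state of the second machine, assigned to t, is simply discarded, since not incrementing state merges it with factor's final state:)

    function term: integer;
    var t: integer;
    begin
    term:=factor;
    if (p[j]='(') or letter(p[j]) then t:=term
    end;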


The function for (factor) uses similar techniques to handle its three cases: a parenthesis calls for a recursive call on expression; a v calls for simple concatenation of a new state; and a * calls for operations similar to those in expression, according to the closure diagram from the previous section:
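(A sketch; in the closure case, redirecting next1[t1-1] to the new null state, so that the machine can be entered at the null state, is an assumption about the layout conventions:)

    function factor: integer;
    var t1, t2: integer;
    begin
    t1:=state;
    if p[j]='(' then
      begin
      j:=j+1; t2:=expression;
      if p[j]=')' then j:=j+1 else error
      end
    else if letter(p[j]) then
      begin
      setstate(state, p[j], state+1, state+1);   { one state per letter }
      t2:=state; j:=j+1; state:=state+1
      end
    else error;
    if p[j]='*' then
      begin
      setstate(state, ' ', state+1, t2);         { null state for the closure }
      factor:=state; next1[t1-1]:=state;
      j:=j+1; state:=state+1
      end
    else factor:=t2
    end;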


The final step in the development of a general regular-expression pattern-matching algorithm is to put these procedures together with the match procedure, as follows:

    j:=1; state:=1;
    next1[0]:=expression;
    setstate(state, ' ', 0, 0);
    for i:=1 to N-1 do
      if match(i)>=i then writeln(i);

This program will print out all character positions in a text string a[1..N] where a pattern p[1..M] leads to a match.

Compiler-Compilers

The program for general regular expression pattern matching that we have developed in this and the previous chapter is efficient and quite useful. A version of this program with a few added capabilities (for handling "don't-care" characters and other amenities) is likely to be among the most heavily used utilities on many computer systems.

It is interesting (some might say confusing) to reflect on this algorithm from a more philosophical point of view. In this chapter, we have considered parsers for unraveling the structure of regular expressions, based on a formal description of regular expressions using a context-free grammar. Put another way, we used the context-free grammar to specify a particular "pattern": sequences of characters with legally balanced parentheses. The parser then checks to see if the pattern occurs in the input (but only considers a match legal if it covers the entire input string). Thus parsers, which check that an input string is in the set of strings defined by some context-free grammar, and pattern matchers, which check that an input string is in the set of strings defined by some regular expression, are essentially performing the same function! The principal difference is that context-free grammars are capable of describing a much wider class of strings. For example, the set of all regular expressions can't be described with regular expressions.

Another difference in the way we've implemented the programs is that the context-free grammar is "built in" to the parser, while the match procedure is "table-driven": the same program works for all regular expressions, once they have been translated into the proper format. It turns out to be possible to build parsers which are table-driven in the same way, so that the same program can be used to parse all languages which can be described by context-free grammars. A parser generator is a program which takes a grammar as input and produces a parser for the language described by that grammar as output. This can be carried one step further: it is possible to build compilers which are table-driven in terms of both the input and the output languages. A compiler-compiler is a program which takes two grammars (and some formal specification of the relationships between them) as input and produces a compiler which translates strings from one language to the other as output.

Parser generators and compiler-compilers are available for general use in many computing environments, and are quite useful tools which can be used to produce efficient and reliable parsers and compilers with a relatively small amount of effort. On the other hand, top-down recursive descent parsers of the type considered here are quite serviceable for simple grammars which arise in many applications. Thus, as with many of the algorithms we have considered, we have a straightforward method which can be used for applications where a great deal of implementation effort might not be justified, and several advanced methods which can lead to significant performance improvements for large-scale applications. Of course, in this case, this is significantly understating the point: we've only scratched the surface of this extensively researched field.

Exercises

Give the parse tree for the regular expression ((A+B)+(C+D)*)*.

Extend the arithmetic expression grammar to include exponentiation, div and mod.

Give a context-free grammar to describe all strings with no more than two consecutive 1's.

How many procedure calls are used by the recursive descent parser to recognize a regular expression, in terms of the number of concatenation, or, and closure operations and the number of parentheses?

Give the ch, next1 and next2 arrays that result from building the pattern-matching machine for the pattern ((A+B)+(C+D)*)*.

Modify the regular expression grammar to handle the "not" function and "don't-care" characters.

Write a compiler for simple arithmetic expressions described by the grammar in the text. It should produce a list of "instructions" for a machine capable of three operations: push the value of a variable onto a stack; add the top two values on the stack, removing them from the stack, then putting the result there; and multiply the top two values on the stack, in the same way.

22 File Compression

Among the methods we will examine to save space are "coding" methods from information theory, which were developed to minimize the amount of information necessary in communications systems and therefore were originally intended to save time (not space).

In general, most files stored on computer systems have a great deal of redundancy. The methods we will examine save space by taking advantage of the fact that most files have a relatively low "information content." File compression techniques are often used for text files (in which certain characters appear much more often than others), "raster" files for encoding pictures (which can have large homogeneous areas), and files for the digital representation of sound and other analog signals (which can have large repeated patterns).

We'll look at an elementary algorithm for the problem (which is still quite useful) and an advanced "optimal" method. The amount of space saved by these methods will vary depending on characteristics of the file. Savings of 20% to 50% are typical for text files, and savings of 50% to 90% might be achieved for binary files. For some types of files, for example files consisting of random bits, little can be gained. In fact, it is interesting to note that any general-purpose compression method must make some files longer (otherwise we could continually apply the method to produce an arbitrarily small file).

On one hand, one might argue that file compression techniques are less important than they once were because the cost of computer storage devices has dropped dramatically and far more storage is available to the typical user than in the past. On the other hand, it can be argued that file compression techniques are more important than ever because, since so much storage is in use, the savings they make possible are greater. Compression techniques are also appropriate for storage devices which allow extremely high-speed access and are by nature relatively expensive (and therefore small).

Run-Length Encoding

The simplest type of redundancy in a file is long runs of repeated characters, which can be stored compactly by recording each run as a count together with a single copy of the repeated character. Consider, for example, the following string:

    AAAABBBAABBBBBCCCCCCCCDABCBAAABBBBCCCD

Several questions arise in implementing this idea (for example, how should the counts be represented, and how are they to be distinguished from the characters being encoded?). We'll look at one particular method, then discuss other options.

If we know that our string contains just letters, then we can encode counts simply by interspersing digits with the letters. Thus our string might be encoded as follows:

    4A3BAA5B8CDABCB3A4B3CD

Here "4A" means "four A's," and so forth. Note that it is not worthwhile to encode runs of length one or two, since two characters are needed for the encoding.

For binary files (containing solely 0's and 1's), a refined version of this method is typically used to yield dramatic savings. The idea is simply to store the run lengths, taking advantage of the fact that the runs alternate between 0 and 1 to avoid storing the 0's and 1's themselves. (This assumes that there are few short runs, but no run-length encoding method will work very well unless most of the runs are long.) For example, at the left in the figure below is a "raster" representation of the letter "q" lying on its side, which is representative of the type of information that might have to be processed by a text formatting system (such as the one used to print this book); at the right is a list of numbers which might be used to store the letter in a compressed form.


Run-length encoding requires a separate representation for the file to be encoded and the encoded version of the file, so that it can't work for all files. This can be quite inconvenient: for example, the character file compression method suggested above won't work for character strings that contain digits. If other characters are used to encode the counts, it won't work for strings that contain those characters. To illustrate a way to encode any string from a fixed alphabet of characters using only characters from that alphabet, we'll assume that we only have the 26 letters of the alphabet (and spaces) to work with.

How can we make some letters represent digits and others represent parts of the string to be encoded? One solution is to use some character which is likely to appear rarely in the text as a so-called escape character. Each appearance of that character signals that the next two letters form a (count, character) pair, with counts represented by having the ith letter of the alphabet represent the number i. Thus our example string would be represented as follows with Q as the escape character:


QDABBBAAQEBQHCDABCBAAAQDBCCCD

The combination of the escape character, the count, and the one copy of the repeated character is called an escape sequence. Note that it's not worthwhile to encode runs less than four characters long, since at least three characters are required to encode any run.

But what if the escape character itself happens to occur in the input? We can't afford to simply ignore this possibility, because it might be difficult to ensure that any particular character can't occur. (For example, someone might try to encode a string that has already been encoded.) One solution to this problem is to use an escape sequence with a count of zero to represent the escape character. Thus, in our example, the space character could represent zero, and the escape sequence "Q(space)" would be used to represent any occurrence of Q in the input. It is interesting to note that files which contain Q are the only files which are made longer by this compression method. If a file which has already been compressed is compressed again, it grows by at least the number of characters equal to the number of escape sequences used.

Very long runs can be encoded with multiple escape sequences. For example, a run of 51 A's would be encoded as QZAQYA using the conventions above. If many very long runs are expected, it would be worthwhile to reserve more than one character to encode the counts.
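A sketch of the encoding loop under these conventions (the input a[1..M] and output b[1..k] are assumed names; handling of Q in the input via "Q(space)" and of runs longer than 26 via multiple escape sequences is omitted):

    i:=1; k:=0;
    while i<=M do
      begin
      j:=i;
      while (j<M) and (a[j+1]=a[i]) do j:=j+1;   { find the end of the run }
      if j-i+1>=4 then
        begin                                    { escape, count, character }
        k:=k+1; b[k]:='Q';
        k:=k+1; b[k]:=chr(ord('A')+j-i);         { ith letter represents i }
        k:=k+1; b[k]:=a[i]
        end
      else
        for t:=i to j do
          begin k:=k+1; b[k]:=a[t] end;          { short runs copied as is }
      i:=j+1
      end;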

In practice, it is advisable to make both the compression and expansion programs somewhat sensitive to errors. This can be done by including a small amount of redundancy in the compressed file so that the expansion program can be tolerant of an accidental minor change to the file between compression and expansion. For example, it probably is worthwhile to put "end-of-line" characters in the compressed version of the letter "q" above, so that the expansion program can resynchronize itself in case of an error.

Run-length encoding is not particularly effective for text files because the only character likely to be repeated is the blank, and there are simpler ways to encode repeated blanks. (It was used to great advantage in the past to compress text files created by reading in punched-card decks, which necessarily contained many blanks.) In modern systems, repeated strings of blanks are never entered, never stored: repeated strings of blanks at the beginning of lines are encoded as "tabs," and blanks at the ends of lines are obviated by the use of "end-of-line" indicators. A run-length encoding implementation like the one above (but modified to handle all representable characters) saves only about 4% when used on the text file for this chapter (and this savings all comes from the letter "q" example!).

Variable-Length Encoding

In this section we'll examine a file compression technique called Huffman encoding which can save a substantial amount of space on text files (and many other kinds of files). The idea is to abandon the way that text files are usually stored: instead of using the usual seven or eight bits for each character, Huffman's method uses only a few bits for characters which are used often, more bits for those which are rarely used.

It will be convenient to examine how the code is used before considering how it is created. Suppose that we wish to encode the string "A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS". In our standard compact binary code, with the five-bit binary representation of i representing the ith letter of the alphabet (0 for blank), every character takes five bits. In a variable-length code, frequently used characters are given short codes, so that the total number of bits used for the message is minimized.

The first step is to count the frequency of each character within the message to be encoded. The following code fills an array count[0..26] with the frequency counts for a message in a character array a[1..M]. (This program uses the index procedure described in Chapter 19 to keep the frequency count for the ith letter of the alphabet in count[i], with count[0] used for blanks.)

    for i:=0 to 26 do count[i]:=0;
    for i:=1 to M do
      count[index(a[i])]:=count[index(a[i])]+1;

For our example string, the count table produced is

    k:        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    count[k]: 11 3  3  1  2  5  1  2  0  6  0  0  2  4  5  3  1  0  2  4  3  2  0  0  0  0  0

that is, there are eleven blanks, three A's, three B's, one C, and so forth.

To build the coding tree, the frequency counts are first regarded as a forest of one-node trees, and the two nodes with the smallest frequencies are combined under a new node whose frequency is the sum of theirs. (It doesn't matter which nodes are used if there are more than two with the smallest frequency.) Continuing in this way, we build up larger and larger subtrees. The forest of trees after all nodes with frequency 2 have been put in is as follows:

in is as follows:

Next, the nodes with frequency 3 are put together, creating two new nodes of frequency 6, etc. Ultimately, all the nodes are combined together into a single tree:

(The frequency counts for the internal nodes are kept in the same count array, in positions above 26, as part of the construction.) Thus, for example, the 5 in the leftmost external node (the frequency count for N) is stored in count[14], the 6 in the next external node (the frequency count for I) is stored in count[9], and the 11 in the father of these two is stored in count[33], etc.

It turns out that this structural description of the frequencies in the form of a tree is exactly what is needed to create an efficient encoding. Before looking at this encoding, let's look at the code for constructing the tree. The general process involves removing the smallest from a set of unordered elements, so we'll use the pqdownheap procedure from Chapter 11 to build and maintain an indirect heap on the frequency values. Since we're interested in small values first, we'll assume that the sense of the inequalities in pqdownheap has been reversed. One advantage of using indirection is that it is easy to ignore zero frequency counts. The following table shows the heap constructed for our example:

    k:               1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
    heap[k]:         3  7 16 21 12 15  6 20  9  4 13 14  5  2 18 19  1  0
    count[heap[k]]:  1  2  1  2  2  3  1  3  6  2  4  5  5  3  2  4  3 11

Specifically, this heap is built by first initializing the heap array to point to the non-zero frequency counts, then using the pqdownheap procedure from Chapter 11, as follows:

    N:=0;
    for i:=0 to 26 do
      if count[i]<>0 then
        begin N:=N+1; heap[N]:=i end;
    for k:=N downto 1 do pqdownheap(k);

As mentioned above, this assumes that the sense of the inequalities in the pqdownheap code has been reversed.

Now, the use of this procedure to construct the tree as above is straightforward: we take the two smallest elements off the heap, add them and put the result back into the heap. At each step we create one new count, and decrease the size of the heap by one. This process creates N-1 new counts, one for each of the internal nodes of the tree being created, as in the following code:
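(A sketch; x is a temporary, and storing the internal-node counts above index 26 with the sign of dad marking left versus right sons follows the description below, though the exact conventions are assumptions:)

    t:=26;
    repeat
      t:=t+1;                              { index for the new internal node }
      x:=heap[1]; heap[1]:=heap[N]; N:=N-1;
      pqdownheap(1);                       { x is the smallest element }
      count[t]:=count[x]+count[heap[1]];   { combine with the second smallest }
      dad[x]:=-t; dad[heap[1]]:=t;         { sign tells left son from right }
      heap[1]:=t; pqdownheap(1)            { replace the pair by the new node }
    until N=1;
    dad[t]:=0;                             { the root has no father }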

An array dad holds the tree structure: dad[t] gives the index of the father of the node whose weight is in count[t]. The sign of dad[t] indicates whether the node is a left or right son of its father. For example, in the tree above we might have dad[0]=-30, count[30]=21, dad[30]=-28, and count[28]=37.


Now the code can be read directly from this tree. The code for N is 000, the code for I is 001, the code for C is 110100, etc. The following program fragment reconstructs this information from the representation of the coding tree computed during the sifting process. The code is represented by two arrays: code[k] gives the binary representation of the kth letter and len[k] gives the number of bits from code[k] to use in the code. For example, I is the 9th letter and has code 001, so code[9]=1 and len[9]=3.
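A sketch of such a fragment: for each letter with a nonzero count, follow the dad pointers up to the root, accumulating one bit per level (taking a negative dad value to contribute a 1 bit is an assumed convention):

    for k:=0 to 26 do
      if count[k]=0 then
        begin code[k]:=0; len[k]:=0 end
      else
        begin
        i:=0; j:=1; t:=dad[k]; x:=0;
        repeat
          if t<0 then begin x:=x+j; t:=-t end;   { set the bit for this level }
          t:=dad[t]; j:=j+j; i:=i+1              { move up one level }
        until t=0;
        code[k]:=x; len[k]:=i
        end;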


Since the characters are encoded with varying numbers of bits, how do we determine when one character stops and the next begins to decode the message? The answer is to use the radix search trie representation of the code. Starting at the root, proceed down the tree according to the bits in the message: each time an external node is encountered, output the character at that node and restart at the root. But the tree is built at the time we encode the message: this means that we need to save the tree along with the message in order to decode it. Fortunately, this does not present any real difficulty. It is actually necessary only to store the code array, because the radix search trie which results from inserting the entries from that array into an initially empty tree is the decoding tree.

Thus, the storage savings quoted above is not entirely accurate, because the message can't be decoded without the trie, and we must take into account the cost of storing the trie (i.e., the code array) along with the message. Huffman encoding is therefore only effective for long files where the savings in the message is enough to offset the cost, or in situations where the coding trie can be precomputed and used for a large number of messages. For example, a trie based on the frequencies of occurrence of letters in the English language could be used for text documents. For that matter, a trie based on the frequency of occurrence of characters in Pascal programs could be used for encoding programs (for example, ";" is likely to be near the top of such a trie). A Huffman encoding algorithm saves about 23% when run on the text for this chapter.

As before, for truly random files, even this clever encoding scheme won't work because each character will occur approximately the same number of times, which will lead to a fully balanced coding tree and an equal number of bits per letter in the code.


Exercises

Could "QQ" occur somewhere in a file compressed using the method described in the text? Could "QQQ" occur?

Implement compression and expansion procedures for the binary file encoding method described in the text.

The letter "q" given in the text can be processed as a sequence of five-bit characters. Discuss the pros and cons of doing so in order to use a character-based run-length encoding method.

Draw a Huffman coding tree for the string "ABRACADABRA." How many bits does the encoded message require?

What is the Huffman code for a binary file? Give an example showing the maximum number of bits that could be used in a Huffman code for an N-character ternary (three-valued) file.

Suppose that the frequencies of the occurrence of all the characters to be encoded are different. Is the Huffman encoding tree unique?

Huffman coding could be extended in a straightforward way to encode in two-bit characters (using 4-way trees). What would be the main advantage and the main disadvantage of doing so?

What would be the result of breaking up a Huffman-encoded string into five-bit characters and Huffman encoding that string?

Implement a procedure to decode a Huffman-encoded string, given thecode and len arrays.


23 Cryptology

In the previous chapter we looked at methods for encoding strings of characters to save space. Of course, there is another very important reason to encode strings of characters: to keep them secret.

Cryptology, the study of systems for secret communications, consists of two competing fields of study: cryptography, the design of secret communications systems, and cryptanalysis, the study of ways to compromise secret communications systems. The main application of cryptology has been in military and diplomatic communications systems, but other significant applications are becoming apparent. Two principal examples are computer file systems (where each user would prefer to keep his files private) and "electronic funds transfer" systems (where very large amounts of money are involved). A computer user wants to keep his computer files just as private as papers in his file cabinet, and a bank wants electronic funds transfer to be just as secure as funds transfer by armored car.

Except for military applications, we assume that cryptographers are "good guys" and cryptanalysts are "bad guys": our goal is to protect our computer files and our bank accounts from criminals. If this point of view seems somewhat unfriendly, it must be noted (without being over-philosophical) that by using cryptography one is assuming the existence of unfriendliness! Of course, even "good guys" must know something about cryptanalysis, since the very best way to be sure that a system is secure is to try to compromise it yourself. (Also, there are several documented instances of wars being brought to an end, and many lives saved, through successes in cryptanalysis.)

Cryptology has many close connections with computer science and algorithms, especially the arithmetic and string-processing algorithms that we have studied. Indeed, the art (science?) of cryptology has an intimate relationship with computers and computer science that is only beginning to be fully understood. Like algorithms, cryptosystems have been around far longer than computers.
