Since we often will want to access states just by number, the most suitable organization for the machine is to use the array representation. We'll use the three arrays

ch: array [0..Mmax] of char;
next1, next2: array [0..Mmax] of integer;

Here Mmax is the maximum number of states (twice the maximum pattern length). The array indices correspond to state numbers; entry 0 is a null state, and next1[0] is the number of the actual initial state. (Note the special representation used for null states with 0 or 1 exits.) It would be possible to get by with two-thirds this amount of space, since each state really uses only two meaningful pieces of information, but we'll forsake this improvement for the sake of clarity and also because pattern descriptions are not likely to be particularly long.
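As a purely illustrative sketch (this is not the sample machine discussed in the text), a trivial machine that recognizes just the pattern AB could be stored in these arrays as follows, using the conventions adopted below: a blank in ch marks a null state, next1[0] gives the initial state, and state 0 is the final state.

ch[0]:=' '; next1[0]:=1; next2[0]:=1;   { entry 0: null state pointing to initial state 1 }
ch[1]:='A'; next1[1]:=2; next2[1]:=2;   { state 1: match A, then go to state 2 }
ch[2]:='B'; next1[2]:=0; next2[2]:=0;   { state 2: match B, then go to final state 0 }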
We've seen how to build up machines from regular expression pattern descriptions and how such machines might be represented as arrays. However, to write a program to do the translation from a regular expression to the corresponding nondeterministic machine representation automatically is quite another matter. In fact, even writing a program to determine whether a given regular expression is legal is challenging for the uninitiated. In the next chapter, we'll study this operation, called parsing, in much more detail. For the moment, we'll assume that this translation has been done, so that we have available the ch, next1, and next2 arrays representing a particular nondeterministic machine which corresponds to the regular expression pattern description of interest.
Simulating the Machine
The last step in the development of a general regular-expression pattern-matching algorithm is to write a program which somehow simulates the operation of a nondeterministic pattern-matching machine. The idea of writing a program which can "guess" the right answer seems ridiculous. However, in this case it turns out that we can keep track of all possible matches in a systematic way, so that we do eventually encounter the correct one.
One possibility would be to develop a recursive program which mimics the nondeterministic machine (but tries all possibilities rather than guessing the right one). Instead of using this approach, we'll look at a nonrecursive implementation which exposes the basic operating principles of the method by keeping the states under consideration in a rather peculiar data structure called a deque, described in some detail below.
The idea is to keep track of all states that could possibly be encountered while the machine is "looking at" the current input character. Each of these states is processed in turn: null states lead to two (or fewer) states, states for characters which do not match the current input are eliminated, and states for characters which do match the current input lead to new states for use when the machine is looking at the next input character. Thus, we maintain a list of all the states that the nondeterministic machine could possibly be in at a particular point in the text: the problem is to design an appropriate data structure for this list.
Processing null states seems to require a stack, since we are essentially postponing one of two things to be done, just as when we removed the recursion from Quicksort (so the new state should be put at the beginning of the current list, lest it get postponed indefinitely). Processing the other states seems to require a queue, since we don't want to examine states for the next input character until we've finished with the current character (so the new state should be put at the end of the current list). Rather than choosing between these two data structures, we'll use both! Deques ("double-ended queues") combine the features of stacks and queues: a deque is a list to which items can be added at either end. (Actually, we use an "output-restricted deque," since we always remove items from the beginning, not the end: that would be "dealing from the bottom of the deck.")
A crucial property of the machine is that there are no "loops" consisting of just null states: otherwise it could decide nondeterministically to loop forever. It turns out that this implies that the number of states on the deque at any time is less than the number of characters in the pattern description.
The program given below uses a deque to simulate the actions of a nondeterministic pattern-matching machine as described above. While examining a particular character in the input, the nondeterministic machine can be in any one of several possible states: the program keeps track of these in a deque dq. One pointer (head) to the head of the deque is maintained so that items can be inserted or removed at the beginning, and another pointer (tail) to the tail of the deque is maintained so that items can be inserted at the end. If the pattern description has M characters, the deque can be implemented in a "circular" manner in an array of M integers. The contents of the deque are the elements "between" head and tail (inclusive): if head<=tail, the meaning is obvious; if head>tail, we take the elements that would fall between head and tail if the elements of dq were arranged in a circle: dq[head], dq[head+1], ..., dq[M-1], dq[0], dq[1], ..., dq[tail]. This is quite simply implemented by using head:=(head+1) mod M to increment head, and similarly for tail. Similarly, head:=(head+M-1) mod M refers to the element before head in the array: this is the position at which an element should be added to the beginning of the deque.
The main loop of the program removes a state from the deque (by incrementing head mod M and then referring to dq[head]) and performs the action required. If a character is to be matched, the input is checked for the required character: if it is found, the state transition is effected by putting the new state at the end of the deque (so that all states involving the current character are processed before those involving the next one). If the state is null, the two possible states to be simulated are put at the beginning of the deque. The states involving the current input character are kept separated from those involving the next by a marker scan=-1 in the deque: when scan is encountered, the pointer into the input string is advanced. The loop terminates when the end of the input is reached (no match found), state 0 is reached (legal match found), or only one item, the scan marker, is left on the deque (no match found). This leads directly to the following implementation:
function match(j: integer): integer;
  const scan=-1;                 { marker separating current and next input character }
  var head, tail, n1, n2: integer;
      dq: array [0..Mmax] of integer;
  procedure addhead(x: integer);
    begin dq[head]:=x; head:=(head+M-1) mod M end;
  procedure addtail(x: integer);
    begin tail:=(tail+1) mod M; dq[tail]:=x end;
  begin
  head:=1; tail:=0;
  addtail(next1[0]); addtail(scan);    { start with the initial state and the scan marker }
  match:=j-1;
  repeat
    if dq[head]=scan then begin j:=j+1; addtail(scan) end
    else if ch[dq[head]]=a[j] then addtail(next1[dq[head]])   { match: new state at the end }
    else if ch[dq[head]]=' ' then
      begin                                                   { null state: exits at the beginning }
      n1:=next1[dq[head]]; n2:=next2[dq[head]];
      addhead(n1); if n1<>n2 then addhead(n2)
      end;
    head:=(head+1) mod M
  until (j>N) or (dq[head]=0) or (head=tail);
  if dq[head]=0 then match:=j-1;
  end;
This function takes as its argument the position j in the text string a at which it should start trying to match. It returns the index of the last character in the match found (if any; otherwise it returns j-1).
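As a small illustrative sketch (not from the text), match might be driven from a loop that tries every starting position in the text a[1..N]; the variables i and k here are hypothetical, and a, N, and the machine arrays are assumed to be the same globals used by match:

i:=1;
repeat
  k:=match(i);                                    { try for a match starting at position i }
  if k>=i then writeln('match from ', i, ' to ', k);
  i:=i+1
until i>N;

Since match returns i-1 when no match starting at position i is found, the test k>=i singles out the genuine (non-empty) matches.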
The following table shows the contents of the deque each time a state is removed when our sample machine is run with the text string AABD. (For clarity, the details involving head, tail, and the maintenance of the circular deque are suppressed in this table: each line shows those elements in the deque between the head and tail pointers.) The characters appear in the lefthand column of the table at the point when the program has finished scanning them.
     5 scan
     2 6 scan
     1 3 6 scan
     3 6 scan 2
     6 scan 2
A    scan 2 7
     2 7 scan
     1 3 7 scan
     3 7 scan 2
     7 scan 2
A    scan 2
     2 scan
     1 3 scan
     3 scan
B    scan 4
     4 scan
     8 scan
D    scan 9
     9 scan
     0 scan
Thus, we start with State 5 while scanning the first character. First State 5 leads to States 2 and 6, then State 2 leads to States 1 and 3, all of which need to scan the same character and are at the beginning of the deque. Then State 1 leads to State 2, but at the end of the deque (for the next input character). State 3 leads to another state only while scanning a B, so it is ignored while an A is being scanned. When the "scan" sentinel finally reaches the front of the deque, we see that the machine could be either in State 2 or State 7 after scanning an A. Continuing, the program eventually ends up in the final state, after considering all transitions consistent with the text string.
The running time of this program obviously depends very heavily on the pattern being matched. However, for each of the N input characters, it processes at most M states of the machine, so the worst-case running time is proportional to MN. To be sure, not all nondeterministic machines can be simulated so efficiently, as discussed in more detail in Chapter 40, but the use of a simple hypothetical pattern-matching machine in this application leads to a quite reasonable algorithm for a quite difficult problem. However, to complete the algorithm, we need a program which translates arbitrary regular expressions into "machines" for interpretation by the above code. In the next chapter, we'll look at the implementation of such a program in the context of a more general discussion of compilers and parsing techniques.
Exercises

1. Give a regular expression for recognizing all occurrences of four or fewer consecutive 1's in a binary string.
2. Draw the nondeterministic pattern matching machine for the pattern description (A+B)*+C.
3. Give the state transitions your machine from the previous exercise would make to recognize ABBAC.
4. Explain how you would modify the nondeterministic machine to handle the "not" function.
5. Explain how you would modify the nondeterministic machine to handle "don't-care" characters.
6. What would happen if match were to try to simulate the following machine?
7. Modify match to handle regular expressions with the "not" function and "don't-care" characters.
8. Show how to construct a pattern description of length M and a text string of length N for which the running time of match is as large as possible.
9. Why must the deque in match have only one "scan" sentinel in it?
10. Show the contents of the deque each time a state is removed when match is used to simulate the example machine in the text with the text string ACD.
21. Parsing
Several fundamental algorithms have been developed to recognize legal computer programs and to decompose their structure into a form suitable for further processing. This operation, called parsing, has application beyond computer science, since it is directly related to the study of the structure of language in general. For example, parsing plays an important role in systems which try to "understand" natural (human) languages and in systems for translating from one language to another. One particular case of interest is translating from a "high-level" computer language like Pascal (suitable for human use) to a "low-level" assembly or machine language (suitable for machine execution). A program for doing such a translation is called a compiler.
Two general approaches are used for parsing. Top-down methods look for a legal program by first looking for parts of a legal program, then looking for parts of parts, etc., until the pieces are small enough to match the input directly. Bottom-up methods put pieces of the input together in a structured way, making bigger and bigger pieces until a legal program is constructed. In general, top-down methods are recursive and bottom-up methods are iterative; top-down methods are thought to be easier to implement, bottom-up methods to be more efficient.
A full treatment of the issues involved in parser and compiler construction would clearly be beyond the scope of this book. However, by building a simple "compiler" to complete the pattern-matching algorithm of the previous chapter, we will be able to consider some of the fundamental concepts involved. First we'll construct a top-down parser for a simple language for describing regular expressions. Then we'll modify the parser to make a program which translates regular expressions into pattern-matching machines for use by the match procedure of the previous chapter.
Our intent in this chapter is to give some feeling for the basic principles of parsing and compiling while at the same time developing a useful pattern matching algorithm. Certainly we cannot treat the issues involved at the level of depth that they deserve. The reader should be warned that subtle difficulties are likely to arise in applying the same approach to similar problems, and advised that compiler construction is a quite well-developed field with a variety of advanced methods available for serious applications.
Context-Free Grammars
Before we can write a program to determine whether a program written in a given language is legal, we need a description of exactly what constitutes a legal program. This description is called a grammar: to appreciate the terminology, think of the language as English and read "sentence" for "program" in the previous sentence (except for the first occurrence!). Programming languages are often described by a particular type of grammar called a context-free grammar. For example, the context-free grammar which defines the set of all legal regular expressions (as described in the previous chapter) is given below.
(expression) ::= (term) | (term) + (expression)
(term) ::= (factor) | (factor)(term)
(factor) ::= ((expression)) | v | (factor)*
This grammar describes regular expressions like those that we used in the last chapter, such as (1+01)*(0+1) or (A*B+AC)D. Each line in the grammar is called a production or replacement rule. The productions consist of terminal symbols (, ), + and *, which are the symbols used in the language being described (v, a special symbol, stands for any letter or digit); nonterminal symbols (expression), (term), and (factor), which are internal to the grammar; and metasymbols ::= and |, which are used to describe the meaning of the productions. The ::= symbol, which may be read "is a," defines the left-hand side of the production in terms of the right-hand side; and the | symbol, which may be read as "or," indicates alternative choices. The various productions, though expressed in this concise symbolic notation, correspond in a simple way to an intuitive description of the grammar. For example, the second production in the example grammar might be read "a (term) is a (factor) or a (factor) followed by a (term)." One nonterminal symbol, in this case (expression), is distinguished in the sense that a string of terminal symbols is in the language described by the grammar if and only if there is some way to use the productions to derive that string from the distinguished nonterminal by replacing (in any number of steps) a nonterminal symbol by any of the "or" clauses on the right-hand side of a production for that nonterminal symbol.
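For example (an illustrative derivation, not one given in the text), the string A+B can be derived from (expression) as follows, choosing a particular letter each time the special symbol v is used:

(expression) -> (term) + (expression)
             -> (factor) + (expression)
             -> A + (expression)
             -> A + (term)
             -> A + (factor)
             -> A + B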
One natural way to describe the result of this derivation process is called a parse tree: a diagram of the complete grammatical structure of the string being parsed. For example, the following parse tree shows that the string (A*B+AC)D is in the language described by the above grammar.

The circled internal nodes labeled E, F, and T represent (expression), (factor), and (term), respectively. Parse trees like this are sometimes used for English, to break down a "sentence" into "subject," "verb," "object," etc.
The main function of a parser is to accept strings which can be so derived and reject those that cannot, by attempting to construct a parse tree for any given string. That is, the parser can recognize whether a string is in the language described by the grammar by determining whether or not there exists a parse tree for the string. Top-down parsers do so by building the tree starting with the distinguished nonterminal at the top, working down towards the string to be recognized at the bottom; bottom-up parsers do this by starting with the string at the bottom, working backwards up towards the distinguished nonterminal at the top.

As we'll see, if the strings being recognized also have meanings implying further processing, then the parser can convert them into an internal representation which can facilitate such processing.
Another example of a context-free grammar may be found in the appendix of the Pascal User Manual and Report: it describes legal Pascal programs. The principles considered in this section for recognizing and using legal expressions apply directly to the complex job of compiling and executing Pascal programs. For example, the following grammar describes a very small subset of Pascal: arithmetic expressions involving addition and multiplication.
(expression) ::= (term) | (term) + (expression)
(term) ::= (factor) | (factor)*(term)
(factor) ::= ((expression)) | v
Again, v is a special symbol which stands for any letter, but in this grammar the letters are likely to represent variables with numeric values. Examples of legal strings for this grammar are A+(B*C) and (A+B*C)*D*(A+(B+C)).
As we have defined things, some strings are perfectly legal both as arithmetic expressions and as regular expressions. For example, A*(B+C) might mean "add B to C and multiply the result by A" or "take any number of A's followed by either B or C." This points out the obvious fact that checking whether a string is legally formed is one thing, but understanding what it means is quite another. We'll return to this issue after we've seen how to parse a string to check whether or not it is described by some grammar.

Each regular expression is itself an example of a context-free grammar: any language which can be described by a regular expression can also be described by a context-free grammar. The converse is not true: for example, the concept of "balancing" parentheses can't be captured with regular expressions. Other types of grammars can describe languages which can't be described by context-free grammars. For example, context-sensitive grammars are the same as those above except that the left-hand sides of productions need not be single nonterminals. The differences between classes of languages and a hierarchy of grammars for describing them have been very carefully worked out and form a beautiful theory which lies at the heart of computer science.
Top-Down Parsing
One parsing method uses recursion to recognize strings from the language described exactly as specified by the grammar. Put simply, the grammar is such a complete specification of the language that it can be turned directly into a program!
Each production corresponds to a procedure with the name of the nonterminal on the left-hand side. Nonterminals on the right-hand side of the production correspond to (possibly recursive) procedure calls; terminals correspond to scanning the input string. For example, the following procedure is part of a top-down parser for our regular expression grammar: