procedure expression;begin term ; if plj]=‘+’ then begin j:=j+ 1; expression end end ; An array p contains the regular expre:;sion being parsed, with an index j pointing to the character
Trang 1procedure expression;
begin
term ;
if plj]=‘+’ then begin j:=j+ 1; expression end end ;
An array p contains the regular expre:;sion being parsed, with an index j pointing to the character currently begin examined To parse a given regular expression, we put it in p[l M], (with a sentinel character in p[M+l] which
is not used in the grammar) set j to 1, and call expression If this results in
j being set to M+1, then the regular ex 3ression is in the language described
by the grammar Otherwise, we’ll see below how various error conditions are handled
The first thing that expression does is call term, which has a slightly more
complicated implementation:
procedure term ;
begin
fact x-;
if (1: b]=‘( ‘) or letter(ptj]) then term; end
A direct translation from the grammar would simply have term call factor
and then term This obviously won’t work because it leaves no way to exit from term: this program would go into an infinite recursive loop if called (Such loops have particularly unpleasant effects in many systems.) The implementation above gets around this by first checking the input to decide whether term should be called l’he first thing that term does is call
factor, which is the only one of the proc:dures that could detect a mismatch
in the input From the grammar, we know that when factor is called, the current input character must be either :L “(” or an input letter (represented
by u) This process of checking the nez- t character (without incrementing j
to decide what to do is called lookahead For some grammars, this is not
necessary; for others even more lookahead is required
Now, the implementation of factor fallows directly from the grammar If
the input character being scanned is not a “(” or an input letter, a procedure
error is called to handle the error condit on:
Trang 2procedure factor;
begin
if pb]=‘(‘then
begin
j:=j+l;
expression ;
if p b] = ‘) ’ then j : =j+ 1 else error
end
else if letter(plj]) then j:=j+l else error;
if pb]=‘*‘then j:=j+l;
end ;
Another error condition occurs when a “)” is missing
These procedures are obviously recursive; in fact they are so intertwined that they can’t be compiled in Pascal without using the forward construct
to get around the rule that a procedure can’t be used without first being declared
The parse tree for a given string gives the recursive cal! structure during parsing The reader may wish to refer to the tree above and trace through the operation of the above three procedures when p contains (A*B+AC)D and expression is called with j=1 This makes the origin of the “top-down” name obvious Such parsers are also often called recursive descent parsers because they move down the parse tree recursively
The top-down approach won’t work for all possible context-free gram-mars For example, if we had the production (expression) ::= v 1 (expression) + (term) then we would have
procedure badexpression ;
begin
if letter(pb]) then j:=j+l else
begin
badexpression ;
if p b] < > ‘+ ’ then error else
begin j:=j+l; term end
end
end ;
If this procedure were called with plj] a nonletter (as in our example, for j=l) then it would go into an infinite recursive loop Avoiding such loops is
a principal difficulty in the implementation of recursive descent parsers For
Trang 3term, we used lookahead to avoid such a loop; in this case the proper way to
get around the problem is to switch the grammar to say (term)+(expression) The occurrence of a nonterminal as the first thing on the right hand side of
a replacement rule for itself is called left recursion Actually, the problem
is more subtle, because the left recursion can arise indirectly: for example
if we were to have the productions (expression) ::= (term) and (term) ::=
v 1 (expression) + (term) Recursive descent parsers won’t work for such
grammars: they have to be transformed to equivalent grammars without left recursion, or some other parsing method has to be used In general, there
is an intimate and very widely studied connection between parsers and the grammars they recognize The choice of a parsing technique is often dictated
by the characteristics of the grammar to be parsed
Bottom- Up Parsing
Though there are several recursive calls in the programs above, it is an in-structive exercise to remove the recursion systematically Recall from Chapter
9 (where we removed the recursion from Quicksort) that each procedure call can be replaced by a stack push and each procedure return by a stack pop, mimicking what the Pascal system does to implement recursion A reason for doing this is that many of the calls which seem recursive are not truly recursive When a procedure call is the last action of a procedure, then a simple goto can be used This turns expression and term into simple loops,
which can be incorporated together and combined with factor to produce a
single procedure with one true recursive call (the call to expression within factor)
This view leads directly to a quite simple way to check whether regular expressions are legal Once all the procedure calls are removed, we see that each terminal symbol is simply scanned as it is encountered The only real processing done is to check whether there is a right parenthesis to match each left parenthesis and whether each ‘I+” is followed by either a letter or a “(I’ That is, checking whether a regular expression is legal is essentially equivalent
to checking for balanced parentheses This can be simply implemented by keeping a counter, initialized to 0, which is incremented when a left paren-thesis is encountered, decremented when a right parenparen-thesis is encountered
If the counter is zero when the end of the expression is reached, and each ‘I+”
of the expression is followed by either a letter or a “(“, then the expression was legal
Of course, there is more to parsing than simply checking whether the input string is legal: the main goal is to build the parse tree (even if in an implicit way, as in the top-down parser) for other processing It turns out to
be possible to do this with programs with the same essential structure as the parenthesis checker described in the previous paragraph One type of parser
Trang 4which works in this way is the ‘so-called shift-reduce parser The idea is to maintain a pushdown stack which holds terminal and nonterminal symbols Each step in the parse is either a shift step, in which the next input character
is simply pushed onto the stack, or a reduce step, in which the top characters
on the stack are matched to the right-hand side of some production in the grammar and “reduced to” (replaced by) the nonterminal on the left side
of that production Eventually all the input characters get shifted onto the stack, and eventually the stack gets reduced to a single nonterminal symbol The main difficulty in building a shift-reduce parser is deciding when to shift and when to reduce This can be a complicated decision, depending
on the grammar Various types of shift-reduce parsers have been studied in great detail, an extensive literature has been developed on them, and they are quite often preferred over recursive descent parsers because they tend to be slightly more efficient and significantly more flexible Certainly we don’t have space here to do justice to this field, and we’ll forgo even the details of an implementation for our example
Compilers
A compiler may be thought of as a program which translates from one lan-guage to another For example, a Pascal compiler translates programs from the Pascal language into the machine language of some particular computer We’ll illustrate one way that this might be done by continuing with our regular-expression pattern-matching example, where we wish to translate from the language of regular expressions to a “language” for pattern-matching machines, the ch, nextl, and next2 arrays of the match program of the pre-vious chapter
Essentially, the translation process is “one-to-one”: for each character in the pattern (with the exception of parentheses) we want to produce a state for the pattern-matching machine (an entry in each of the arrays) The trick
is to keep track of the information necessary to fill in the next1 and next2 arrays To do so, we’ll convert each of the procedures in our recursive descent parser into functions which create pattern-matching machines Each function will add new states as necessary onto the end of the ch, nextl, and next2
arrays, and return the index of the initial state of the machine created (the final state will always be the last entry in the arrays)
For example, the function given below for the (expression) production creates the “or” states for the pattern matching machine
Trang 5function expression : integer;
var tl, t2: integer;
begin
tl : = term ; expression : = tl ;
if plj]=‘+’ then begin
j:=j+l; state:=state+I;
t2:=state; expression:=t2; state:=state+l; setstate(t2, ’ ‘, expression, tl ) ;
setstate(t2-I, ’ ‘, state, state);
end ; end ;
This function uses a procedure setstate which simply sets the ch, nextl, and next2 array entries indexed by the first argument to the values given in the second, third, and fourth arguments, respectively The index state keeps track
of the “current” state in the machine being built Each time a new state is created, state is simply incremented Thus, the state indices for the machine corresponding to a particular procedure call range between the value of state
on entry and the value of state on exit The final state index is the value
of state on exit (We don’t actually “create” the final state by incrementing state before exiting, since this makes it easy to “merge” the final state with later initial states, as we’ll see below.)
With this convention, it is easy to check (beware of the recursive call!) that the above program implements the rule for composing two machines with the “or” operation as diagramed in the previous chapter First the machine for the first part of the expression is built (recursively), then two new null states are added and the second part of the expression built The first null state (with index t2 1) is the final state of the machine of the first part of the expression which is made into a “no-op” state to skip to the final state for the machine for the second part of the expression, as required The second null state (with index t2) is the initial state, so its index is the return value for expression and its next1 and next2 entries are made to point to the initial states of the two expressions Note carefully that these are constructed in the opposite order than one might expect, because the value of state for the no-op state is not known until the recursive call to expression has been made The function for (term) first builds the machine for a (factor) then, if necessary, merges the final state of that machine with the initial state of the machine for another (term) This is easier done than said, since state is the final state index of the call to factor A call to term without incrementing state does the trick:
Trang 6function term ;
var t: integer;
begin
term :=factor;
if (pb]=‘(‘) or letter(p[j]) then t:=term
end ;
(We have no use for the initial state index returned by the second call to term, but Pascal requires us to put it, somewhere, so we throw it away in a temporary variable t.)
The function for (factor) uses similar techniques to handle its three cases:
a parenthesis calls for a recursive call on expression; a v calls for simple concatenation of a new state; and a * calls for operations similar to those in expression, according to the closure diagram from the previous section:
function factor;
var tl, t2: integer;
begin
tl :=state;
if plj]=‘(‘then
begin
j:=j+l; t2:=expression;
if p b] = ‘) ’ then j := j+ 1 else error
end
else if letter(pb]) then
begin
setstate(state,plj], state+l, 0);
t2:=state; j:=j+l; state:=state+I
end
else error;
if p[j]<>‘*‘then factor:=t2 else
begin
setstate(state, ’ ‘, state+l, t2);
factor:=state; next1 [tl-I] :=state;
j:=j+l; state:=state+l;
end ;
end ;
The reader may find it instructive to trace through the construction of the machine for the pattern (A*B+AC)D given in the previous chapter
Trang 7The final step in the development >f a general regular expression pat-tern matching algorithm is to put these procedures together with the match procedure, as follows:
j:==l; state:=l;
ne Ytl [0] :=expression;
setstate(state, ’ ‘, 0,O);
foI i:=l to N-l do
if match(i)>=i then writeln(i);
This program will print out all character positions in a text string a[l N] where a pattern p[l .M] leads to a match
Compiler-Compilers
The program for general regular expresr:ion pattern matching that we have developed in this and the previous chapter is efficient and quite useful A version of this program with a few added capabilities (for handling “don’t-care” characters and other amenities) is likely to be among the most heavily used utilities on many computer systems
It is interesting (some might say confusing) to reflect on this algorithm from a more philosophical point of view In this chapter, we have considered parsers for unraveling the structure of regular expressions, based on a formal description of regular expressions using a context-free grammar Put another way, we used the context-free gramma]’ to specify a particular “pattern”: sequences of characters with legally balz.nced parentheses The parser then checks to see if the pattern occurs in the input (but only considers a match legal if it covers the entire input string) Thus parsers, which check that an input string is in the set of strings defined by some context-free grammar, and pattern matchers, which check that an input string is in the set of strings defined by some regular expression, are essentially performing the same function! The principal difference is that context-free grammars are capable
of describing a much wider class of strings For example, the set of all regular expressions can’t be described with regular expressions
Another difference in the way we’ve implemented the programs is that the context-free grammar is “built in” to the parser, while the match procedure
is “table-driven”: the same program wol,ks for all regular expressions, once they have been translated into the propel format It turns out to be possible
to build parsers which are table-driven In the same way, so that the same program can be used to parse all language 3 which can be described by context-free grammars A parser generator is a program which takes a grammar as
input and produces a parser for the language described by that grammar as
Trang 8output This can be carried one step further: it is possible to build compilers which are table-driven in terms of both the input and the output languages A compiler-compiler is a program which takes two grammars (and some formal specification of the relationships between them) as input and produces a compiler which translates strings from one language to the other as output Parser generators and compiler-compilers are available for general use in many computing environments, and are quite useful tools which can be used
to produce efficient and reliable parsers and compilers with a relatively small amount of effort On the other hand, top-down recursive descent parsers of the type considered here are quite serviceable for simple grammars which arise in many applications Thus, as with many of the algorithms we have considered,
we have a straightforward method which can be used for applications where
a great deal of implementation effort might not be justified, and several ad-vanced methods which can lead to significant performance improvements for large-scale applications Of course, in this case, this is significantly understat-ing the point: we’ve only scratched the surface of this extensively researched
Trang 91
2
3
4
5
6
7
8
9
10
How does the recursive descent parser find an error in a regular expression such as (A+B)*BC+ which is incomplete?
Give the parse tree for the regular expression ((A+B)+(Ct-D)*)*
Extend the arithmetic expression grammar to include exponentiation, div and mod
Give a context-free grammar to d’:scribe all strings with no more than two consecutive 1’s
How many procedure calls are used by the recursive descent parser to recognize a regular expression in terms of the number of concatenation,
or, and closure operations and the number of parentheses?
Give the ch, next1 and next2 arrays that result from building the pattern matching machine for the pattern ((A+B)+(C+D)*)*
Modify the regular expression grammar to handle the “not” function and
“don’t-care” characters
Build a general regular expression pattern matcher based on the improved grammar in your answer to the previous question
Remove the recursion from the recursive descent compiler, and simplify the resulting code as much as possible Compare the running time of the nonrecursive and recursive methods
Write a compiler for simple arithmetic expressions described by the gram-mar in the text It should produce a list of ‘*instructions” for a machine capable of three operations: Pugh the value of a variable onto a stack;
add the top two values on the stick, removing them from the stack, then putting the result there; and mt.ltiply the top two values on the stack, in the same way