Chapter 4: Lexical and Syntax Analysis
• Language implementation systems must analyze source code, regardless of the specific implementation approach: compilation, pure interpretation, or a hybrid method
• Nearly all syntax analysis is based on a formal description of the syntax of the source language (CFG or BNF)
Using BNF to Describe Syntax
• BNF provides a clear and concise syntax description
• The parser can be based directly on the BNF
• Parsers based on BNF are easy to maintain
Syntax Analysis
• The syntax analysis portion of a language processor nearly always consists of two parts:
– A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar)
– A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)
Reasons to Separate Lexical and Syntax Analysis
• Simplicity - less complex approaches can be
used for lexical analysis; separating them
simplifies the parser
• Efficiency - separation allows optimization of
the lexical analyzer
• Portability - parts of the lexical analyzer may
not be portable, but the parser always is
portable
Lexical Analysis
• A lexical analyzer is a pattern matcher for
character strings
• A lexical analyzer is a “front-end” for the parser
• A lexical analyzer identifies substrings of the source program that belong together, called lexemes
– Lexemes match a character pattern, which is associated with a lexical category called a token
– sum is a lexeme; its token may be IDENT
Lexical Analysis (cont.)
• The lexical analyzer is usually a function that is called by the parser when it needs the next token
• The lexical analysis process also:
– Includes skipping comments, tabs, newlines, and blanks
– Inserts lexemes for user-defined names (strings, identifiers, numbers) into the symbol table
– Saves source locations (file, line, column) for error messages
– Detects and reports lexical errors in tokens, such as an ill-formed numeric literal
Lexical Analysis (cont.)
• Three main approaches to building a scanner:
1. Write a formal description of the token patterns and use a software tool (such as lex) that constructs a lexical analyzer from such a description
2. Design a state diagram that describes the token patterns and write a program that implements the diagram
3. Design a state diagram that describes the token patterns and hand-construct a table-driven implementation of the state diagram
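Approach 1 is what scanner generators such as lex and flex implement: the token patterns are written as regular expressions, and the tool produces the table-driven analyzer. A hypothetical fragment of such a specification (the token codes are assumptions for illustration, not from any particular course project):

```lex
%{
/* hypothetical token codes, for illustration only */
#define IDENT    11
#define INT_LIT  10
#define PLUS_OP  21
%}
%%
[ \t\n]+               ;  /* skip white space */
[A-Za-z][A-Za-z0-9]*   { return IDENT; }
[0-9]+                 { return INT_LIT; }
"+"                    { return PLUS_OP; }
%%
```

Each rule pairs a regular expression with an action; the generated scanner applies the longest-match rule automatically.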
The “longest possible lexeme” rule
• The scanner returns to the parser only when the next character cannot be used to continue the
current token
– The next character will generally need to be saved for the next token
• In some cases, you may need to peek at more
than one character of look-ahead in order to
know whether to proceed
– In Pascal, when you have a 3 and you see a '.'
• do you proceed (in hopes of getting 3.14)? or do you stop, since the '.' might begin the subrange operator '..' (as in 3..14)?
• In messier cases, you may not be able to get by with any fixed amount of look-ahead. In Fortran, for example, we have
DO 5 I = 1,25   a loop header
DO 5 I = 1.25   an assignment (Fortran ignores blanks, so this assigns to the variable DO5I)
• Here, we need to remember we were in a
potentially final state, and save enough
information that we can back up to it, if we get stuck later
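The Pascal '.' case above can be handled with one character of look-ahead plus the longest-possible-lexeme rule. A minimal sketch in C, assuming a simplified token set (TOK_INT, TOK_REAL, TOK_DOTDOT are hypothetical names introduced here for illustration):

```c
#include <ctype.h>

/* Hypothetical token codes for this sketch */
enum { TOK_INT, TOK_REAL, TOK_DOTDOT, TOK_EOF };

/* Scan one token from src starting at *pos, applying the
   longest-possible-lexeme rule: after a digit string and a '.',
   peek one character further to decide between a real literal
   (3.14) and an integer followed by the subrange operator (3..14).
   Advances *pos past the recognized lexeme. */
static int next_token(const char *src, int *pos) {
    int i = *pos;
    if (src[i] == '\0') return TOK_EOF;
    if (src[i] == '.' && src[i + 1] == '.') {
        *pos = i + 2;
        return TOK_DOTDOT;
    }
    while (isdigit((unsigned char)src[i])) i++;
    if (src[i] == '.' && isdigit((unsigned char)src[i + 1])) {
        /* proceed: a digit follows the '.', so this is a real */
        i++;
        while (isdigit((unsigned char)src[i])) i++;
        *pos = i;
        return TOK_REAL;
    }
    /* back up: the '.' starts the next token (e.g. '..') */
    *pos = i;
    return TOK_INT;
}
```

On "3.14" the scanner consumes one TOK_REAL; on "3..14" it must stop after the 3, saving the '.' as the start of the next token.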
State Diagram Design
• Suppose we need a lexical analyzer that only recognizes program names, reserved words, and integer literals
• A naïve state diagram would have a transition from every state on every character in the source language; such a diagram would be very large!
State Diagram Design (cont.)
• In many cases, transitions can be combined to simplify the state diagram
– When recognizing an identifier, all uppercase and lowercase letters are equivalent, so a single character class can be used for all of them
State Diagram Design (cont.)
• Convenient utility subprograms:
– getChar - gets the next character of input, puts
it in global variable nextChar, determines its
class and puts the class in global variable
charClass
– addChar - appends the character in nextChar to lexeme, the global variable in which the lexeme is being accumulated
– lookup - determines whether the string in
lexeme is a reserved word (returns a code)
State Diagram (figure)
Lexical Analysis - Implementation
(code listing for the lex() function)
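The slides show a switch-based lex() built on the utility subprograms getChar, addChar, and lookup. A self-contained sketch of such a function, reading from a string rather than a file (the token codes and the string-based input setup are assumptions for illustration):

```c
#include <ctype.h>
#include <string.h>

/* Character classes */
#define LETTER    0
#define DIGIT     1
#define UNKNOWN  99

/* Token codes (assumed for illustration) */
#define INT_LIT   10
#define IDENT     11
#define EOF_TOKEN -1

static const char *input;   /* source text */
static int inPos;           /* position of the next character */
static char lexeme[100];
static int lexLen;
static int charClass;
static char nextChar;

/* getChar - get the next character of input and determine its class */
static void getChar(void) {
    nextChar = input[inPos];
    if (nextChar != '\0') {
        inPos++;
        if (isalpha((unsigned char)nextChar)) charClass = LETTER;
        else if (isdigit((unsigned char)nextChar)) charClass = DIGIT;
        else charClass = UNKNOWN;
    } else
        charClass = EOF_TOKEN;
}

/* addChar - append nextChar to the lexeme being accumulated */
static void addChar(void) {
    lexeme[lexLen++] = nextChar;
    lexeme[lexLen] = '\0';
}

/* lex - a simple lexical analyzer for identifiers and integers */
static int lex(void) {
    lexLen = 0;
    while (charClass != EOF_TOKEN && isspace((unsigned char)nextChar))
        getChar();                        /* skip white space */
    switch (charClass) {
    case LETTER:                          /* identifiers */
        addChar(); getChar();
        while (charClass == LETTER || charClass == DIGIT) {
            addChar(); getChar();
        }
        return IDENT;
    case DIGIT:                           /* integer literals */
        addChar(); getChar();
        while (charClass == DIGIT) {
            addChar(); getChar();
        }
        return INT_LIT;
    case EOF_TOKEN:
        strcpy(lexeme, "EOF");
        return EOF_TOKEN;
    default:                              /* single-character tokens */
        addChar(); getChar();
        return UNKNOWN;
    }
}
```

The parser calls lex() each time it needs the next token; lex() leaves the matched characters in lexeme and returns the token code.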
A part of a Pascal scanner
• We read the characters one at a time, with look-ahead
• If it is one of the one-character tokens
{ ( ) [ ] < > , ; = + - }
we announce that token
• If it is a '.', we look at the next character
– If that is a dot, we announce '..'
– Otherwise, we announce '.' and reuse the look-ahead
A part of a Pascal scanner (cont.)
• If it is a '<', we look at the next character
– if that is an '=', we announce '<='
– otherwise, we announce '<' and reuse the look-ahead
• If it is a digit, we keep reading until we find a non-digit
– if that is not a '.', we announce an integer
– otherwise, we keep looking for a real number
– if the character after the '.' is not a digit, we announce an integer and reuse the '.' and the look-ahead
State Diagram (figure)
we skip any initial white space (spaces, tabs, and newlines)
we read the next character
if it is a ( we look at the next character
    if that is a * we have a comment;
        we skip forward through the terminating *)
    otherwise
        we return a left parenthesis and reuse the look-ahead
if it is one of the one-character tokens ([ ] , ; = + - etc.)
    we return that token
if it is a . we look at the next character
    if that is a . we return ..
    otherwise we return . and reuse the look-ahead
if it is a < we look at the next character
    if that is a = we return <=
    otherwise we return < and reuse the look-ahead
etc.
if it is a letter we keep reading letters and digits
    and maybe underscores until we can't anymore;
    then we check to see if it is a keyword
    if so we return the keyword
    otherwise we return an identifier
    in either case we reuse the character beyond the end of the token
if it is a digit we keep reading until we find a nondigit
    if that is not a . we return an integer and reuse the nondigit
    otherwise we keep looking for a real number
        if the character after the . is not a digit
            we return an integer and reuse the . and the look-ahead
etc.
The Parsing Problem
• Goals of the parser, given an input program:
– Find all syntax errors; for each, produce an
appropriate diagnostic message, and recover
quickly
– Produce the parse tree, or at least a trace of the parse tree, for the program
The Parsing Problem (cont.)
• Two categories of parsers:
– Top down - produces the parse tree, beginning at the root
• Order is that of a leftmost derivation
• Traces the parse tree in preorder
– Bottom up - produces the parse tree, beginning at the leaves
• Order is that of the reverse of a rightmost derivation
• Parsers usually look only one token ahead in the input
The Set of Notational Conventions
• Terminal symbols - lowercase letters at the beginning of the alphabet (a, b, …)
• Nonterminal symbols - uppercase letters at the beginning of the alphabet (A, B, …)
• Terminals or nonterminals - uppercase letters at the end of the alphabet (W, X, Y, Z)
• Strings of terminals - lowercase letters at the end of the alphabet (w, x, y, z)
• Mixed strings of terminals and/or nonterminals - lowercase Greek letters (α, β, γ, δ)
The Parsing Problem (cont.)
• Top-down Parsers
– Given a sentential form, xAα, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A
• The most common top-down parsing algorithms:
– Recursive descent - a coded implementation
– LL parsers - a table-driven implementation (the first L stands for left-to-right scan, the second L for leftmost derivation)
The Parsing Problem (cont.)
• Bottom-up parsers
– Given a right sentential form, α, determine what substring of α is the RHS of the rule in the grammar that must be reduced to produce the previous sentential form in the rightmost derivation
– The most common bottom-up parsing algorithms are in the LR family (L stands for left-to-right scan, R for rightmost derivation)
• Consider the following grammar for a comma-separated list of identifiers, terminated by a semicolon:
id_list → id id_list_tail
id_list_tail → , id id_list_tail
id_list_tail → ;
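A recursive-descent parser for this grammar has one subprogram per nonterminal. A minimal sketch in C, where the token stream is simulated as a character string ('i' standing for an id token) — an assumption made here purely for illustration:

```c
/* Tokens are simulated as single characters:
   'i' = id, ',' and ';' stand for themselves. */
static const char *toks;   /* token stream */
static int cur;            /* index of the next token */
static int ok;             /* cleared on a syntax error */

/* Compare the next token with the expected terminal */
static void match(char expected) {
    if (ok && toks[cur] == expected) cur++;
    else ok = 0;
}

/* id_list_tail -> , id id_list_tail | ; */
static void id_list_tail(void) {
    if (!ok) return;
    if (toks[cur] == ',') {       /* choose the RHS by look-ahead */
        match(',');
        match('i');
        id_list_tail();
    } else
        match(';');
}

/* id_list -> id id_list_tail */
static void id_list(void) {
    match('i');
    id_list_tail();
}

/* Parse a token string; returns 1 iff it is a legal id_list. */
static int parse(const char *tokens) {
    toks = tokens; cur = 0; ok = 1;
    id_list();
    return ok && toks[cur] == '\0';
}
```

Note how id_list_tail() decides between its two RHSs by inspecting the single look-ahead token, exactly the LL property discussed later.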
Recursive-Descent Parsing
• Recursive-Descent Process
– There is a subprogram for each nonterminal in the grammar, which can parse sentences that can be generated by that nonterminal
– EBNF is ideally suited as the basis for a recursive-descent parser, because EBNF minimizes the number of nonterminals
• A grammar for simple expressions:
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
Recursive-Descent Parsing (cont.)
• Assume we have a lexical analyzer named lex, which puts the next token code in nextToken
• The coding process when there is only one RHS:
– For each terminal symbol in the RHS, compare it with the next input token; if they match,
continue, else there is an error
– For each nonterminal symbol in the RHS, call its associated parsing subprogram
Function expr()
/* expr
   Parses strings in the language generated by the rule:
   <expr> → <term> {(+ | -) <term>} */
void expr() {
  /* Parse the first term */
  term();
  /* As long as the next token is + or -, call lex() to get the
     next token, then parse the next term */
  while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
    lex();
    term();
  }
} /* End of function expr */
Recursive-Descent Parsing (cont.)
• A nonterminal that has more than one RHS
requires an initial process to determine which
RHS it is to parse
– The correct RHS is chosen on the basis of the next token of input
– The next token is compared with the first token
that can be generated by each RHS until a match is found
– If no match is found, it is a syntax error
Function factor() (cont.)
/* If the RHS is (<expr>), call lex() to pass over the left parenthesis, call expr(), and check for the right parenthesis */
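Putting the pieces together, a complete recursive-descent parser for the simple-expression grammar can be sketched as follows. The token codes and the array-based lex() stub are assumptions for illustration; a real front end would use the lexical analyzer described earlier:

```c
/* Assumed token codes for illustration */
enum { IDENT_CODE, PLUS_CODE, MINUS_CODE, MULT_CODE, DIV_CODE,
       LEFT_PAREN_CODE, RIGHT_PAREN_CODE, END_CODE };

static const int *tokens;   /* simulated token stream */
static int pos;
static int nextToken;
static int errorCount;

/* lex - stub that steps through a pre-tokenized array */
static void lex(void) { nextToken = tokens[pos++]; }

static void error(void) { errorCount++; }

static void expr(void);

/* factor:  <factor> -> id | ( <expr> ) */
static void factor(void) {
    if (nextToken == IDENT_CODE)
        lex();                          /* the RHS is id */
    else if (nextToken == LEFT_PAREN_CODE) {
        lex();                          /* pass over the '(' */
        expr();
        if (nextToken == RIGHT_PAREN_CODE)
            lex();                      /* pass over the ')' */
        else
            error();
    } else
        error();                        /* no RHS matches */
}

/* term:  <term> -> <factor> {(* | /) <factor>} */
static void term(void) {
    factor();
    while (nextToken == MULT_CODE || nextToken == DIV_CODE) {
        lex();
        factor();
    }
}

/* expr:  <expr> -> <term> {(+ | -) <term>} */
static void expr(void) {
    term();
    while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
        lex();
        term();
    }
}

/* Parse a token array terminated by END_CODE;
   returns 1 iff it is a legal <expr>. */
static int parse(const int *toks) {
    tokens = toks; pos = 0; errorCount = 0;
    lex();
    expr();
    return errorCount == 0 && nextToken == END_CODE;
}
```

The structure of each subprogram mirrors its grammar rule, which is what makes recursive descent easy to write and maintain.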
The LL Grammar Class
• The Left Recursion Problem: If a grammar has left recursion, either direct or indirect, it cannot be the basis for a top-down parser
– A grammar can be modified to remove left recursion
• Example: consider the following rule
A → A + B
– A recursive-descent parser subprogram for A immediately calls itself to parse the first symbol of its RHS, so it recurses forever without consuming any input
Pairwise Disjointness Test
• The other characteristic of grammars that disallows top-down parsing is the lack of pairwise disjointness
– The inability to determine the correct RHS on the basis of one token of lookahead
– FIRST(α) = {a | α ⇒* aβ} (if α ⇒* ε, then ε ∈ FIRST(α))
• Pairwise Disjointness Test
– For each nonterminal, A, in the grammar that has more than one RHS, and for each pair of rules, A → αi and A → αj, it must be true that FIRST(αi) ∩ FIRST(αj) = ∅
– For example, A → aB | bAb passes the test, while A → aB | aAb fails, because both RHSs begin with a
Left factoring
• Left factoring can resolve the lack of pairwise disjointness
• Example: consider the rules
<variable> → identifier | identifier [<expression>]
– The two rules can be replaced by
<variable> → identifier <new>
<new> → ε | [<expression>]
– or
<variable> → identifier [ [<expression>] ]
(the outer brackets are metasymbols of EBNF, indicating an optional part)
Bottom-up Parsing
• The process of bottom-up parsing produces the reverse of a rightmost derivation
• A bottom-up parser starts with the input
sentence and produces the sequence of
sentential forms from there until all that remains
is the start symbol
• In each step, the task of the bottom-up parser is finding the correct RHS in a right sentential form
to reduce to get the previous right sentential
form in the derivation
• The sentential form E + T * id includes three RHSs: E + T, T, and id. Only one of these is the correct one to be rewritten
– If the RHS E + T were chosen to be rewritten in this sentential form, the resulting sentential form would be E * id, but E * id is not a legal right sentential form for the grammar
Example: Parse Tree of the Sentential Form E + T * id
• The phrases of the sentential form E + T * id are E + T * id, T * id, and id
• The only simple phrase is id
• The handle of a rightmost sentential form is the leftmost simple phrase
(figure: parse tree with nodes E, T, and F)
Example: Consider the string (8)
LR Parsers
• Many different bottom-up parsing algorithms have been devised; most of these are variations of a process called LR parsing
– L means it scans the input string left to right; R means it produces a rightmost derivation
• The original LR algorithm was designed by Donald Knuth (1965); this algorithm is sometimes called canonical LR
Advantages of LR parsers
• They will work for nearly all grammars that
describe programming languages
• They work on a larger class of grammars than
other bottom-up algorithms, but are as efficient
as any other bottom-up parser
• They can detect syntax errors as soon as it is possible on a left-to-right scan
• The LR class of grammars is a superset of the
class parsable by LL parsers
Configurations
• The contents of the parse stack for an LR parser have the following form:
S0 X1 S1 … Xm Sm      (top of stack at the right)
where the Si are state symbols and the Xi are grammar symbols
• An LR parsing table has two parts:
– The ACTION part has state symbols as its row labels and terminal symbols as its column labels
– The GOTO part has state symbols as its row labels and nonterminal symbols as its column labels
Configurations (cont.)
• The input string has a '$' at its right end, used for normal termination of the parser
• An LR parser configuration is a pair of strings (stack, input), with the detailed form
(S0 X1 S1 … Xm Sm, ai ai+1 … an $)
• The initial configuration of an LR parser is
(S0, a1 a2 … an $)
The Parser Actions (table: Shift, Reduce, Accept, and Error)
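The four actions drive a simple table-driven loop. A minimal sketch for the toy grammar E → E + n | n; the ACTION and GOTO tables below were constructed by hand for this toy grammar (they are not the tables for the arithmetic-expression grammar used elsewhere in these slides):

```c
/* Toy grammar:   rule 1: E -> E + n      rule 2: E -> n
   Terminal indexes: 0 = n, 1 = +, 2 = $ (end marker).
   ACTION encoding: 10+K = shift to state K, -K = reduce by rule K,
   99 = accept, 0 = error. */
enum { T_n, T_plus, T_end };
static const int ACTION[5][3] = {
    /*         n      +      $   */
    /* 0 */ { 12,     0,     0  },
    /* 1 */ {  0,    13,    99  },
    /* 2 */ {  0,    -2,    -2  },
    /* 3 */ { 14,     0,     0  },
    /* 4 */ {  0,    -1,    -1  },
};
static const int GOTO_E[5] = { 1, 0, 0, 0, 0 };  /* GOTO[state][E] */
static const int RHS_LEN[3] = { 0, 3, 1 };       /* RHS lengths of rules */

/* Returns 1 iff the token string (ending in T_end) is accepted. */
static int lr_parse(const int *input) {
    int stack[100];      /* stack of state symbols */
    int top = 0;
    stack[0] = 0;        /* initial configuration: (S0, input) */
    int i = 0;
    for (;;) {
        int act = ACTION[stack[top]][input[i]];
        if (act >= 10 && act < 99) {      /* Shift */
            stack[++top] = act - 10;
            i++;
        } else if (act < 0) {             /* Reduce by rule -act */
            top -= RHS_LEN[-act];         /* pop the RHS states */
            stack[top + 1] = GOTO_E[stack[top]];
            top++;                        /* push the GOTO state */
        } else if (act == 99) {
            return 1;                     /* Accept */
        } else {
            return 0;                     /* Error */
        }
    }
}
```

Shift pushes the next state and advances the input; Reduce pops one state per RHS symbol and pushes the GOTO entry for the LHS nonterminal, which is exactly the reverse of a rightmost derivation step.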
Example: The Grammar for Arithmetic Expressions
1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → ( E )
6. F → id
A Trace of a Parse of the String id + id * id (table)
Summary
• The parser:
– Detects syntax errors
– Produces a parse tree
• A recursive-descent parser is an LL parser
• The parsing problem for bottom-up parsers: find the substring of the current sentential form that must be reduced
• The LR family of shift-reduce parsers is the most common bottom-up parsing approach