Chapter 4: Lexical and Syntax Analysis
• Language implementation systems must analyze source code, regardless of the specific implementation approach: compilation, pure interpretation, or a hybrid method
• Nearly all syntax analysis is based on a formal description of the syntax of the source language (CFG or BNF)
Using BNF to Describe Syntax
• BNF provides a clear and concise syntax description
• The parser can be based directly on the BNF
• Parsers based on BNF are easy to maintain
Syntax Analysis
• The syntax analysis portion of a language processor nearly always consists of two parts:
– A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar)
– A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)
Reasons to Separate Lexical and Syntax Analysis
• Simplicity - less complex approaches can be
used for lexical analysis; separating them
simplifies the parser
• Efficiency - separation allows optimization of
the lexical analyzer
• Portability - parts of the lexical analyzer may
not be portable, but the parser always is
portable
Lexical Analysis
• A lexical analyzer is a pattern matcher for
character strings
• A lexical analyzer is a “front-end” for the parser
• A lexical analyzer identifies substrings of the source program that belong together, called lexemes
– Lexemes match a character pattern, which is associated with a lexical category called a token
– sum is a lexeme; its token may be IDENT
Lexical Analysis (cont.)
• The lexical analyzer is usually a function that is called by the parser when it needs the next token
• The lexical analysis process also:
– Includes skipping comments, tabs, newlines, and blanks
– Inserts lexemes for user-defined names (strings, identifiers, numbers) into the symbol table
– Saves source locations (file, line, column) for error messages
– Detects and reports lexical errors in tokens, such as an ill-formed numeric literal
Lexical Analysis (cont.)
• Three main approaches to building a scanner:
1. Write a formal description of the token patterns and use a software tool (such as lex) that constructs a lexical analyzer from such a description
2. Design a state diagram that describes the token patterns and write a program that implements the diagram
3. Design a state diagram that describes the token patterns and hand-construct a table-driven implementation of the state diagram
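Approach 1 is what scanner generators such as lex and flex implement: the token patterns are written as regular expressions, and the tool produces the table-driven analyzer. A hypothetical fragment of such a specification (the token codes are assumptions for illustration, not from any particular course project):

```lex
%{
/* hypothetical token codes, for illustration only */
#define IDENT    11
#define INT_LIT  10
#define PLUS_OP  21
%}
%%
[ \t\n]+               ;  /* skip white space */
[A-Za-z][A-Za-z0-9]*   { return IDENT; }
[0-9]+                 { return INT_LIT; }
"+"                    { return PLUS_OP; }
%%
```

Each rule pairs a regular expression with an action; the generated scanner applies the longest-match rule automatically.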
The “longest possible lexeme” rule
• The scanner returns to the parser only when the next character cannot be used to continue the
current token
– The next character will generally need to be saved for the next token
• In some cases, you may need to peek at more
than one character of look-ahead in order to
know whether to proceed
– In Pascal, when you have a 3 and you see a '.'
• do you proceed (in hopes of getting 3.14)? or do you stop, since the '.' might begin the subrange operator '..' (as in 3..14)?
• In messier cases, you may not be able to get by with any fixed amount of look-ahead. In Fortran, for example, we have
DO 5 I = 1,25   a loop header
DO 5 I = 1.25   an assignment (Fortran ignores blanks, so this assigns to the variable DO5I)
• Here, we need to remember we were in a
potentially final state, and save enough
information that we can back up to it, if we get stuck later
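The Pascal '.' case above can be handled with one character of look-ahead plus the longest-possible-lexeme rule. A minimal sketch in C, assuming a simplified token set (TOK_INT, TOK_REAL, TOK_DOTDOT are hypothetical names introduced here for illustration):

```c
#include <ctype.h>

/* Hypothetical token codes for this sketch */
enum { TOK_INT, TOK_REAL, TOK_DOTDOT, TOK_EOF };

/* Scan one token from src starting at *pos, applying the
   longest-possible-lexeme rule: after a digit string and a '.',
   peek one character further to decide between a real literal
   (3.14) and an integer followed by the subrange operator (3..14).
   Advances *pos past the recognized lexeme. */
static int next_token(const char *src, int *pos) {
    int i = *pos;
    if (src[i] == '\0') return TOK_EOF;
    if (src[i] == '.' && src[i + 1] == '.') {
        *pos = i + 2;
        return TOK_DOTDOT;
    }
    while (isdigit((unsigned char)src[i])) i++;
    if (src[i] == '.' && isdigit((unsigned char)src[i + 1])) {
        /* proceed: a digit follows the '.', so this is a real */
        i++;
        while (isdigit((unsigned char)src[i])) i++;
        *pos = i;
        return TOK_REAL;
    }
    /* back up: the '.' starts the next token (e.g. '..') */
    *pos = i;
    return TOK_INT;
}
```

On "3.14" the scanner consumes one TOK_REAL; on "3..14" it must stop after the 3, saving the '.' as the start of the next token.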
State Diagram Design
• Suppose we need a lexical analyzer that only recognizes program names, reserved words, and integer literals
• A naïve state diagram would have a transition from every state on every character in the source language; such a diagram would be very large!
State Diagram Design (cont.)
• In many cases, transitions can be combined to simplify the state diagram
– When recognizing an identifier, all uppercase and lowercase letters are equivalent, so a single character class can be used for all of them
State Diagram Design (cont.)
• Convenient utility subprograms:
– getChar - gets the next character of input, puts
it in global variable nextChar, determines its
class and puts the class in global variable
charClass
– addChar - appends the character in nextChar to lexeme, the global variable in which the lexeme is being accumulated
– lookup - determines whether the string in
lexeme is a reserved word (returns a code)
State Diagram (figure)
Lexical Analysis - Implementation
(code listing for the lex() function)
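The slides show a switch-based lex() built on the utility subprograms getChar, addChar, and lookup. A self-contained sketch of such a function, reading from a string rather than a file (the token codes and the string-based input setup are assumptions for illustration):

```c
#include <ctype.h>
#include <string.h>

/* Character classes */
#define LETTER    0
#define DIGIT     1
#define UNKNOWN  99

/* Token codes (assumed for illustration) */
#define INT_LIT   10
#define IDENT     11
#define EOF_TOKEN -1

static const char *input;   /* source text */
static int inPos;           /* position of the next character */
static char lexeme[100];
static int lexLen;
static int charClass;
static char nextChar;

/* getChar - get the next character of input and determine its class */
static void getChar(void) {
    nextChar = input[inPos];
    if (nextChar != '\0') {
        inPos++;
        if (isalpha((unsigned char)nextChar)) charClass = LETTER;
        else if (isdigit((unsigned char)nextChar)) charClass = DIGIT;
        else charClass = UNKNOWN;
    } else
        charClass = EOF_TOKEN;
}

/* addChar - append nextChar to the lexeme being accumulated */
static void addChar(void) {
    lexeme[lexLen++] = nextChar;
    lexeme[lexLen] = '\0';
}

/* lex - a simple lexical analyzer for identifiers and integers */
static int lex(void) {
    lexLen = 0;
    while (charClass != EOF_TOKEN && isspace((unsigned char)nextChar))
        getChar();                        /* skip white space */
    switch (charClass) {
    case LETTER:                          /* identifiers */
        addChar(); getChar();
        while (charClass == LETTER || charClass == DIGIT) {
            addChar(); getChar();
        }
        return IDENT;
    case DIGIT:                           /* integer literals */
        addChar(); getChar();
        while (charClass == DIGIT) {
            addChar(); getChar();
        }
        return INT_LIT;
    case EOF_TOKEN:
        strcpy(lexeme, "EOF");
        return EOF_TOKEN;
    default:                              /* single-character tokens */
        addChar(); getChar();
        return UNKNOWN;
    }
}
```

The parser calls lex() each time it needs the next token; lex() leaves the matched characters in lexeme and returns the token code.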
A part of a Pascal scanner
• We read the characters one at a time, with look-ahead
• If it is one of the one-character tokens
{ ( ) [ ] < > , ; = + - }
we announce that token
• If it is a '.', we look at the next character
– If that is a dot, we announce '..'
– Otherwise, we announce '.' and reuse the look-ahead
A part of a Pascal scanner (cont.)
• If it is a '<', we look at the next character
– if that is an '=', we announce '<='
– otherwise, we announce '<' and reuse the look-ahead
• If it is a digit, we keep reading until we find a non-digit
– if that is not a '.', we announce an integer
– otherwise, we keep looking for a real number
– if the character after the '.' is not a digit, we announce an integer and reuse the '.' and the look-ahead
State Diagram (figure)
we skip any initial white space (spaces, tabs, and newlines)
we read the next character
if it is a ( we look at the next character
    if that is a * we have a comment;
        we skip forward through the terminating *)
    otherwise
        we return a left parenthesis and reuse the look-ahead
if it is one of the one-character tokens ([ ] , ; = + - etc.)
    we return that token
if it is a . we look at the next character
    if that is a . we return ..
    otherwise we return . and reuse the look-ahead
if it is a < we look at the next character
    if that is a = we return <=
    otherwise we return < and reuse the look-ahead
etc.
if it is a letter we keep reading letters and digits
    and maybe underscores until we can't anymore;
    then we check to see if it is a keyword
    if so we return the keyword
    otherwise we return an identifier
    in either case we reuse the character beyond the end of the token
if it is a digit we keep reading until we find a nondigit
    if that is not a . we return an integer and reuse the nondigit
    otherwise we keep looking for a real number
        if the character after the . is not a digit
            we return an integer and reuse the . and the look-ahead
etc.
The Parsing Problem
• Goals of the parser, given an input program:
– Find all syntax errors; for each, produce an
appropriate diagnostic message, and recover
quickly
– Produce the parse tree, or at least a trace of the parse tree, for the program
The Parsing Problem (cont.)
• Two categories of parsers:
– Top down - produces the parse tree, beginning at the root
• Order is that of a leftmost derivation
• Traces the parse tree in preorder
– Bottom up - produces the parse tree, beginning at the leaves
• Order is that of the reverse of a rightmost derivation
• Parsers usually look only one token ahead in the input
The Set of Notational Conventions
• Terminal symbols - lowercase letters at the beginning of the alphabet (a, b, …)
• Nonterminal symbols - uppercase letters at the beginning of the alphabet (A, B, …)
• Terminals or nonterminals - uppercase letters at the end of the alphabet (W, X, Y, Z)
• Strings of terminals - lowercase letters at the end of the alphabet (w, x, y, z)
• Mixed strings of terminals and/or nonterminals - lowercase Greek letters (α, β, γ, δ)
The Parsing Problem (cont.)
• Top-down Parsers
– Given a sentential form, xAα, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A
• The most common top-down parsing algorithms:
– Recursive descent - a coded implementation
– LL parsers - a table-driven implementation (the first L stands for left-to-right scan, the second L for leftmost derivation)
The Parsing Problem (cont.)
• Bottom-up parsers
– Given a right sentential form, α, determine what substring of α is the RHS of the rule in the grammar that must be reduced to produce the previous sentential form in the rightmost derivation
– The most common bottom-up parsing algorithms are in the LR family (L stands for left-to-right scan, R for rightmost derivation)
• Consider the following grammar for a comma-separated list of identifiers, terminated by a semicolon:
id_list → id id_list_tail
id_list_tail → , id id_list_tail
id_list_tail → ;
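A recursive-descent parser for this grammar has one subprogram per nonterminal. A minimal sketch in C, where the token stream is simulated as a character string ('i' standing for an id token) — an assumption made here purely for illustration:

```c
/* Tokens are simulated as single characters:
   'i' = id, ',' and ';' stand for themselves. */
static const char *toks;   /* token stream */
static int cur;            /* index of the next token */
static int ok;             /* cleared on a syntax error */

/* Compare the next token with the expected terminal */
static void match(char expected) {
    if (ok && toks[cur] == expected) cur++;
    else ok = 0;
}

/* id_list_tail -> , id id_list_tail | ; */
static void id_list_tail(void) {
    if (!ok) return;
    if (toks[cur] == ',') {       /* choose the RHS by look-ahead */
        match(',');
        match('i');
        id_list_tail();
    } else
        match(';');
}

/* id_list -> id id_list_tail */
static void id_list(void) {
    match('i');
    id_list_tail();
}

/* Parse a token string; returns 1 iff it is a legal id_list. */
static int parse(const char *tokens) {
    toks = tokens; cur = 0; ok = 1;
    id_list();
    return ok && toks[cur] == '\0';
}
```

Note how id_list_tail() decides between its two RHSs by inspecting the single look-ahead token, exactly the LL property discussed later.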
Recursive-Descent Parsing
• Recursive-Descent Process
– There is a subprogram for each nonterminal in the grammar, which can parse sentences that can be generated by that nonterminal
– EBNF is ideally suited as the basis for a recursive-descent parser, because EBNF minimizes the number of nonterminals
• A grammar for simple expressions:
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
Recursive-Descent Parsing (cont.)
• Assume we have a lexical analyzer named lex, which puts the next token code in nextToken
• The coding process when there is only one RHS:
– For each terminal symbol in the RHS, compare it with the next input token; if they match,
continue, else there is an error
– For each nonterminal symbol in the RHS, call its associated parsing subprogram
Function expr()
/* expr
   Parses strings in the language generated by the rule:
   <expr> → <term> {(+ | -) <term>} */
void expr() {
  /* Parse the first term */
  term();
  /* As long as the next token is + or -, call lex() to get the
     next token, then parse the next term */
  while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
    lex();
    term();
  }
} /* End of function expr */
Recursive-Descent Parsing (cont.)
• A nonterminal that has more than one RHS
requires an initial process to determine which
RHS it is to parse
– The correct RHS is chosen on the basis of the next token of input
– The next token is compared with the first token
that can be generated by each RHS until a match is found
– If no match is found, it is a syntax error
Function factor() (cont.)
/* If the RHS is (<expr>), call lex() to pass over the left parenthesis, call expr(), and check for the right parenthesis */
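Putting the pieces together, a complete recursive-descent parser for the simple-expression grammar can be sketched as follows. The token codes and the array-based lex() stub are assumptions for illustration; a real front end would use the lexical analyzer described earlier:

```c
/* Assumed token codes for illustration */
enum { IDENT_CODE, PLUS_CODE, MINUS_CODE, MULT_CODE, DIV_CODE,
       LEFT_PAREN_CODE, RIGHT_PAREN_CODE, END_CODE };

static const int *tokens;   /* simulated token stream */
static int pos;
static int nextToken;
static int errorCount;

/* lex - stub that steps through a pre-tokenized array */
static void lex(void) { nextToken = tokens[pos++]; }

static void error(void) { errorCount++; }

static void expr(void);

/* factor:  <factor> -> id | ( <expr> ) */
static void factor(void) {
    if (nextToken == IDENT_CODE)
        lex();                          /* the RHS is id */
    else if (nextToken == LEFT_PAREN_CODE) {
        lex();                          /* pass over the '(' */
        expr();
        if (nextToken == RIGHT_PAREN_CODE)
            lex();                      /* pass over the ')' */
        else
            error();
    } else
        error();                        /* no RHS matches */
}

/* term:  <term> -> <factor> {(* | /) <factor>} */
static void term(void) {
    factor();
    while (nextToken == MULT_CODE || nextToken == DIV_CODE) {
        lex();
        factor();
    }
}

/* expr:  <expr> -> <term> {(+ | -) <term>} */
static void expr(void) {
    term();
    while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
        lex();
        term();
    }
}

/* Parse a token array terminated by END_CODE;
   returns 1 iff it is a legal <expr>. */
static int parse(const int *toks) {
    tokens = toks; pos = 0; errorCount = 0;
    lex();
    expr();
    return errorCount == 0 && nextToken == END_CODE;
}
```

The structure of each subprogram mirrors its grammar rule, which is what makes recursive descent easy to write and maintain.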
The LL Grammar Class
• The Left Recursion Problem: If a grammar has left recursion, either direct or indirect, it cannot be the basis for a top-down parser
– A grammar can be modified to remove left recursion
• Example: consider the following rule
A → A + B
– A recursive-descent parser subprogram for A immediately calls itself to parse the first symbol of its RHS, so it recurses forever without consuming any input
Pairwise Disjointness Test
• The other characteristic of grammars that disallows top-down parsing is the lack of pairwise disjointness
– The inability to determine the correct RHS on the basis of one token of lookahead
– FIRST(α) = {a | α ⇒* aβ} (if α ⇒* ε, then ε ∈ FIRST(α))
• Pairwise Disjointness Test
– For each nonterminal, A, in the grammar that has more than one RHS, and for each pair of rules, A → αi and A → αj, it must be true that FIRST(αi) ∩ FIRST(αj) = ∅
– For example, A → aB | bAb passes the test, while A → aB | aAb fails, because both RHSs begin with a
Left factoring
• Left factoring can resolve the lack of pairwise disjointness
• Example: consider the rules
<variable> → identifier | identifier [<expression>]
– The two rules can be replaced by
<variable> → identifier <new>
<new> → ε | [<expression>]
– or
<variable> → identifier [ [<expression>] ]
(the outer brackets are metasymbols of EBNF, indicating an optional part)
Bottom-up Parsing
• The process of bottom-up parsing produces the reverse of a rightmost derivation
• A bottom-up parser starts with the input
sentence and produces the sequence of
sentential forms from there until all that remains
is the start symbol
• In each step, the task of the bottom-up parser is finding the correct RHS in a right sentential form
to reduce to get the previous right sentential
form in the derivation
• The sentential form E + T * id includes three RHSs: E + T, T, and id. Only one of these is the correct one to be rewritten
– If the RHS E + T were chosen to be rewritten in this sentential form, the resulting sentential form would be E * id, but E * id is not a legal right sentential form for the grammar
Example: Parse Tree of the Sentential Form E + T * id
• The phrases of the sentential form E + T * id are E + T * id, T * id, and id
• The only simple phrase is id
• The handle of a rightmost sentential form is the leftmost simple phrase
(figure: parse tree with nodes E, T, and F)
Example: Consider the string (8)
LR Parsers
• Many different bottom-up parsing algorithms have been devised; most of these are variations of a process called LR parsing
– L means it scans the input string left to right; R means it produces a rightmost derivation
• The original LR algorithm was designed by Donald Knuth (1965); this algorithm is sometimes called canonical LR
Advantages of LR parsers
• They will work for nearly all grammars that
describe programming languages
• They work on a larger class of grammars than
other bottom-up algorithms, but are as efficient
as any other bottom-up parser
• They can detect syntax errors as soon as it is possible on a left-to-right scan
• The LR class of grammars is a superset of the
class parsable by LL parsers
Configurations
• The contents of the parse stack for an LR parser have the following form:
S0 X1 S1 … Xm Sm      (top of stack at the right)
where the Si are state symbols and the Xi are grammar symbols
• An LR parsing table has two parts:
– The ACTION part has state symbols as its row labels and terminal symbols as its column labels
– The GOTO part has state symbols as its row labels and nonterminal symbols as its column labels
Configurations (cont.)
• The input string has a '$' at its right end, used for normal termination of the parser
• An LR parser configuration is a pair of strings (stack, input), with the detailed form
(S0 X1 S1 … Xm Sm, ai ai+1 … an $)
• The initial configuration of an LR parser is
(S0, a1 a2 … an $)
The Parser Actions (table: Shift, Reduce, Accept, and Error)
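The four actions drive a simple table-driven loop. A minimal sketch for the toy grammar E → E + n | n; the ACTION and GOTO tables below were constructed by hand for this toy grammar (they are not the tables for the arithmetic-expression grammar used elsewhere in these slides):

```c
/* Toy grammar:   rule 1: E -> E + n      rule 2: E -> n
   Terminal indexes: 0 = n, 1 = +, 2 = $ (end marker).
   ACTION encoding: 10+K = shift to state K, -K = reduce by rule K,
   99 = accept, 0 = error. */
enum { T_n, T_plus, T_end };
static const int ACTION[5][3] = {
    /*         n      +      $   */
    /* 0 */ { 12,     0,     0  },
    /* 1 */ {  0,    13,    99  },
    /* 2 */ {  0,    -2,    -2  },
    /* 3 */ { 14,     0,     0  },
    /* 4 */ {  0,    -1,    -1  },
};
static const int GOTO_E[5] = { 1, 0, 0, 0, 0 };  /* GOTO[state][E] */
static const int RHS_LEN[3] = { 0, 3, 1 };       /* RHS lengths of rules */

/* Returns 1 iff the token string (ending in T_end) is accepted. */
static int lr_parse(const int *input) {
    int stack[100];      /* stack of state symbols */
    int top = 0;
    stack[0] = 0;        /* initial configuration: (S0, input) */
    int i = 0;
    for (;;) {
        int act = ACTION[stack[top]][input[i]];
        if (act >= 10 && act < 99) {      /* Shift */
            stack[++top] = act - 10;
            i++;
        } else if (act < 0) {             /* Reduce by rule -act */
            top -= RHS_LEN[-act];         /* pop the RHS states */
            stack[top + 1] = GOTO_E[stack[top]];
            top++;                        /* push the GOTO state */
        } else if (act == 99) {
            return 1;                     /* Accept */
        } else {
            return 0;                     /* Error */
        }
    }
}
```

Shift pushes the next state and advances the input; Reduce pops one state per RHS symbol and pushes the GOTO entry for the LHS nonterminal, which is exactly the reverse of a rightmost derivation step.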
Example: The Grammar for Arithmetic Expressions
1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → ( E )
6. F → id
A Trace of a Parse of the String id + id * id (table)
Summary
• The parser:
– Detects syntax errors
– Produces a parse tree
• A recursive-descent parser is an LL parser
• The parsing problem for bottom-up parsers: find the substring of the current sentential form that must be reduced
• The LR family of shift-reduce parsers is the most common bottom-up parsing approach