
Slide 1

Chapter 4

Lexical and Syntax Analysis

Slide 3

• Language implementation systems must analyze source code, regardless of the specific implementation approach: compilation, pure interpretation, or a hybrid method

• Nearly all syntax analysis is based on a formal description of the syntax of the source language (CFG or BNF)

Slide 4

Using BNF to Describe Syntax

• Provides a clear and concise syntax description

• The parser can be based directly on the BNF

• Parsers based on BNF are easy to maintain

Slide 5

Syntax Analysis

• The syntax analysis portion of a language processor nearly always consists of two parts:

– A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar)

– A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)

Slide 6

Reasons to Separate Lexical and Syntax Analysis

• Simplicity - less complex approaches can be used for lexical analysis; separating the two simplifies the parser

• Efficiency - separation allows optimization of the lexical analyzer

• Portability - parts of the lexical analyzer may not be portable, but the parser is always portable

Slide 7

Lexical Analysis

• A lexical analyzer is a pattern matcher for character strings

• A lexical analyzer is a "front end" for the parser

• It identifies substrings of the source program that belong together - lexemes

– Lexemes match a character pattern, which is associated with a lexical category called a token

– sum is a lexeme; its token may be IDENT
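As an added illustration of the lexeme/token distinction (not part of the original slides; the token names are hypothetical), a scanner might break the C statement sum = oldsum - value / 100; into the following lexeme/token pairs:

Lexeme    Token
sum       IDENT
=         ASSIGN_OP
oldsum    IDENT
-         SUB_OP
value     IDENT
/         DIV_OP
100       INT_LIT
;         SEMICOLON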

Slide 9

Lexical Analysis (cont.)

• The lexical analyzer is usually a function that is called by the parser when it needs the next token

• The lexical analysis process also:

– Includes skipping comments, tabs, newlines, and blanks

– Inserts lexemes for user-defined names (strings, identifiers, numbers) into the symbol table

– Saves source locations (file, line, column) for error messages

– Detects and reports lexical errors in tokens, such as invalid characters

Slide 10

Lexical Analysis (cont.)

• Three main approaches to building a scanner:

1. Write a formal description of the tokens and use a software tool that constructs lexical analyzers given such a description

2. Design a state diagram that describes the token patterns and write a program that implements the diagram

3. Design a state diagram that describes the token patterns and hand-construct a table-driven implementation of the state diagram

Slide 11

The "longest possible lexeme" rule

• The scanner returns to the parser only when the next character cannot be used to continue the current token

– The next character will generally need to be saved for the next token

• In some cases, you may need to peek at more than one character of look-ahead in order to know whether to proceed

– In Pascal, when you have a 3 and you see a '.'

• do you proceed (in hopes of getting 3.14)? or do you stop (in case the '.' begins a '..' range, as in 3..5)?
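A minimal C sketch (added here, not from the slides) of how a scanner can make this decision with two characters of look-ahead; the names src, lexeme, and the token enum are assumptions of the sketch:

#include <ctype.h>

static const char *src;        /* current position in the input; must be set by the caller */
static char lexeme[64];        /* characters of the token being accumulated */

enum number_token { INT_LIT_TOK, REAL_LIT_TOK };

/* Scan a Pascal-style numeric literal starting at a digit. */
enum number_token scan_number(void) {
    int n = 0;
    while (isdigit((unsigned char)src[0]) && n < 63)
        lexeme[n++] = *src++;                     /* consume the integer part */
    if (src[0] == '.' && isdigit((unsigned char)src[1])) {
        lexeme[n++] = *src++;                     /* the '.' belongs to a real literal */
        while (isdigit((unsigned char)src[0]) && n < 63)
            lexeme[n++] = *src++;                 /* e.g. 3.14 */
        lexeme[n] = '\0';
        return REAL_LIT_TOK;
    }
    lexeme[n] = '\0';                             /* the '.' (if any) stays in the input: */
    return INT_LIT_TOK;                           /* it may begin '..', as in 3..5        */
}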

Slide 12

• In messier cases, you may not be able to get by with any fixed amount of look-ahead. In Fortran, for example, we have

DO 5 I = 1,25   (a loop)
DO 5 I = 1.25   (an assignment)

• Here, we need to remember we were in a potentially final state, and save enough information so that we can back up to it if we get stuck later

The rule …

Slide 13

State Diagram Design

• Suppose we need a lexical analyzer that only recognizes program names, reserved words, and integer literals

• A naïve state diagram would have a transition from every state on every character in the source language - such a diagram would be very large!

Slide 14

State Diagram Design (cont.)

• In many cases, transitions can be combined to simplify the state diagram

– When recognizing an identifier, all uppercase and lowercase letters are equivalent - use a character class that includes them all

Slide 15

State Diagram Design (cont.)

• Convenient utility subprograms:

– getChar - gets the next character of input, puts it in the global variable nextChar, determines its class and puts the class in the global variable charClass

– addChar - puts the character from nextChar into the place where the lexeme (a global variable) is being accumulated

– lookup - determines whether the string in lexeme is a reserved word (returns a code); a sketch of these routines in C follows below
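A minimal C sketch of the utilities just listed, following the slides' names (charClass, nextChar, lexeme) but with bodies, character-class constants, and a reserved-word list that are assumptions of this sketch:

#include <stdio.h>
#include <ctype.h>
#include <string.h>

/* Character classes (constant names assumed here) */
#define LETTER     0
#define DIGIT      1
#define UNKNOWN   99
#define EOF_CLASS -1

static int   charClass;       /* class of the most recent character       */
static char  nextChar;        /* the most recently read character         */
static char  lexeme[100];     /* the lexeme being accumulated             */
static int   lexLen;          /* current length of lexeme                 */
static FILE *in_fp;           /* input file; must be opened by the caller */

/* getChar - get the next character of input and determine its class */
static void getChar(void) {
    int c = getc(in_fp);
    if (c == EOF) { charClass = EOF_CLASS; return; }
    nextChar = (char)c;
    if (isalpha(c))      charClass = LETTER;
    else if (isdigit(c)) charClass = DIGIT;
    else                 charClass = UNKNOWN;
}

/* addChar - append nextChar to the lexeme being built */
static void addChar(void) {
    if (lexLen < (int)sizeof(lexeme) - 1) {
        lexeme[lexLen++] = nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* lookup - return a reserved-word code, or IDENT if the lexeme is not one
   (the word list and the code values are invented for illustration)      */
#define IDENT    10
#define FOR_CODE 20
#define IF_CODE  21
static int lookup(const char *s) {
    if (strcmp(s, "for") == 0) return FOR_CODE;
    if (strcmp(s, "if")  == 0) return IF_CODE;
    return IDENT;
}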

Slide 16

State Diagram

Slide 17

Lexical Analysis - Implementation

break;

Slide 18

Lexical Analysis - Implementation

} /* End of switch */

} /* End of function lex() */
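Only the closing lines of the slides' lex() example survive above. Continuing the sketch of getChar/addChar/lookup given earlier, a lex() function in that spirit might look like the following; the token codes and control flow are assumptions of the sketch, not the slides' exact code:

#define INT_LIT       11
#define LEFT_PAREN    25
#define RIGHT_PAREN   26
#define OTHER_TOKEN    0
#define EOF_TOKEN     -1

static int nextToken;     /* token code handed to the parser */

/* lex - a simple state-diagram-driven lexical analyzer
   (assumes getChar() has been called once to prime nextChar) */
int lex(void) {
    lexLen = 0;
    while (charClass == UNKNOWN && isspace((unsigned char)nextChar))
        getChar();                                  /* skip white space */
    switch (charClass) {
    case LETTER:                                    /* identifiers and reserved words */
        addChar(); getChar();
        while (charClass == LETTER || charClass == DIGIT) { addChar(); getChar(); }
        nextToken = lookup(lexeme);
        break;
    case DIGIT:                                     /* integer literals */
        addChar(); getChar();
        while (charClass == DIGIT) { addChar(); getChar(); }
        nextToken = INT_LIT;
        break;
    case UNKNOWN:                                   /* punctuation and operators */
        if (nextChar == '(')      nextToken = LEFT_PAREN;
        else if (nextChar == ')') nextToken = RIGHT_PAREN;
        else                      nextToken = OTHER_TOKEN;
        addChar(); getChar();
        break;
    case EOF_CLASS:
        nextToken = EOF_TOKEN;
        break;
    } /* End of switch */
    return nextToken;
} /* End of function lex() */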

Slide 19

A part of a Pascal scanner

• We read the characters one at a time with look-ahead

• If it is one of the one-character tokens

{ ( ) [ ] < > , ; = + - }

we announce that token

• If it is a '.', we look at the next character

– If that is a dot, we announce '..'

– Otherwise, we announce '.' and reuse the look-ahead

Slide 20

A part of a Pascal scanner (cont.)

• If it is a '<', we look at the next character

– If that is a '=', we announce '<='

– Otherwise, we announce '<' and reuse the look-ahead

• If it is a digit, we keep reading until we find a non-digit

– if that is not a '.' we announce an integer

– otherwise, we keep looking for a real number

– if the character after the '.' is not a digit, we announce an integer and reuse the '.' and the look-ahead

Slide 21

State Diagram

Slide 22

we skip any initial white space (spaces, tabs, and newlines)

we read the next character

if it is a ( we look at the next character

    if that is a * we have a comment;
    we skip forward through the terminating *)

    otherwise
    we return a left parenthesis and reuse the look-ahead

if it is one of the one-character tokens ([ ] , ; = + - etc.)
    we return that token

if it is a . we look at the next character

    if that is a . we return ..

    otherwise we return . and reuse the look-ahead

if it is a < we look at the next character

    if that is a = we return <=

    otherwise we return < and reuse the look-ahead

etc.

Slide 23

if it is a letter we keep reading letters and digits
    (and maybe underscores) until we can't anymore;
    then we check to see if it is a keyword

    if so we return the keyword

    otherwise we return an identifier

    in either case we reuse the character beyond the end of the token

if it is a digit we keep reading until we find a nondigit

    if that is not a . we return an integer and reuse the nondigit

    otherwise we keep looking for a real number

        if the character after the . is not a digit
        we return an integer and reuse the . and the look-ahead

etc.
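A small C sketch (added here, with invented names) of the two-character decisions in the pseudocode above - comment versus left parenthesis, '..' versus '.', '<=' versus '<' - using ungetc for the reused look-ahead:

#include <stdio.h>

enum punct_token { COMMENT_SKIPPED = 0, LPAREN_TOK, DOT_TOK, DOTDOT_TOK, LT_TOK, LE_TOK };

/* Skip a Pascal comment "(* ... *)" whose "(*" has already been consumed. */
static void skip_comment(FILE *fp) {
    int prev = 0, c;
    while ((c = getc(fp)) != EOF) {
        if (prev == '*' && c == ')') return;
        prev = c;
    }
}

/* Decide what to announce for '(', '.', or '<' using one character of
   look-ahead, pushing the character back when it is not consumed.      */
static int scan_punct(FILE *fp, int c) {
    int next = getc(fp);
    if (c == '(' && next == '*') { skip_comment(fp); return COMMENT_SKIPPED; }
    if (c == '.' && next == '.') return DOTDOT_TOK;   /* '..' */
    if (c == '<' && next == '=') return LE_TOK;       /* '<=' */
    ungetc(next, fp);                                  /* reuse the look-ahead */
    if (c == '(') return LPAREN_TOK;
    if (c == '.') return DOT_TOK;
    return LT_TOK;                                     /* c was '<' */
}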

Slide 24

The Parsing Problem

• Goals of the parser, given an input program:

– Find all syntax errors; for each, produce an appropriate diagnostic message, and recover quickly

– Produce the parse tree, or at least a trace of the parse tree, for the program

Slide 25

The Parsing Problem (cont.)

• Two categories of parsers

– Top down - produce the parse tree, beginning at the root

• Order is that of a leftmost derivation

• Traces the parse tree in preorder

– Bottom up - produce the parse tree, beginning at the leaves

• Order is that of the reverse of a rightmost derivation

• Parsers usually look only one token ahead in the input

Slide 26

The Set of Notational Conventions

• Terminal symbols - lowercase letters at the beginning of the alphabet (a, b, ...)

• Nonterminal symbols - uppercase letters at the beginning of the alphabet (A, B, ...)

• Terminals or nonterminals - uppercase letters at the end of the alphabet (W, X, Y, Z)

• Strings of terminals - lowercase letters at the end of the alphabet (w, x, y, z)

• Mixed strings (terminals and/or nonterminals) - lowercase Greek letters (α, β, γ, δ)

Slide 27

The Parsing Problem (cont.)

• Top-down Parsers

– Given a sentential form, xAα, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A

• The most common top-down parsing algorithms

– Recursive descent - a coded implementation

– LL parsers - a table-driven implementation (the 1st L stands for left-to-right scanning, the 2nd L stands for leftmost derivation)

Slide 28

The Parsing Problem (cont.)

• Bottom-up parsers

– Given a right sentential form, α, determine what substring of α is the RHS of the rule in the grammar that must be reduced to produce the previous sentential form (in the rightmost derivation)

– The most common bottom-up parsing algorithms are in the LR family

• L stands for left-to-right, R stands for rightmost derivation

Slide 29

• Consider the following grammar for a comma-separated list of identifiers, terminated by a semicolon

id_list → id id_list_tail

id_list_tail → , id id_list_tail

id_list_tail → ;

Slide 31

Recursive-Descent Parsing

• Recursive-Descent Process

– There is a subprogram for each nonterminal in the grammar, which can parse sentences that can be generated by that nonterminal

– EBNF is ideally suited to be the basis for a recursive-descent parser, because EBNF minimizes the number of nonterminals

• A grammar for simple expressions:

<expr> → <term> {(+ | -) <term>}

<term> → <factor> {(* | /) <factor>}

Slide 32

Recursive-Descent Parsing (cont.)

• Assume we have a lexical analyzer named lex, which puts the next token code in nextToken

• The coding process when there is only one RHS:

– For each terminal symbol in the RHS, compare it with the next input token; if they match, continue, else there is an error

– For each nonterminal symbol in the RHS, call its associated parsing subprogram

Slide 33

Function expr()

/* Function expr()
   Parses strings in the language generated by the rule:
   <expr> → <term> {(+ | -) <term>} */
void expr() {
  /* Parse the first term */
  term();

Slide 34

Function expr() (cont.)

  /* As long as the next token is + or -, call
     lex() to get the next token, and parse the
     next term */
  while (nextToken == PLUS_CODE ||
         nextToken == MINUS_CODE) {
    lex();
    term();
  }
}
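The slides do not show the matching term() function; a sketch in the same style follows, where TIMES_CODE and DIV_CODE are assumed token codes named by analogy with PLUS_CODE and MINUS_CODE:

/* Function term()
   Parses strings in the language generated by the rule:
   <term> → <factor> {(* | /) <factor>}
   (TIMES_CODE and DIV_CODE are assumed names) */
void term() {
  /* Parse the first factor */
  factor();
  /* As long as the next token is * or /, call lex() to get
     the next token, and parse the next factor */
  while (nextToken == TIMES_CODE || nextToken == DIV_CODE) {
    lex();
    factor();
  }
}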

Slide 35

Recursive-Descent Parsing (cont.)

• A nonterminal that has more than one RHS requires an initial process to determine which RHS it is to parse

– The correct RHS is chosen on the basis of the next token of input

– The next token is compared with the first token that can be generated by each RHS until a match is found

– If no match is found, it is a syntax error

Slide 37

Function factor() (cont.)

/* If the RHS is (<expr>) - call lex() to pass over the left parenthesis,
   call expr(), and check for the right parenthesis */
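A sketch of what the complete factor() function might look like, assuming the rule <factor> → id | (<expr>) and hypothetical names IDENT_CODE, LEFT_PAREN_CODE, RIGHT_PAREN_CODE, and error():

void factor() {
  /* If the RHS is an identifier, just pass over it */
  if (nextToken == IDENT_CODE) {
    lex();
  }
  /* If the RHS is (<expr>) - pass over the left parenthesis,
     call expr(), and check for the right parenthesis */
  else if (nextToken == LEFT_PAREN_CODE) {
    lex();
    expr();
    if (nextToken == RIGHT_PAREN_CODE)
      lex();
    else
      error();            /* missing right parenthesis */
  }
  /* Neither an identifier nor a left parenthesis: syntax error */
  else
    error();
}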

Slide 38

The LL Grammar Class

• The Left Recursion Problem: if a grammar has left recursion, either direct or indirect, it cannot be the basis for a top-down parser

– A grammar can be modified to remove left recursion

• Example: consider the following rule

A → A + B

– A recursive-descent parser subprogram for A immediately calls itself to parse the first symbol in its RHS …
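As an illustration (added here) of the standard transformation for removing direct left recursion, the pair of rules A → A + B | B can be replaced by

A → B A'
A' → + B A' | ε

where the new nonterminal A' generates the repeated "+ B" tails and ε ends the recursion, so a recursive-descent subprogram for A no longer begins by calling itself.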

Slide 39

Pairwise Disjointness Test

• The other characteristic of grammars that disallows top-down parsing is the lack of pairwise disjointness

– The inability to determine the correct RHS on the basis of one token of look-ahead

– FIRST(α) = {a | α ⇒* aβ} (if α ⇒* ε, then ε ∈ FIRST(α))

• Pairwise Disjointness Test

– For each nonterminal, A, in the grammar that has more than one RHS, and for each pair of rules, A → αi and A → αj, it must be true that:

FIRST(αi) ∩ FIRST(αj) = ∅
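For example (an illustration added here, not from the slide): with the rules A → aB | bAb | Bb and B → cB | d, we get FIRST(aB) = {a}, FIRST(bAb) = {b}, and FIRST(Bb) = {c, d}, which are pairwise disjoint, so one token of look-ahead suffices to pick the A-rule. A pair such as A → a | aB fails the test, since both RHSs have FIRST = {a}.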

Slide 41

Left factoring

• This process can resolve the lack of pairwise disjointness

• Example: consider the rules

<variable> → identifier | identifier [<expression>]

– The two rules can be replaced by

<variable> → identifier <new>

<new> → ε | [<expression>]

– or

<variable> → identifier [[<expression>]]

(the outer brackets are metasymbols of EBNF, marking an optional part)

Slide 42

Bottom-up Parsing

• The process of bottom-up parsing produces the reverse of a rightmost derivation

• A bottom-up parser starts with the input sentence and produces the sequence of sentential forms from there until all that remains is the start symbol

• In each step, the task of the bottom-up parser is to find the correct RHS in a right sentential form to reduce, in order to get the previous right sentential form in the derivation

Slide 43

• The sentential form E + T * id includes three RHSs: E + T, T, and id. Only one of these is the correct one to be rewritten

– If the RHS E + T were chosen to be rewritten in this sentential form, the resulting sentential form would be E * id. But E * id is not a legal right sentential form for this grammar
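This example (and the trace on slide 56) presumably uses the usual unambiguous expression grammar, which does not survive in the extracted slides; it is reproduced here for reference:

E → E + T | T
T → T * F | F
F → (E) | id

Under this grammar the handle of E + T * id is id, which is reduced to F.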

Slide 45

Example: Parse Tree of the Sentential Form E + T * id

• The phrases of the sentential form E + T * id are E + T * id, T * id, and id

• The only simple phrase is id

• The handle of a rightmost sentential form is the leftmost simple phrase


Slide 46

Example: Consider the string

(8)

Slide 48

LR Parsers

• Many different bottom-up parsing algorithms have been devised. Most of these are variations of a process called LR parsing

– L means it scans the input string left to right and R means it produces a rightmost derivation

• The original LR algorithm was designed by Donald Knuth (1965). This algorithm is sometimes called canonical LR

Slide 49

Advantages of LR parsers

• They will work for nearly all grammars that describe programming languages

• They work on a larger class of grammars than other bottom-up algorithms, but are as efficient as any other bottom-up parser

• They can detect syntax errors as soon as possible

• The LR class of grammars is a superset of the class parsable by LL parsers

Slide 51

Configurations

• The contents of the parse stack for an LR parser have the following form:

S0 X1 S1 … Xm Sm   (Sm is the top of the stack)

where the Si are state symbols and the Xi are grammar symbols

• An LR parsing table has two parts:

– The ACTION part has state symbols as its row labels and the terminal symbols as its column labels

– The GOTO part has state symbols as its row labels and the nonterminal symbols as its column labels

Slide 52

Configurations (cont.)

• The input string has a '$' at its right end. It is used for normal termination of the parser

• An LR parser configuration is a pair of strings (stack, input), with the detailed form

(S0 X1 S1 … Xm Sm, ai ai+1 … an $)

• The initial configuration of an LR parser is

(S0, a1 a2 … an $)
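A minimal C sketch (added here, not from the slides) of the shift-reduce driver loop that consumes such configurations; the Action type and the action(), goTo(), rhs_len(), and lhs() helpers are hypothetical stand-ins for the ACTION and GOTO parts of the table, and only states are kept on the stack (the Xi are left implicit):

typedef enum { SHIFT, REDUCE, ACCEPT, ERROR } Kind;
typedef struct { Kind kind; int arg; } Action;    /* arg: target state or rule number */

extern Action action(int state, int terminal);    /* ACTION[state, terminal]           */
extern int    goTo(int state, int nonterminal);   /* GOTO[state, nonterminal]          */
extern int    rhs_len(int rule);                  /* number of symbols in the rule RHS */
extern int    lhs(int rule);                      /* nonterminal on the rule LHS       */

int lr_parse(const int *input) {      /* input: token codes ending with the '$' code */
    int stack[1000], top = 0;
    stack[top] = 0;                   /* initial configuration: (S0, a1 a2 ... an $) */
    for (;;) {
        Action a = action(stack[top], *input);
        if (a.kind == SHIFT) {        /* push the next state and consume one token   */
            stack[++top] = a.arg;
            input++;
        } else if (a.kind == REDUCE) {/* pop one state per RHS symbol, then push     */
            top -= rhs_len(a.arg);    /* GOTO[exposed state, LHS of the rule]        */
            stack[top + 1] = goTo(stack[top], lhs(a.arg));
            top++;
        } else if (a.kind == ACCEPT) {
            return 1;                 /* the input has been parsed successfully      */
        } else {
            return 0;                 /* ERROR: a syntax error was detected          */
        }
    }
}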

Slide 53

The Parser Actions

Slide 54

Example: The Grammar for Arithmetic Expressions

Slide 56

A Trace of a Parse of the String id + id * id

Slide 57

Summary

• The goals of a parser are to:

– Detect syntax errors

– Produce a parse tree

• A recursive-descent parser is an LL parser

• The parsing problem for bottom-up parsers: find the substring of the current sentential form that must be reduced

• The LR family of shift-reduce parsers is the most common bottom-up parsing approach
