LEXICAL ANALYSIS
Phung Hua Nguyen
University of Technology
2006
• Read the input characters
• Produce as output a sequence of tokens
• Eliminate white space and comments
[Figure: the lexical analyzer reads the source program, supplies a token on each "get next token" request, and uses the symbol table]
Why?
• Simplify design
• Improve compiler efficiency
• Enhance compiler portability
Tokens, Patterns, Lexemes
Token      Sample lexemes           Informal description of pattern
const      const                    const
relation   <, <=, ==, !=, >, >=     < or <= or == or != or > or >=
id         pi, count, x2            letter followed by letters or digits
num        3.14, 25, 6.02E3         any numeric constant
literal    "core dumped"            any characters between " and " except "
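The table above can be tried out with a small regular-expression tokenizer. This is only a sketch: the concrete patterns in `TOKEN_SPEC` (for example the exact shape of `num`) and the extra `ws` entry for white space are assumptions made for illustration, not definitions taken from the slides.

```python
import re

# Hypothetical token specification mirroring the table above.
# Order matters: "const" is tried before "id" so the keyword wins.
TOKEN_SPEC = [
    ("const",    r"const\b"),
    ("num",      r"\d+(?:\.\d+)?(?:E[+-]?\d+)?"),
    ("id",       r"[A-Za-z][A-Za-z0-9]*"),
    ("relation", r"<=|>=|==|!=|<|>"),
    ("literal",  r'"[^"]*"'),
    ("ws",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return (token, lexeme) pairs, skipping white space."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "ws":          # white space is eliminated
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize('const pi <= 3.14 "core dumped"'))
```

Each pair holds the token name and the lexeme that matched its pattern, which is the distinction the table draws.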
Alphabet, Strings and Languages
• Alphabet ∑: any finite set of symbols
– The Vietnamese alphabet {a, á, à, ả, ã, ạ, b, c, d, đ, …}
– The binary alphabet {0,1}
– The ASCII alphabet
• String: a finite sequence of symbols drawn from ∑
– Length |s| of a string s: the number of symbols in s
– The empty string, denoted ε, |ε| = 0
• Language: any set of strings over ∑
– its two special cases:
• ∅: the empty set
• {ε}
– The set of Pentium instructions
• ∑ = the ASCII set
– A string is a program
– The set of C programs
Terms (Fig. 3.7)
prefix of s      a string obtained by removing 0 or more trailing symbols of s; e.g. ban is a prefix of banana
suffix of s      a string formed by deleting 0 or more of the leading symbols of s; e.g. na is a suffix of banana
substring of s   a string obtained by deleting a prefix and a suffix from s
String operations
• String concatenation
– If x and y are strings, xy is the string formed by appending y to x.
E.g.: x = hom, y = nay ⇒ xy = homnay
– ε is the identity: εy = y; xε = x
• String exponentiation
– s^0 = ε
– s^i = s^(i-1) s, for i > 0
E.g.: s = 01, s^0 = ε, s^2 = 0101, s^3 = 010101
Language Operations (Fig. 3.8)
union: L ∪ M            L ∪ M = { s | s ∈ L or s ∈ M }
concatenation: LM       LM = { st | s ∈ L and t ∈ M }
Kleene closure: L*      L* = L^0 ∪ L ∪ LL ∪ LLL ∪ …, where L^0 = {ε}; 0 or more concatenations of L
positive closure: L+    L+ = L ∪ LL ∪ LLL ∪ …; 1 or more concatenations of L
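For finite languages, the operations above map directly onto Python sets. The sketch below is illustrative only: the function names are made up here, and because L* is infinite whenever L contains a non-empty string, `kleene_up_to` computes just the finite part L^0 ∪ L^1 ∪ … ∪ L^n.

```python
from itertools import product

def union(L, M):
    """L ∪ M: strings in L or in M."""
    return L | M

def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s, t in product(L, M)}

def power(L, i):
    """L^i: i-fold concatenation, with L^0 = {""} (the empty string ε)."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_up_to(L, n):
    """Finite approximation of L*: L^0 ∪ L^1 ∪ ... ∪ L^n."""
    out = set()
    for i in range(n + 1):
        out |= power(L, i)
    return out

L = {"a", "b"}
M = {"0", "1"}
print(sorted(concat(L, M)))             # ['a0', 'a1', 'b0', 'b1']
print(sorted(kleene_up_to({"01"}, 3)))  # ['', '01', '0101', '010101']
```

Dropping `power(L, 0)` from the union gives the corresponding approximation of the positive closure L+.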
• all strings of letters, including ε
• all strings of letters and digits beginning with a letter
• all strings of one or more digits
Regular Expressions (REs) over ∑
• Inductive base:
1. ε is an RE, denoting the RL {ε}
2. a ∈ ∑ is an RE, denoting the RL {a}
• Inductive step: Suppose r and s are REs, denoting the languages L(r) and L(s). Then
3. (r)|(s) is an RE, denoting the RL L(r) ∪ L(s)
4. (r)(s) is an RE, denoting the RL L(r)L(s)
5. (r)* is an RE, denoting the RL (L(r))*
6. (r) is an RE, denoting the RL L(r)
Precedence and Associativity
• Precedence:
– “*” has the highest precedence
– “concatenation” has the second highest precedence
– “|” has the lowest precedence
• Associativity:
– all are left-associative
E.g.: (a)|((b)*(c)) ≡ a|b*c
• ∑ = {a, b}
1. a|b denotes {a,b}
2. (a|b)(a|b) denotes {aa,ab,ba,bb}
3. a* denotes {ε,a,aa,aaa,aaaa,…}
4. (a|b)* denotes ?
5. a|a*b denotes ?
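One way to explore the two open questions above is to enumerate short strings over {a, b} and keep those that the whole expression matches. The sketch uses Python's `re` module, whose `|`, `*` and parentheses behave like the operators defined here for these examples. The helper name `language` and the length bound are assumptions of this sketch.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=3):
    """Strings over `alphabet` of length <= max_len matched entirely by `pattern`."""
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if re.fullmatch(pattern, s):
                found.append(s)
    return found

print(language("(a|b)*", max_len=2))  # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
print(language("a|a*b"))              # ['a', 'b', 'ab', 'aab']
```

The output suggests that (a|b)* denotes every string over {a, b}, including ε, and that a|a*b denotes the string a together with all strings of zero or more a's followed by a single b.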
Notational Shorthands
• One or more instances +: r+ = rr*
– denotes the language (L(r))+
• Zero or one instance ?: r? = r|ε
– denotes the language (L(r) ∪ {ε})
• Character classes
– [abc] denotes a|b|c
– [A-Z] denotes A|B|…|Z
– [a-zA-Z_][a-zA-Z0-9_]* denotes ?
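The shorthands above are all available in Python's `re` module, so the last character-class expression can be tested directly. This is a small sketch; the name `is_identifier` and the sample strings are chosen here for illustration.

```python
import re

# The last expression from the list above, used unchanged.
IDENT = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

def is_identifier(s):
    """True if the whole string s matches the character-class expression."""
    return IDENT.fullmatch(s) is not None

for s in ["count", "_tmp1", "x2", "2x", ""]:
    print(repr(s), is_identifier(s))

# r+ and rr* accept the same strings; likewise r? and r|ε
# (an empty alternative plays the role of ε in Python's syntax).
samples = ["", "a", "aa", "aaa", "b"]
assert all(bool(re.fullmatch("a+", s)) == bool(re.fullmatch("aa*", s)) for s in samples)
assert all(bool(re.fullmatch("a?", s)) == bool(re.fullmatch("a|", s)) for s in samples)
```

The expression accepts strings that start with a letter or underscore and continue with letters, digits or underscores, which is the usual shape of identifiers in C-like languages.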
Nondeterministic finite automata
• A nondeterministic finite automaton (NFA) is a mathematical model that consists of
– a finite set of states S
– a set of input symbols ∑
– a transition function move: S × (∑ ∪ {ε}) → 2^S, which maps a state and a symbol (or ε) to a set of states
– a start state s0
– a finite set of final or accepting states F
• An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state such that the edge labels along this path spell out x
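The acceptance condition can be checked without enumerating paths by tracking the set of states the NFA could be in after each input symbol. The sketch below makes several assumptions: the dictionary encoding, the use of the empty string as the label for ε-transitions, and the example automaton (a hand-built NFA for a|a*b) are all illustrative and not taken from the slides.

```python
EPS = ""  # label used in this sketch for ε-transitions

# Hypothetical NFA for a|a*b: state -> {label -> set of next states}
NFA = {
    0: {EPS: {1, 3}},
    1: {"a": {2}},
    2: {},
    3: {"a": {3}, "b": {4}},
    4: {},
}
START, ACCEPTING = 0, {2, 4}

def eps_closure(states, nfa):
    """All states reachable from `states` using ε-transitions alone."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in nfa[s].get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def move(states, symbol, nfa):
    """States reachable from some state in `states` on one `symbol` edge."""
    return {t for s in states for t in nfa[s].get(symbol, ())}

def nfa_accepts(x, nfa=NFA, start=START, accepting=ACCEPTING):
    current = eps_closure({start}, nfa)
    for ch in x:
        current = eps_closure(move(current, ch, nfa), nfa)
    # Accept iff some path spelling x ends in an accepting state.
    return bool(current & accepting)

print([w for w in ["", "a", "b", "aa", "aab", "ba"] if nfa_accepts(w)])
```

A string is accepted exactly when the final set of possible states contains an accepting state, which is the "some path" condition stated above.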
Deterministic finite automata
• A deterministic finite automaton (DFA) is a special case of NFA in which
1. no state has an ε-transition, and
2. for each state s and input symbol a, there is at most one edge labeled a leaving s
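Because each state has at most one outgoing edge per symbol, a DFA can be run with a single current state and one table lookup per character. The transition table below is a hypothetical DFA for (a|b)*abb, chosen only as an example; the `(state, symbol)` dictionary encoding is likewise an assumption of this sketch.

```python
# Hypothetical DFA for (a|b)*abb: (state, symbol) -> next state
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START, ACCEPTING = 0, {3}

def dfa_accepts(x, delta=DELTA, start=START, accepting=ACCEPTING):
    state = start
    for ch in x:
        if (state, ch) not in delta:
            return False          # no edge labeled ch leaves this state
        state = delta[(state, ch)]
    return state in accepting

print(dfa_accepts("aabb"), dfa_accepts("abab"))  # True False
```

Running time is proportional to the length of the input, with no sets of states to maintain.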
Thompson’s construction of NFA
Thompson’s construction (cont’d)
• Suppose N(s) and N(t) are NFAs for REs s and t
– REs ⇒ NFA (Thompson’s construction) √
– NFA ⇒ DFA (subset construction)
– DFA ⇒ minimal DFA (Algorithm 3.6)
• Programming
Subset construction
Operation       Description
ε-closure(s)    Set of NFA states reachable from state s on ε-transitions alone
ε-closure(T)    Set of NFA states reachable from some state s in T on ε-transitions alone
move(T, a)      Set of NFA states to which there is a transition on input a from some state s in T
• s : an NFA state
• T : a set of NFA states
Subset construction (cont’d)
Let s0 be the start state of the NFA;
Initially, ε-closure(s0) is the only state in Dstates, and it is unmarked;
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        DTran[T, a] := U;
    end;
end;
• Let (∑, S, T, F, s0) be the original NFA. The DFA is:
– The alphabet: ∑
– The states: all states in Dstates
– The transitions: DTran
– The accepting states: all states in Dstates containing at least one accepting state in F of the NFA
– The start state: ε-closure(s0)
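The subset construction and the resulting DFA definition can be sketched in Python as follows. The NFA encoding (a dictionary per state, with the empty string standing for ε) and the example automaton for a|a*b are assumptions made for this illustration. Each DFA state is a `frozenset` of NFA states so that it can be used as a dictionary key.

```python
EPS = ""  # label used in this sketch for ε-transitions

def eps_closure(states, nfa):
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in nfa[s].get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def move(states, a, nfa):
    return {t for s in states for t in nfa[s].get(a, ())}

def subset_construction(nfa, start, accepting, alphabet):
    start_set = frozenset(eps_closure({start}, nfa))
    dstates = {start_set}          # all DFA states found so far
    unmarked = [start_set]         # states whose transitions are not yet computed
    dtran = {}
    while unmarked:
        T = unmarked.pop()         # "mark T"
        for a in alphabet:
            U = frozenset(eps_closure(move(T, a, nfa), nfa))
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    # A DFA state accepts if it contains at least one accepting NFA state.
    dfa_accepting = {T for T in dstates if T & accepting}
    return dstates, dtran, start_set, dfa_accepting

# Hypothetical NFA for a|a*b
NFA = {
    0: {EPS: {1, 3}},
    1: {"a": {2}},
    2: {},
    3: {"a": {3}, "b": {4}},
    4: {},
}
dstates, dtran, d_start, d_accepting = subset_construction(NFA, 0, {2, 4}, "ab")
print(len(dstates), sorted(sorted(T) for T in d_accepting))
```

In this sketch the empty set shows up as an ordinary DFA state; it acts as a dead state from which no accepting state can be reached.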
Minimise a DFA
Initially, create two states:
1. one is the set of all final states: F
2. the other is the set of all non-final states: S - F
while (more splits are possible) {
    Let G = {s1, …, sn} be one of the current states (a group of original states) and c be any char in ∑
    Let t1, …, tn be the successor states of s1, …, sn under c
    if (t1, …, tn don't all belong to the same state) {
        Split G into new states so that si and sj remain in the same state iff ti and tj are in the same state
    }
}
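The splitting loop above can be sketched as partition refinement. The version below is a variant, not a line-by-line transcription: it splits each group on all input characters at once instead of one character c at a time, which reaches the same final partition. It also assumes a total transition function, so a dead state would have to be added first if some edges are missing. The five-state DFA for (a|b)*abb is a hypothetical example.

```python
def minimise(states, alphabet, delta, accepting):
    """Return the groups of equivalent states of a DFA with total `delta`."""
    partition = [g for g in (set(accepting), set(states) - set(accepting)) if g]
    changed = True
    while changed:                      # while more splits are possible
        changed = False
        group_of = {s: i for i, g in enumerate(partition) for s in g}
        new_partition = []
        for g in partition:
            # States stay together iff their successors fall in the same groups.
            buckets = {}
            for s in g:
                key = tuple(group_of[delta[(s, c)]] for c in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
            if len(buckets) > 1:
                changed = True
        partition = new_partition
    return partition

# Hypothetical DFA for (a|b)*abb with states A..E, accepting state E
DELTA = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
groups = minimise("ABCDE", "ab", DELTA, {"E"})
print(sorted(sorted(g) for g in groups))  # [['A', 'C'], ['B'], ['D'], ['E']]
```

Each resulting group becomes one state of the minimal DFA; here A and C merge because no input string distinguishes them.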
Input Buffering
[Figure: input buffer containing "b e g i n …" and eof, read by the Scanner]
if (forward at end of first half) {
    reload second half;
    forward++;
} else if (forward at end of second half) {
    reload first half;
    forward = 0;
} else
    forward++;
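The pointer-advancing logic above can be sketched as a small Python class. Everything named here is an assumption for illustration: the class `TwoHalfBuffer`, the tiny half size `n=4`, and the use of the empty string as the eof marker. A real scanner would use much larger halves and would also track the start of the current lexeme.

```python
import io

class TwoHalfBuffer:
    """Buffer of 2*n characters that is refilled one half at a time."""

    def __init__(self, stream, n=4):
        self.stream, self.n = stream, n
        self.buf = [""] * (2 * n)
        self._load(0)                 # fill the first half
        self.forward = 0

    def _load(self, start):
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            # "" marks eof in this sketch
            self.buf[start + i] = chunk[i] if i < len(chunk) else ""

    def next_char(self):
        ch = self.buf[self.forward]
        if ch == "":
            return ""                 # eof reached, do not advance
        if self.forward == self.n - 1:          # at end of first half
            self._load(self.n)                  # reload second half
            self.forward += 1
        elif self.forward == 2 * self.n - 1:    # at end of second half
            self._load(0)                       # reload first half
            self.forward = 0
        else:
            self.forward += 1
        return ch

def read_all(buf):
    out = []
    while (ch := buf.next_char()) != "":
        out.append(ch)
    return "".join(out)

print(read_all(TwoHalfBuffer(io.StringIO("begin := 1"), n=4)))
```

Only one half is overwritten at a time, so the most recently read characters in the other half remain available.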
Input Buffering (cont’d)
[Figure: input buffer containing "b e g i n"]
forward++;
if (forward points at eof) {
    if (forward at end of first half) {
        reload second half;
        forward++;
    } else if (forward at end of second half) {
        reload first half;
        forward = 0;
    } else
        terminate the analysis
}
move forward back
get lexeme from beginning to forward
move forward onward
beginning = forward
state = 0
}
b e g i n : = …
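The steps above (retract forward, take the lexeme between beginning and forward, then restart from state 0) can be sketched as a DFA-driven scanning loop. The transition table, the character classes and the token names below are hypothetical. The sketch remembers the last accepting position and moves forward back to it, which gives the longest match.

```python
def char_class(ch):
    """Map a character to the symbol class used in the transition table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return ch

# Hypothetical scanner DFA; state 0 is the start state.
DELTA = {
    (0, "letter"): 1, (1, "letter"): 1, (1, "digit"): 1,      # id
    (0, ":"): 2, (2, "="): 3,                                  # :=
    (0, "space"): 4, (4, "space"): 4,                          # white space
    (0, "digit"): 5, (5, "digit"): 5, (5, "."): 6,
    (6, "digit"): 7, (7, "digit"): 7,                          # num
    (0, "."): 8,                                               # dot
}
ACCEPTING = {1: "id", 3: "assign", 4: "ws", 5: "num", 7: "num", 8: "dot"}

def scan(text, delta=DELTA, accepting=ACCEPTING):
    tokens = []
    beginning = 0
    while beginning < len(text):
        state, forward = 0, beginning
        last_accept = None
        while forward < len(text):
            nxt = delta.get((state, char_class(text[forward])))
            if nxt is None:
                break
            state = nxt
            forward += 1
            if state in accepting:
                last_accept = (forward, accepting[state])
        if last_accept is None:
            raise ValueError(f"no token starts at position {beginning}")
        forward, token = last_accept            # move forward back
        if token != "ws":
            tokens.append((token, text[beginning:forward]))  # get lexeme
        beginning = forward                     # state is reset at loop top
    return tokens

print(scan("begin := 3.x"))
```

On the input `3.x` the DFA reads past `3` into the `.` before failing, so forward is retracted to just after `3` and the `.` is scanned again as its own token.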