Without word boundaries, the 0 in $30 was also matched. Why? Because there is no $ in front of it (the character in front of it is 3). Enclosing the entire pattern within word boundaries solves this problem.
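The effect of the word boundaries can be checked with a small sketch in Python's re module. The sample sentence and the exact negative-lookbehind pattern below are assumptions chosen for illustration; only the $30 case comes from the text above.

```python
import re

# Hypothetical sample text; only the "$30" case is taken from the lesson.
text = "I paid $30 for 100 apples."

# Negative lookbehind alone: digits not preceded by a dollar sign.
# The 0 in $30 qualifies because the character before it is 3, not $.
without_boundaries = re.findall(r"(?<!\$)\d+", text)

# The same pattern enclosed in word boundaries.
with_boundaries = re.findall(r"\b(?<!\$)\d+\b", text)

print(without_boundaries)  # ['0', '100']
print(with_boundaries)     # ['100']
```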
Summary
Looking ahead and behind provides greater control over what is returned when matches are made. The lookaround operations allow subexpressions to be used to specify the location of text to be matched but not consumed (matched, but not included in the matched text itself). Positive lookahead is defined using (?=), and negative lookahead is defined using (?!). Some regular expression implementations also support positive lookbehind using (?<=) and negative lookbehind using (?<!).
Lesson 10. Embedding Conditions
A powerful yet infrequently used feature of the regular expression language is the capability to embed conditional processing within an expression. This lesson will explore this topic.
Why Embed Conditions?
(123)456-7890 and 123-456-7890 are both acceptable presentation formats for North American phone numbers. 1234567890, (123)-456-7890, and (123-456-7890 all contain the correct number of digits, but are badly formatted. How could you write a regular expression to match only the acceptable formats and not any others?
This is not a trivial problem; consider this obvious solution:
123-456-7890
(123)456-7890
(123)-456-7890
(123-456-7890
1234567890
123 456 7890
\(?\d{3}\)?-?\d{3}-\d{4}
123-456-7890
(123)456-7890
(123)-456-7890
(123-456-7890
1234567890
123 456 7890
\(? matches an optional opening parenthesis (notice that ( must be escaped), \d{3} matches the first three digits, \)? matches an optional closing parenthesis, -? matches an optional hyphen, and \d{3}-\d{4} matches the remaining seven digits (separated by a hyphen). The pattern correctly did not match the last two lines, but it did match the third and fourth, both of which are incorrect (the third contains both ) and -, and the fourth has an unmatched parenthesis).
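The behavior just described can be reproduced with a short sketch in Python's re module. The variable names are illustrative only; the pattern and the six sample lines are the ones shown above.

```python
import re

samples = [
    "123-456-7890",
    "(123)456-7890",
    "(123)-456-7890",
    "(123-456-7890",
    "1234567890",
    "123 456 7890",
]

# Every punctuation mark is independently optional, so bad mixes slip through.
naive = re.compile(r"\(?\d{3}\)?-?\d{3}-\d{4}")

naive_hits = [s for s in samples if naive.search(s)]
print(naive_hits)
# ['123-456-7890', '(123)456-7890', '(123)-456-7890', '(123-456-7890']
```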
Replacing \)?-? with [\)-]? will help eliminate the third line (by allowing only ) or -, but not both), but the fourth line is still a problem. The pattern needs to match ) only if there is an opening (. In truth, the pattern needs to match ) if there is an opening (; if not, it needs to match -, and that type of pattern cannot be implemented without conditional processing.
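To see why a character class is only a partial fix, here is a sketch (again in Python's re module, with illustrative variable names) of a variant that accepts either ) or - after the area code, but not both. It is one possible reading of the substitution described above, not a recommended pattern.

```python
import re

samples = [
    "123-456-7890",
    "(123)456-7890",
    "(123)-456-7890",
    "(123-456-7890",
    "1234567890",
    "123 456 7890",
]

# At most one separator after the area code: either ) or -.
char_class = re.compile(r"\(?\d{3}[\)-]?\d{3}-\d{4}")

class_hits = [s for s in samples if char_class.search(s)]
print(class_hits)
# ['123-456-7890', '(123)456-7890', '(123-456-7890']
```

The third line is gone, but the line with the unmatched opening parenthesis still matches, because nothing ties the closing ) to the opening (.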
Caution
Conditional processing is not supported by all regular expression implementations.
Using Conditions
Regular expression conditions are defined using ?. In fact, you have already seen a couple of very specific conditions:
? matches the previous character or expression if it exists.
?= and ?<= match text ahead or behind, if it exists.
Embedded condition syntax also uses ?, which is not surprising considering that the conditions that are embedded are the same two just listed:
Conditional processing based on a backreference
Conditional processing based on lookaround
Backreference Conditions
A backreference condition allows for an expression to be used only if a previous subexpression search was successful. If that sounds obscure, consider an example: You need to locate all <IMG> tags in your text; in addition, if any <IMG> tags are links (enclosed between <A> and </A> tags), you need to match the complete link tags as well.
The syntax for this type of condition is (?(backreference)true). The ? starts the condition, the backreference is specified within parentheses, and the expression to be evaluated only if the backreference is present immediately follows.
Now for the example:
<!-- Nav bar -->
<TD>
<A HREF="/home"><IMG SRC="/images/home.gif"></A>
<IMG SRC="/images/spacer.gif">
<A HREF="/search"><IMG SRC="/images/search.gif"></A>
<IMG SRC="/images/spacer.gif">
<A HREF="/help"><IMG SRC="/images/help.gif"></A>
</TD>
(<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>(?(1)\s*</[Aa]>)
<!-- Nav bar -->
<TD>
<A HREF="/home"><IMG SRC="/images/home.gif"></A>
<IMG SRC="/images/spacer.gif">
<A HREF="/search"><IMG SRC="/images/search.gif"></A>
<IMG SRC="/images/spacer.gif">
<A HREF="/help"><IMG SRC="/images/help.gif"></A>
</TD>
This pattern requires explanation. (<[Aa]\s+[^>]+>\s*)? matches an opening <A> or <a> tag (with any attributes that may be present), if present (the closing ? makes the expression optional). <[Ii][Mm][Gg]\s+[^>]+> then matches the <IMG> tag (regardless of case) with any of its attributes. (?(1)\s*</[Aa]>) starts off with a condition: ?(1) means execute only what comes next if backreference 1 (the opening <A> tag) exists (or in other words, execute only what comes next if the first <A> match was successful). If (1) exists, then \s*</[Aa]> matches any trailing whitespace followed by the closing </A> tag.
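Python's re module is one implementation that supports backreference conditions, so the pattern can be tried there as a sketch. The variable names are illustrative; the markup and the pattern are the ones from the example above.

```python
import re

html = '''<!-- Nav bar -->
<TD>
<A HREF="/home"><IMG SRC="/images/home.gif"></A>
<IMG SRC="/images/spacer.gif">
<A HREF="/search"><IMG SRC="/images/search.gif"></A>
<IMG SRC="/images/spacer.gif">
<A HREF="/help"><IMG SRC="/images/help.gif"></A>
</TD>'''

# Group 1 is the optional opening <A> tag; (?(1)...) requires the
# closing </A> only when group 1 took part in the match.
pattern = re.compile(
    r"(<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>(?(1)\s*</[Aa]>)"
)

matches = [m.group(0) for m in pattern.finditer(html)]
for text in matches:
    print(text)
```

Linked images come back with their surrounding <A> and </A> tags, and the spacer images come back on their own.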
Note
?(1) checks to see if backreference 1 exists. The backreference number (1 in this example) does not need to be escaped in conditions. So, ?(1) is correct, and ?(\1) is not (although the latter will usually work, too).
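Whether the escaped form is tolerated depends on the implementation. As one data point, Python's re module appears to reject it at compile time; the sketch below checks this with a hypothetical helper named compiles and two throwaway patterns.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Unescaped group number inside the condition.
unescaped_ok = compiles(r"(a)?(?(1)b)")

# Escaped group number inside the condition.
escaped_ok = compiles(r"(a)?(?(\1)b)")

print(unescaped_ok, escaped_ok)
```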
The pattern just used executes an expression if a condition is met. Conditions can also have else expressions, expressions that are executed only if the backreference does not exist (the condition is not met). The syntax for this form of condition is (?(backreference)true|false). This syntax accepts a condition, as well as the expressions to be executed if the condition is met or not met.
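The true|false form can be seen in isolation with a deliberately tiny pattern. This is a minimal sketch in Python's re module; the pattern (a)?(?(1)b|c) and the test strings are made up purely to show which branch runs.

```python
import re

# If group 1 matched an "a", require "b"; otherwise require "c".
cond = re.compile(r"(a)?(?(1)b|c)")

results = {s: bool(cond.fullmatch(s)) for s in ["ab", "c", "ac", "b"]}
print(results)
# {'ab': True, 'c': True, 'ac': False, 'b': False}
```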
This syntax provides the solution for the phone number problem as shown here:
123-456-7890
(123)456-7890
(123)-456-7890
(123-456-7890
1234567890
123 456 7890
(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}
123-456-7890
(123)456-7890
(123)-456-7890
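The conditional phone number pattern can be tried in Python's re module as a sketch. One assumption is added here: each line is validated as a whole with fullmatch, because an unanchored search would still find 123-456-7890 inside (123-456-7890 by starting after the stray parenthesis.

```python
import re

samples = [
    "123-456-7890",
    "(123)456-7890",
    "(123)-456-7890",
    "(123-456-7890",
    "1234567890",
    "123 456 7890",
]

# Group 1 captures an optional "(". If it matched, ")" is required
# after the area code; otherwise "-" is required.
phone = re.compile(r"(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}")

# Validate whole lines (an assumption; the lesson searches within text).
valid = [s for s in samples if phone.fullmatch(s)]
print(valid)  # ['123-456-7890', '(123)456-7890']

# An unanchored search skips the unmatched "(" and matches the rest.
partial = phone.search("(123-456-7890").group(0)
print(partial)  # 123-456-7890
```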