Regular Expression Pocket Reference
by Tony Stubblebine
Copyright © 2007, 2003 Tony Stubblebine. All rights reserved. Portions of this book are based on Mastering Regular Expressions, by Jeffrey E. F. Friedl, Copyright © 2006, 2002, 1997 O’Reilly Media, Inc.
Editor: Andy Oram
Production Editor: Sumita Mukherji
Copyeditor: Genevieve d’Entremont
Indexer: Johnna VanHoose Dinse
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Printing History:
August 2003: First Edition
July 2007: Second Edition
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. The Pocket Reference series designations, Regular Expression Pocket Reference, the image of owls, and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.
Java™ is a trademark of Sun Microsystems, Inc. Microsoft Internet Explorer and .NET are registered trademarks of Microsoft Corporation. Spider-Man is a registered trademark of Marvel Enterprises, Inc.
While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
Contents
Introduction to Regexes and Pattern Matching
Regex Metacharacters, Modes, and Constructs
.NET and C#
Supported Metacharacters
Regular Expression Classes and Interfaces
Supported Metacharacters
Pattern-Matching Methods and Objects
is well-formed.
Today, regular expressions are included in most programming languages, as well as in many scripting languages, editors, applications, databases, and command-line tools. This book aims to give quick access to the syntax and pattern-matching operations of the most popular of these languages so that you can apply your regular-expression knowledge in any environment.
The second edition of this book adds sections on Ruby and Apache web server, common regular expressions, and also updates existing languages.
About This Book
This book starts with a general introduction to regular expressions. The first section describes and defines the constructs used in regular expressions, and establishes the common principles of pattern matching. The remaining sections of the book are devoted to the syntax, features, and usage of regular expressions in various implementations.
The implementations covered in this book are Perl, Java™, .NET and C#, Ruby, Python, PCRE, PHP, Apache web server, vi editor, JavaScript, and shell tools.
Conventions Used in This Book
The following typographical conventions are used in this book:
Constant width italic
Used for text that should be replaced with user-supplied values.
Constant width bold
Used in examples for commands or other text that should be typed literally by the user.
Acknowledgments
Jeffrey E. F. Friedl’s Mastering Regular Expressions (O’Reilly) is the definitive work on regular expressions. While writing, I relied heavily on his book and his advice. As a convenience, this book provides page references to Mastering Regular Expressions, Third Edition (MRE) for expanded discussion of regular expression syntax and concepts.
Nat Torkington and Linda Mui were excellent editors who guided me through what turned out to be a tricky first edition. This edition was aided by the excellent editorial skills of Andy Oram. Sarah Burcham deserves special thanks for giving me the opportunity to write this book, and for her contributions to the “Shell Tools” section. More thanks for the input and technical reviews from Jeffrey Friedl, Philip Hazel, Steve Friedl, Ola Bini, Ian Darwin, Zak Greant, Ron Hitchens, A.M. Kuchling, Tim Allwine, Schuyler Erle, David Lents, Rabble, Rich Bowan, Eric Eisenhart, and Brad Merrill.
Introduction to Regexes and Pattern Matching
A regular expression is a string containing a combination of normal characters and special metacharacters or metasequences. The normal characters match themselves. Metacharacters and metasequences are characters or sequences of characters that represent ideas such as quantity, locations, or types of characters. The list in “Regex Metacharacters, Modes, and Constructs” shows the most common metacharacters and metasequences in the regular expression world. Later sections list the availability of and syntax for supported metacharacters for particular implementations of regular expressions.
Pattern matching consists of finding a section of text that is described (matched) by a regular expression. The underlying code that searches the text is the regular expression engine. You can predict the results of most matches by keeping two rules in mind:
1. The earliest (leftmost) match wins.
Regular expressions are applied to the input starting at the first character and proceeding toward the last. As soon as the regular expression engine finds a match, it returns. (See MRE 148–149.)
2. Standard quantifiers are greedy.
Quantifiers specify how many times something can be repeated. The standard quantifiers attempt to match as many times as possible. They settle for less than the maximum only if this is necessary for the success of the match. The process of giving up characters and trying less-greedy matches is called backtracking. (See MRE 151–153.)
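The two rules can be observed directly in any backtracking engine. The following is a minimal sketch using Python's standard re module; the sample strings and variable names are invented for illustration and are not taken from this book.

```python
import re

text = "the cat sat on the category mat"

# Rule 1: the leftmost match wins, even though a longer candidate
# ("category") appears later in the input.
leftmost = re.search(r"cat\w*", text)
print(leftmost.group(), leftmost.start())

# Rule 2: standard quantifiers are greedy, so .+ runs to the last ">".
greedy = re.search(r"<.+>", "<b>bold</b>")
print(greedy.group())

# Backtracking: \d+ first consumes "1253", then gives characters back
# until the literal 5 can match.
backtracked = re.search(r"\d+5", "1253")
print(backtracked.group())
```

In the last case the quantifier settles for two digits only because the overall match would otherwise fail.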
Regular expression engines have differences based on their type. There are two classes of engines: Deterministic Finite Automaton (DFA) and Nondeterministic Finite Automaton (NFA). DFAs are faster, but lack many of the features of an NFA, such as capturing, lookaround, and nongreedy quantifiers. In the NFA world, there are two types: traditional and POSIX.
DFA engines
DFAs compare each character of the input string to the regular expression, keeping track of all matches in progress. Since each character is examined at most once, the DFA engine is the fastest. One additional rule to remember with DFAs is that the alternation metasequence is greedy. When more than one option in an alternation (foo|foobar) matches, the longest one is selected. So, rule No. 1 can be amended to read “the longest leftmost match wins.” (See MRE 155–156.)
Traditional NFA engines
Traditional NFA engines compare each element of the regex to the input string, keeping track of positions where it chose between two options in the regex. If an option fails, the engine backtracks to the most recently saved position. For standard quantifiers, the engine chooses the greedy option of matching more text; however, if that option leads to the failure of the match, the engine returns to a saved position and tries a less greedy path. The traditional NFA engine uses ordered alternation, where each option in the alternation is tried sequentially. A longer match may be ignored if an earlier option leads to a successful match. So, here rule No. 1 can be amended to read “the first leftmost match after greedy quantifiers have had their fill wins.” (See MRE 153–154.)
POSIX NFA engines
POSIX NFA engines work similarly to traditional NFAs with one exception: a POSIX engine always picks the longest of the leftmost matches. For example, the alternation cat|category would match the full word “category” whenever possible, even if the first alternative (“cat”) matched and appeared earlier in the alternation.
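Ordered alternation is easy to see in Python, whose re module is a backtracking engine of the traditional NFA kind. The sketch below also includes longest_alternative, a hypothetical helper written only for this illustration; it imitates the longest-match choice of a DFA or POSIX engine at a single start position and is not how such engines are implemented.

```python
import re

# Ordered alternation: the first alternative that succeeds is kept.
ordered = re.match(r"foo|foobar", "foobar")
reordered = re.match(r"foobar|foo", "foobar")
print(ordered.group(), reordered.group())


def longest_alternative(alternatives, text):
    """Hypothetical helper: try every alternative at position 0 and
    keep the longest match, imitating "longest leftmost" semantics."""
    matches = [re.match(alt, text) for alt in alternatives]
    found = [m.group() for m in matches if m]
    return max(found, key=len) if found else None


print(longest_alternative(["cat", "category"], "category theory"))
```

With a traditional NFA engine, listing the longer alternative first is the usual way to get the longer match.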
Regex Metacharacters, Modes, and Constructs
The metacharacters and metasequences shown here represent most available types of regular expression constructs and their most common syntax. However, syntax and availability vary by implementation.
Character representations
Many implementations provide shortcuts to represent characters that may be difficult to input. (See MRE 115–118.)
Character shorthands
Most implementations have specific shorthands for the alert, backspace, escape character, form feed, newline, carriage return, horizontal tab, and vertical tab characters. For example, \n is often a shorthand for the newline character, which is usually LF (012 octal), but can sometimes be CR (015 octal), depending on the operating system. Confusingly, many implementations use \b to mean both backspace and word boundary (position between a “word” character and a nonword character). For these implementations, \b means backspace in a character class (a set of possible characters to match in the string), and word boundary elsewhere.
Octal escape: \num
Represents a character corresponding to a two- or three-digit octal number. For example, \015\012 matches an ASCII CR/LF sequence.
Hex and Unicode escapes: \xnum, \x{num}, \unum, \Unum
Represent characters corresponding to hexadecimal numbers. Four-digit and larger hex numbers can represent the range of Unicode characters. For example, \x0D\x0A matches an ASCII CR/LF sequence.
Control characters: \cchar
Corresponds to ASCII control characters encoded with values less than 32. To be safe, always use an uppercase char; some implementations do not handle lowercase representations. For example, \cH matches Control-H, an ASCII backspace character.
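As a quick check of the escapes described above, here is a small sketch in Python's re module. The sample strings are made up for illustration, and the \cchar form is left out because Python's re does not accept it.

```python
import re

crlf = "line one\r\nline two"

# Octal and hexadecimal escapes describe the same CR/LF pair.
octal = re.search(r"\015\012", crlf)
hexa = re.search(r"\x0D\x0A", crlf)
print(octal.span(), hexa.span())

# Inside a character class, \b means backspace (0x08).
backspace = re.search(r"[\b]", "a\x08b")
print(backspace.start())

# Outside a character class, \b is a word boundary.
boundary = re.findall(r"\bcat\b", "cat concat cat.")
print(boundary)
```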
Character classes and class-like constructs
Character classes are used to specify a set of characters. A character class matches a single character in the input string that is within the defined set of characters. (See MRE 118–128.)
Normal classes: [ ] and [^ ]
Character classes, [ ], and negated character classes, [^ ], allow you to list the characters that you do or do not want to match. A character class always matches one character. The - (dash) indicates a range of characters. For example, [a-z] matches any lowercase ASCII letter. To include the dash in the list of characters, either list it first, or escape it.
Almost any character: dot (.)
Usually matches any character except a newline. However, the match mode usually can be changed so that dot also matches newlines. Inside a character class, dot matches just a dot.
Class shorthands: \w, \d, \s, \W, \D, \S
Commonly provided shorthands for word character, digit, and space character classes. A word character is often all ASCII alphanumeric characters plus the underscore. However, the list of alphanumerics can include additional locale or Unicode alphanumerics, depending on the implementation. A lowercase shorthand (e.g., \s) matches a character from the class; uppercase (e.g., \S) matches a character not from the class. For example, \d matches a single digit character, and is usually equivalent to [0-9].
POSIX character class: [:alnum:]
POSIX defines several character classes that can be used only within regular expression character classes (see Table 1). Take, for example, [:lower:]. When written as
Unicode properties, scripts, and blocks: \p{prop}, \P{prop}
The Unicode standard defines classes of characters that have a particular property, belong to a script, or exist within a block. Properties are the character’s defining characteristics, such as being a letter or a number (see Table 2). Scripts are systems of writing, such as Hebrew, Latin, or Han. Blocks are ranges of characters on the Unicode character map. Some implementations require that Unicode properties be prefixed with Is or In. For example, \p{Ll} matches lowercase letters in any Unicode-supported language, such as a or α.
Unicode combining character sequence: \X
Matches a Unicode base character followed by any number of Unicode-combining characters. This is a shorthand for \P{M}\p{M}*. For example, \X matches è, as well as the two characters e'.
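The basic class constructs (normal classes, negation, the dash, dot, and the \d and \w shorthands) can be tried out with a short sketch in Python's re module. The sample string is invented, and the \p{prop} and \X constructs are not shown because the standard re module does not provide them.

```python
import re

sample = "Order #42-B shipped to ZIP 90210."

lowercase_runs = re.findall(r"[a-z]+", sample)   # ranges with a dash
digit_runs = re.findall(r"\d+", sample)          # shorthand for digits
punctuation = re.findall(r"[^\w\s]", sample)     # negated class
dash_or_hash = re.findall(r"[-#]", sample)       # dash listed first is literal

# Dot skips newlines unless the match mode is changed.
default_dot = re.findall(r"a.c", "abc a\nc")
dotall_dot = re.findall(r"a.c", "abc a\nc", re.DOTALL)

# Inside a class, dot is just a dot.
literal_dot = re.findall(r"[.]", "a.b")

print(lowercase_runs, digit_runs, punctuation, dash_or_hash)
print(default_dot, dotall_dot, literal_dot)
```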
Table 1 POSIX character classes
Digit Decimal digits
Graph Printing characters, excluding space
Lower Lowercase letters
Print Printing characters, including space
Punct Printing characters, excluding letters and digits
Space Whitespace
Upper Uppercase letters
Xdigit Hexadecimal digits
Table 2 Standard Unicode properties
\p{Lu} Uppercase letters
\p{C} Control codes and characters not in other categories
\p{Cc} ASCII and Latin-1 control characters
\p{Cf} Nonvisible formatting characters
\p{Cn} Unassigned code points
\p{Co} Private use, such as company logos
\p{Cs} Surrogates
\p{M} Marks meant to combine with base characters, such as accent marks
\p{Mc} Modification characters that take up their own space. Examples include “vowel signs.”
\p{Me} Marks that enclose other characters, such as circles, squares, and diamonds
\p{Mn} Characters that modify other characters, such as accents and umlauts
\p{N} Numeric characters
\p{Nd} Decimal digits in various scripts
\p{Nl} Letters that represent numbers, such as Roman numerals
\p{No} Superscripts, symbols, or nondigit characters representing numbers
\p{P} Punctuation
\p{Pc} Connecting punctuation, such as an underscore
\p{Pd} Dashes and hyphens
\p{Pe} Closing punctuation complementing \p{Ps}
\p{Pi} Initial punctuation, such as opening quotes
Anchors and zero-width assertions
Anchors and “zero-width assertions” match positions in the input string. (See MRE 128–134.)
Start of line/string: ^, \A
Matches at the beginning of the text being searched. In multiline mode, ^ matches after any newline. Some implementations support \A, which matches only at the beginning of the text.
End of line/string: $, \Z, \z
$ matches at the end of a string. In multiline mode, $ matches before any newline. When supported, \Z matches the end of string or the point before a string-ending newline, regardless of match mode. Some implementations also provide \z, which matches only the end of the string, regardless of newlines.
Table 2 Standard Unicode properties (continued)
\p{Pf} Final punctuation, such as closing quotes
\p{Po} Other punctuation marks
\p{Ps} Opening punctuation, such as opening parentheses
\p{S} Symbols
\p{Sc} Currency
\p{Sk} Combining characters represented as individual characters
\p{Sm} Math symbols
\p{So} Other symbols
\p{Z} Separating characters with no visual representation
Start of match: \G
In iterative matching, \G matches the position where the previous match ended. Often, this spot is reset to the beginning of a string on a failed match.
Word boundary: \b, \B, \<, \>
Word boundary metacharacters match a location where a word character is next to a nonword character. \b often specifies a word boundary location, and \B often specifies a not-word-boundary location. Some implementations provide separate metasequences for start- and end-of-word boundaries, often \< and \>.
Lookahead: (?= ), (?! )
Lookbehind: (?<= ), (?<! )
Lookaround constructs match a location in the text where the subpattern would match (lookahead), would not match (negative lookahead), would have finished matching (lookbehind), or would not have finished matching (negative lookbehind). For example, foo(?=bar) matches foo in foobar, but not food. Implementations often limit lookbehind constructs to subpatterns with a predetermined length.
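The anchors and lookaround constructs can be exercised with a brief sketch in Python's re module. The input strings are made up for illustration. \G, \<, and \> are omitted because Python's re does not offer them, and Python's \Z is avoided because it matches only at the very end of the string, unlike the \Z described here.

```python
import re

log = "alpha\nbeta\ngamma"

# ^ and $ anchor to the whole string unless multiline mode is on.
first_word = re.findall(r"^\w+", log)
line_starts = re.findall(r"^\w+", log, re.MULTILINE)
last_word = re.findall(r"\w+$", log)

# \A stays at the start of the text even in multiline mode.
only_start = re.findall(r"\A\w+", log, re.MULTILINE)

# Lookahead checks what follows without consuming it.
in_foobar = re.search(r"foo(?=bar)", "foobar")
in_food = re.search(r"foo(?=bar)", "food")
negative = re.search(r"foo(?!bar)", "food")

# Lookbehind checks what came before the current position.
priced = re.findall(r"(?<=\$)\d+", "cost $25, qty 3")
unpriced = re.findall(r"(?<!\$)\b\d+", "cost $25, qty 3")

print(first_word, line_starts, last_word, only_start)
print(in_foobar.group(), in_food, negative.group(), priced, unpriced)
```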
Comments and mode modifiers
Mode modifiers change how the regular expression engine interprets a regular expression. (See MRE 110–113, 135–136.)
Mode modifiers: (?i), (?-i), (?mod: )
Usually, mode modifiers may be set within a regular expression with (?mod) to turn modes on for the rest of the current subexpression; (?-mod) to turn modes off for the rest of the current subexpression; and (?mod: ) to turn modes on or off between the colon and the closing parentheses. For example, use (?i:perl) matches use perl, use Perl, use PeRl, etc.
Literal-text span: \Q \E
Escapes metacharacters between \Q and \E. For example, \Q(.*)\E is the same as \(\.\*\)
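A short sketch of both ideas in Python follows. It assumes Python 3.6 or later for the scoped (?i: ) form. Python's re has no \Q \E span, so re.escape is used here as the nearest equivalent; that substitution is an assumption of this example, not something the text above prescribes.

```python
import re

# Case-insensitivity applies only inside the (?i: ) group.
scoped = re.compile(r"use (?i:perl)")
candidates = ["use perl", "use Perl", "use PeRl", "USE perl"]
results = [bool(scoped.fullmatch(c)) for c in candidates]
print(results)

# re.escape plays the role of \Q \E: metacharacters become literals.
escaped = re.escape("(.*)")
literal = re.search(escaped, "literal (.*) text")
print(escaped, literal.group())
```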
Grouping, capturing, conditionals, and control
This section covers syntax for grouping subpatterns, capturing submatches, conditional submatches, and quantifying the number of times a subpattern matches. (See MRE 137–142.)
Capturing and grouping parentheses: ( ) and \1, \2, etc.
Parentheses perform two functions: grouping and capturing. Text matched by the subpattern within parentheses is captured for later use. Capturing parentheses are numbered by counting their opening parentheses from the left. If backreferences are available, the submatch can be referred to later in the same match with \1, \2, etc. The captured text is made available after a match by implementation-specific methods. For example, \b(\w+)\b\s+\1\b matches duplicate words, such as the the.
Grouping-only parentheses: (?: )
Groups a subexpression, possibly for alternation or quantifiers, but does not capture the submatch. This is useful for efficiency and reusability. For example, (?:foobar) matches foobar, but does not save the match to a capture group.
Named capture: (?<name> )
Performs capturing and grouping, with captured text later referenced by name. For example, Subject:(?<subject>.*) captures the text following Subject: to a capture group that can be referenced by the name subject.
matches the words foo or bar.
Conditional: (?(if)then|else)
The if is implementation-dependent, but generally is a reference to a captured subexpression or a lookaround. The then and else parts are both regular expression patterns. If the if part is true, the then is applied. Otherwise, else is applied. For example, (<)?foo(?(1)>|bar) matches <foo> as well as foobar.
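The grouping constructs can be tried out in Python's re module, as in the sketch below. One difference to keep in mind is that Python spells named capture as (?P<name> ) rather than (?<name> ). The sample strings are invented for illustration.

```python
import re

# Backreference: \1 must repeat whatever group 1 captured.
duplicate = re.search(r"\b(\w+)\b\s+\1\b", "Paris in the the spring")
print(duplicate.group(), duplicate.group(1))

# Grouping-only parentheses do not create a capture group.
grouped = re.match(r"(?:foo|bar)+(baz)", "foobarbaz")
print(grouped.groups())

# Named capture, using Python's (?P<name> ) spelling.
named = re.match(r"Subject:(?P<subject>.*)", "Subject: Regex tips")
print(named.group("subject"))

# Conditional: require ">" only if the opening "<" was captured.
conditional = re.compile(r"(<)?foo(?(1)>|bar)")
checks = [bool(conditional.fullmatch(s)) for s in ["<foo>", "foobar", "<foobar"]]
print(checks)
```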
Greedy quantifiers: *, +, ?, {num,num}
The greedy quantifiers determine how many times a construct may be applied. They attempt to match as many times as possible, but will backtrack and give up matches when that is necessary for the overall match to succeed.
Lazy quantifiers: *?, +?, ??, {num,num}?
Lazy quantifiers control how many times a construct may be applied. However, unlike greedy quantifiers, they attempt to match as few times as possible. For example, (an)+? matches only an of banana.
Possessive quantifiers: *+, ++, ?+, {num,num}+
Possessive quantifiers are like greedy quantifiers, except that they “lock in” their match, disallowing later backtracking to break up the submatch. For example, (ab)++ab will not match ababababab.
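The difference between greedy and lazy quantifiers shows up clearly in a small Python sketch. Possessive quantifiers are left out of the code because older versions of Python's re module do not accept them; the greedy (ab)+ab case is included to show the backtracking that a possessive quantifier would forbid.

```python
import re

# Greedy takes as many repetitions as it can; lazy takes as few.
greedy_an = re.search(r"(an)+", "banana")
lazy_an = re.search(r"(an)+?", "banana")
print(greedy_an.group(), lazy_an.group())

# The same contrast with an interval quantifier.
greedy_digits = re.search(r"\d{2,4}", "123456")
lazy_digits = re.search(r"\d{2,4}?", "123456")
print(greedy_digits.group(), lazy_digits.group())

# Greedy (ab)+ first takes all five pairs, then gives one back so the
# trailing "ab" can match. A possessive (ab)++ would refuse to do so.
gives_back = re.fullmatch(r"(ab)+ab", "ababababab")
print(gives_back is not None)
```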
Unicode Support
The Unicode character set gives unique numbers to the characters in all the world’s languages. Because of the large number of possible characters, Unicode requires more than one byte to represent a character. Some regular expression implementations will not understand Unicode characters because they expect 1 byte ASCII characters. Basic support for Unicode characters starts with the ability to match a literal string of Unicode characters. Advanced support includes character classes and other constructs that incorporate characters from all Unicode-supported languages. For example, \w might match è as well as e.
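How \w behaves with and without Unicode awareness can be sketched in Python 3, where string patterns are Unicode-aware by default and the re.ASCII flag restricts the shorthands to ASCII. The sample phrase is made up for illustration.

```python
import re

phrase = "cr\u00e8me br\u00fbl\u00e9e"  # "crème brûlée"

# Unicode-aware \w treats accented letters as word characters.
unicode_words = re.findall(r"\w+", phrase)

# ASCII-only \w splits the words at every accented letter.
ascii_words = re.findall(r"\w+", phrase, re.ASCII)

# One character, but more than one byte when encoded as UTF-8.
encoded_length = len("\u00e8".encode("utf-8"))

print(unicode_words, ascii_words, encoded_length)
```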
Regular Expression Cookbook
This section contains simple versions of common regular expression patterns. You may need to adjust them to meet your needs.
Each expression is presented here with target strings that it matches, and target strings that it does not match, so you can get a sense of what adjustments you may need to make for your own use cases.
They are written in the Perl style:
/pattern/mode
s/pattern/replacement/mode
Removing leading and trailing whitespace
s/^\s+//
s/\s+$//
Matches: " foo bar ", "foo "
Nonmatches: "foo bar"
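For readers working outside Perl, the same two substitutions might be written in Python as follows. The trim function is a hypothetical name chosen for this sketch.

```python
import re


def trim(text):
    """Hypothetical helper: strip leading, then trailing, whitespace."""
    without_leading = re.sub(r"^\s+", "", text)
    return re.sub(r"\s+$", "", without_leading)


print(repr(trim(" foo bar ")))
print(repr(trim("foo ")))
print(repr(trim("foo bar")))  # nothing to remove, returned unchanged
```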
Match date: MM/DD/YYYY HH:MM:SS
Matches: tony@example.com, tony@i-e.com, tony@mail.example.museum
Nonmatches: .@example.com, tony@i-.com, tony@example.a
(See MRE 70.)
HTTP URL
/(https?):\/\/([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+
([a-zA-Z]{2,9})
(:\d{1,4})?([-\w\/#~:.?+=&%@~]*)/
Matches: https://example.com, http://foo.com:8080/bar.html
Nonmatches: ftp://foo.com, ftp://foo.com/
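The HTTP URL pattern can be checked against its listed targets with a Python sketch. It assumes the top-level-domain part is meant to be its own parenthesized group, so that the parentheses balance; HTTP_URL and the group numbering are names and details of this example only.

```python
import re

# Groups: 1 scheme, 2 last domain label, 3 top-level domain,
# 4 optional port, 5 path and query.
HTTP_URL = re.compile(
    r"(https?)://([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+"
    r"([a-zA-Z]{2,9})"
    r"(:\d{1,4})?([-\w/#~:.?+=&%@~]*)"
)

for candidate in ["https://example.com",
                  "http://foo.com:8080/bar.html",
                  "ftp://foo.com"]:
    match = HTTP_URL.match(candidate)
    print(candidate, match.groups() if match else None)
```

Because the slash is not a delimiter in Python, it does not need the backslash escape used in the Perl-style notation.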
Perl 5.8
Perl provides a rich set of regular-expression operators, constructs, and features, with more being added in each new release. Perl uses a Traditional NFA match engine. For an explanation of the rules behind an NFA engine, see “Introduction to Regexes and Pattern Matching.”
This reference covers Perl version 5.8. A number of new features will be introduced in Perl 5.10; these are covered in Table 8. Unicode features were introduced in 5.6, but did not stabilize until 5.8. Most other features work in versions 5.004 and later.
Supported Metacharacters
Perl supports the metacharacters and metasequences listed in Table 3 through Table 7. To learn more about expanded definitions of each metacharacter, see “Regex Metacharacters, Modes, and Constructs.”
Table 3 Perl character representations
Sequence Meaning
\a Alert (bell)
\b Backspace; supported only in character class (outside of character class matches a word boundary)
\e Esc character, x1B
\n Newline; x0A on Unix and Windows, x0D on Mac OS 9
\r Carriage return; x0D on Unix and Windows, x0A on Mac OS 9
\f Form feed, x0C
\t Horizontal tab, x09
\octal Character specified by a two- or three-digit octal code
\xhex Character specified by a one- or two-digit hexadecimal code
\x{hex} Character specified by any hexadecimal code
\cchar Named control character
\N{name} A named character specified in the Unicode standard or listed in PATH_TO_PERLLIB/unicode/Names.txt; requires use charnames ':full'
Table 4 Perl character classes and class-like constructs
[ ] A single character listed, or contained in a listed range
[^ ] A single character not listed, and not contained within a listed range
[:class:] POSIX-style character class valid only within a regex character class
. Any character except newline (unless single-line mode, /s)
\C One byte; however, this may corrupt a Unicode character stream
\X Base character, followed by any number of Unicode combining characters
\w Word character, \p{IsWord}
\W Nonword character, \P{IsWord}
\d Digit character, \p{IsDigit}
\D Nondigit character, \P{IsDigit}
\s Whitespace character, \p{IsSpace}
\S Nonwhitespace character, \P{IsSpace}
\p{prop} Character contained by given Unicode property, script, or block
\P{prop} Character not contained by given Unicode property, script, or block
Table 5 Perl anchors and zero-width tests
Sequence Meaning
^ Start of string, or, in multiline match mode (/m), the position after any newline
\A Start of search string, in all match modes
$ End of search string or the point before a string-ending newline, or, in multiline match mode (/m), the position before any newline
\Z End of string, or the point before a string-ending newline, in any match mode
\z End of string, in any match mode
\G Beginning of current search
\b Word boundary
\B Not-word-boundary
(?= ) Positive lookahead
(?! ) Negative lookahead
(?<= ) Positive lookbehind; fixed-length only
(?<! ) Negative lookbehind; fixed-length only
Table 6 Perl comments and mode modifiers
Modifier Meaning
/i Case-insensitive matching
/m ^ and $ match next to embedded \n
/s Dot (.) matches newline
/x Ignore whitespace, and allow comments (#) in pattern
/o Compile pattern only once
(?mode) Turn listed modes (one or more of xsmi) on for the rest of the subexpression
(?# ) Treat substring as a comment
# Treat rest of line as a comment in /x mode
\u Force next character to uppercase
\l Force next character to lowercase
\U Force all following characters to uppercase
\L Force all following characters to lowercase
\Q Quote all following regex metacharacters
\E End a span started with \U, \L, or \Q
Table 7 Perl grouping, capturing, conditional, and control
Sequence Meaning
( ) Group subpattern and capture submatch into \1, \2, and $1, $2,
\n Contains text matched by the nth capture group
(?: ) Groups subpattern, but does not capture submatch
(?> ) Atomic grouping
| Try subpatterns in alternation
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{x,y} Match at least x times, but no more than y times
*? Match 0 or more times, but as few times as possible
+? Match 1 or more times, but as few times as possible
?? Match 0 or 1 times, but as few times as possible
{n,}? Match at least n times, but as few times as possible
{x,y}? Match at least x times, and no more than y times, but as few times as possible
(?(COND) | ) Match with if-then-else pattern, where COND is an integer referring to a backreference, or a lookaround assertion
(?(COND) ) Match with if-then pattern
(?{CODE}) Execute embedded Perl code
(??{CODE}) Match regex from embedded Perl code
Table 8 New features in Perl 5.10
Backreference to named capture group
%+ Hash reference to the leftmost capture of a given name, $+{foo}
%- Hash reference to an array of all captures of a given name, $-{foo}[0]
\g{n} or \gn Backreference to the nth capture
\g{-n} or \g-n Relative backreference to the nth previous capture
(?n) Recurse into the nth capture buffer
(?&NAME) Recurse into the named capture buffer
(?R) Recursively call the entire expression
(?(DEFINE) ) Define a subexpression that can be recursed into
(*FAIL) Fail submatch, and force the engine to backtrack
(*ACCEPT) Force engine to accept the match, even if there is more pattern to check
(*PRUNE) Cause the match to fail from the current starting position
(*MARK:name) Marks and names the current position in the string. The position is available in $REGMARK
(*SKIP:name) Reject all matches up to the point where the named MARK was executed
(*THEN) When backtracked into, skip to the next alternation
Regular Expression Operators
Perl provides the built-in regular expression operators qr//, m//, and s///, as well as the split function. Each operator accepts a regular expression pattern string that is run through string and variable interpolation, and then compiled.
Regular expressions are often delimited with the forward slash, but you can pick any nonalphanumeric, non-whitespace character. Here are some examples:
Using the single quote as a delimiter suppresses interpolation of variables and the constructs \N{name}, \u, \l, \U, \L, \Q, and \E. Normally, these are interpolated before being passed to the regular expression engine.
qr// (Quote Regex)
qr/PATTERN/ismxo
Quote and compile PATTERN as a regular expression. The returned value may be used in a later pattern match or substitution. This saves time if the regular expression is going to be interpolated repeatedly. The match modes (or lack of), /ismxo, are locked in.
Table 8 New features in Perl 5.10 (continued)
Modifier Meaning
(*COMMIT) When backtracked into, cause the match to fail outright
/p Mode modifier that enables the ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} variables
\K Exclude previously matched text from the final match
m// (Matching)
m/PATTERN/imsxocg
Match PATTERN against input string. In list context, returns a list of substrings matched by capturing parentheses, or else (1) for a successful match or () for a failed match. In scalar context, returns 1 for success, or "" for failure. /imsxo are optional mode modifiers. /cg are optional match modifiers. /g in scalar context causes the match to start from the end of the previous match. In list context, a /g match returns all matches, or all captured substrings from all matches. A failed /g match will reset the match start to the beginning of the string, unless the match is in combined /cg mode.
s/// (Substitution)
s/PATTERN/REPLACEMENT/egimosx
Match PATTERN in the input string, and replace the match text with REPLACEMENT, returning the number of successes. /imosx are optional mode modifiers. /g substitutes all occurrences of PATTERN. Each /e causes an evaluation of REPLACEMENT as Perl code.
split
split /PATTERN/, EXPR, LIMIT
split /PATTERN/, EXPR
split /PATTERN/
split
Return a list of substrings surrounding matches of PATTERN in EXPR. If LIMIT is included, the list contains substrings surrounding the first LIMIT matches. The pattern argument is a match operator, so use m if you want alternate delimiters (e.g., split m{PATTERN}). The match permits the same modifiers as m{}. Table 9 lists the after-match variables.
Perl supports the standard Unicode properties (see Table 2) as well as Perl-specific composite properties (see Table 10). Scripts and properties may have an Is prefix, but do not require it. Blocks require an In prefix only if the block name conflicts with a script name.
Table 9 Perl after-match variables
$+ Last parenthesized match
$` Text before match. Causes all regular expressions to be slower.
$^N Text of most recently closed capturing parentheses
$* If true, /m is assumed for all matches without a /s
$^R The result value of the most recently executed code construct
within a pattern match
Example 1 Simple match
# Find Spider-Man, Spiderman, SPIDER-MAN, etc.
my $dailybugle = "Spider-Man Menaces City!";
if ($dailybugle =~ m/spider[- ]?man/i) { do_something(); }
Example 2 Match, capture group, and qr
# Match dates formatted like MM/DD/YYYY, MM-DD-YY,
Other Resources
• Programming Perl, by Larry Wall et al. (O’Reilly), is the standard Perl reference.
• Mastering Regular Expressions, Third Edition, by Jeffrey E. F. Friedl (O’Reilly), covers the details of Perl regular expressions on pages 283–364.
• perlre is the perldoc documentation provided with most Perl distributions.
Example 3 Simple substitution
# Convert <br> to <br /> for XHTML compliance
my $text = "Hello World! <br>";
$text =~ s#<br>#<br />#ig;
Example 4 Harder substitution
# urlify - turn URLs into HTML links
$text = "Check the web site, http://www.oreilly.com/catalog/regexppr.";
# resource and colon
[\w/#~:.?+=&%@!\-] +? # one or more valid
Java (java.util.regex)
Java 1.4 introduced regular expressions with Sun’s java.util.regex package. Although there are competing packages available for previous versions of Java, Sun’s is now the standard. Sun’s package uses a Traditional NFA match engine. For an explanation of the rules behind a Traditional NFA engine, see “Introduction to Regexes and Pattern Matching.” This section covers regular expressions in Java 1.5 and 1.6.
Supported Metacharacters
java.util.regex supports the metacharacters and metasequences listed in Table 11 through Table 15. For expanded definitions of each metacharacter, see “Regex Metacharacters, Modes, and Constructs.”
Table 11 Java character representations
\xhex Character specified by a two-digit hexadecimal code
\uhex Unicode character specified by a four-digit hexadecimal
code
\cchar Named control character
Table 12 Java character classes and class-like constructs
[ ] A single character listed or contained in a listed range
[^ ] A single character not listed and not contained within a listed range
. Any character, except a line terminator (unless DOTALL mode)
\w Word character, [a-zA-Z0-9_]
\W Nonword character, [^a-zA-Z0-9_]
\d Digit, [0-9]
\D Nondigit, [^0-9]
\s Whitespace character, [ \t\n\f\r\x0B]
\S Nonwhitespace character, [^ \t\n\f\r\x0B]
\p{prop} Character contained by given POSIX character class, Unicode property, or Unicode block
\P{prop} Character not contained by given POSIX character class, Unicode property, or Unicode block
Table 13 Java anchors and other zero-width tests
^ Start of string, or the point after any newline if in MULTILINE mode
\A Beginning of string, in any match mode
$ End of string, or the point before any newline if in MULTILINE mode
(?= ) Positive lookahead
(?! ) Negative lookahead
(?<= ) Positive lookbehind
(?<! ) Negative lookbehind
Table 14 Java comments and mode modifiers
Modifier/sequence Mode character Meaning
Pattern.UNIX_LINES d Treat \n as the only line terminator
Pattern.DOTALL s Dot (.) matches any character, including a line terminator
Pattern.MULTILINE m ^ and $ match next to embedded line terminators
Pattern.COMMENTS x Ignore whitespace, and allow embedded comments starting with #
Pattern.CANON_EQ Unicode “canonical equivalence” mode, where characters, or sequences of a base character and combining characters with identical visual representations, are treated as equals
(?mode) Turn listed modes (one or more of idmsux) on for the rest of the subexpression
(?-mode) Turn listed modes (one or more of idmsux) off for the rest of the subexpression
(?mode: ) Turn listed modes (one or more of idmsux) on within parentheses
(?-mode: ) Turn listed modes (one or more of idmsux) off within parentheses
# Treat rest of line as a comment in
\n Contains text matched by the nth capture group
$n In a replacement string, contains text matched by the nth capture group
(?: ) Groups subpattern, but does not capture submatch
(?> ) Atomic grouping
| Try subpatterns in alternation
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{x,y} Match at least x times, but no more than y times
*? Match 0 or more times, but as few times as possible
+? Match 1 or more times, but as few times as possible
?? Match 0 or 1 times, but as few times as possible
{n,}? Match at least n times, but as few times as possible
{x,y}? Match at least x times, no more than y times, and as few times as possible
*+ Match 0 or more times, and never backtrack
++ Match 1 or more times, and never backtrack
Regular Expression Classes and Interfaces
Regular expression functions are contained in two main classes, java.util.regex.Pattern and java.util.regex.Matcher; an exception, java.util.regex.PatternSyntaxException; and an interface, CharSequence. Additionally, the String class implements the CharSequence interface to provide basic pattern-matching methods. Pattern objects are compiled regular expressions that can be applied to any CharSequence. A Matcher is a stateful object that scans for one or more occurrences of a Pattern applied in a string (or any object implementing CharSequence).
Backslashes in regular expression String literals need to be escaped. So, \n (newline) becomes \\n when used in a Java String literal that is to be used as a regular expression.
java.lang.String
Description
Methods for pattern matching
Methods
boolean matches(String regex)
Return true if regex matches the entire String.
String[] split(String regex)
Return an array of the substrings surrounding matches of regex.
Table 15 Java grouping, capturing, conditional, and control (continued)
?+ Match 0 or 1 times, and never backtrack
{n}+ Match exactly n times, and never backtrack
{n,}+ Match at least n times, and never backtrack
{x,y}+ Match at least x times, no more than y times, and never backtrack
String[] split(String regex, int limit)
Return an array of the substrings surrounding the first limit-1 matches of regex.
String replaceFirst(String regex, String replacement)
Replace the substring matched by regex with replacement.
String replaceAll(String regex, String replacement)
Replace all substrings matched by regex with replacement.
java.util.regex.Pattern
Description
Models a regular expression pattern
Methods
static Pattern compile(String regex)
Construct a Pattern object from regex.
static Pattern compile(String regex, int flags)
Construct a new Pattern object out of regex, and the OR’d mode-modifier constants flags.
int flags()
Return the Pattern’s mode modifiers.
Matcher matcher(CharSequence input)
Construct a Matcher object that will match this Pattern against input.
static boolean matches(String regex, CharSequence input)
Return true if regex matches the entire string input.
String pattern()
Return the regular expression used to create this Pattern.
static String quote(String text)
Escapes the text so that regular expression operators will be matched literally.
String[] split(CharSequence input)
Return an array of the substrings surrounding matches of this Pattern in input.
String[] split(CharSequence input, int limit)
Return an array of the substrings surrounding the first limit-1 matches of this Pattern in input.