Regular Expression Pocket Reference, 2nd Edition


by Tony Stubblebine

this book are based on Mastering Regular Expressions, by Jeffrey E F Friedl,

Editor: Andy Oram

Production Editor: Sumita Mukherji

Copyeditor: Genevieve d’Entremont

Indexer: Johnna VanHoose Dinse

Cover Designer: Karen Montgomery

Interior Designer: David Futato

Printing History:

August 2003: First Edition

July 2007: Second Edition

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. The Pocket Reference series designations, Regular Expression Pocket Reference, the image of owls, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

Java™ is a trademark of Sun Microsystems, Inc. Microsoft Internet Explorer and .NET are registered trademarks of Microsoft Corporation. Spider-Man is a registered trademark of Marvel Enterprises, Inc.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Contents

Introduction to Regexes and Pattern Matching
Regex Metacharacters, Modes, and Constructs

.NET and C#
    Supported Metacharacters
    Regular Expression Classes and Interfaces

Supported Metacharacters
Pattern-Matching Methods and Objects

is well-formed.

Today, regular expressions are included in most programming languages, as well as in many scripting languages, editors, applications, databases, and command-line tools. This book aims to give quick access to the syntax and pattern-matching operations of the most popular of these languages so that you can apply your regular-expression knowledge in any environment.

The second edition of this book adds sections on Ruby and Apache web server, common regular expressions, and also updates existing languages.

About This Book

This book starts with a general introduction to regular expressions. The first section describes and defines the constructs used in regular expressions, and establishes the common principles of pattern matching. The remaining sections of the book are devoted to the syntax, features, and usage of regular expressions in various implementations.

The implementations covered in this book are Perl, Java™, .NET and C#, Ruby, Python, PCRE, PHP, Apache web server, vi editor, JavaScript, and shell tools.

Conventions Used in This Book

The following typographical conventions are used in this book:

Constant width italic

Used for text that should be replaced with user-supplied values.

Constant width bold

Used in examples for commands or other text that should be typed literally by the user.

Acknowledgments

Jeffrey E. F. Friedl’s Mastering Regular Expressions (O’Reilly) is the definitive work on regular expressions. While writing, I relied heavily on his book and his advice. As a convenience, this book provides page references to Mastering Regular Expressions, Third Edition (MRE) for expanded discussion of regular expression syntax and concepts.

Nat Torkington and Linda Mui were excellent editors who guided me through what turned out to be a tricky first edition. This edition was aided by the excellent editorial skills of Andy Oram. Sarah Burcham deserves special thanks for giving me the opportunity to write this book, and for her contributions to the “Shell Tools” section. More thanks for the input and technical reviews from Jeffrey Friedl, Philip Hazel, Steve Friedl, Ola Bini, Ian Darwin, Zak Greant, Ron Hitchens, A.M. Kuchling, Tim Allwine, Schuyler Erle, David Lents, Rabble, Rich Bowan, Eric Eisenhart, and Brad Merrill.

Introduction to Regexes and Pattern Matching

A regular expression is a string containing a combination of normal characters and special metacharacters or metasequences. The normal characters match themselves. Metacharacters and metasequences are characters or sequences of characters that represent ideas such as quantity, locations, or types of characters. The list in “Regex Metacharacters, Modes, and Constructs” shows the most common metacharacters and metasequences in the regular expression world. Later sections list the availability of and syntax for supported metacharacters for particular implementations of regular expressions.

Pattern matching consists of finding a section of text that is described (matched) by a regular expression. The underlying code that searches the text is the regular expression engine. You can predict the results of most matches by keeping two rules in mind:

1. The earliest (leftmost) match wins

Regular expressions are applied to the input starting at the first character and proceeding toward the last. As soon as the regular expression engine finds a match, it returns. (See MRE 148–149.)

2. Standard quantifiers are greedy

Quantifiers specify how many times something can be repeated. The standard quantifiers attempt to match as many times as possible. They settle for less than the maximum only if this is necessary for the success of the match. The process of giving up characters and trying less-greedy matches is called backtracking. (See MRE 151–153.)
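The two rules can be checked with a short script. The sketch below assumes Python's re module, a backtracking engine picked here only for illustration; the sample strings are invented for the demonstration.

```python
import re

# Rule 1: the leftmost match wins, even though "345" further right is longer.
first = re.search(r"\d+", "a 12 b 345")
leftmost = (first.start(), first.group())

# Rule 2: the greedy .* takes as much as it can, so the match runs to the
# last </b> in the string rather than stopping at the first one.
greedy = re.search(r"<b>.*</b>", "<b>one</b> and <b>two</b>").group()

# Backtracking: \d+ first takes all four digits, then gives one back so
# that the trailing \d can still match.
backtracked = re.search(r"(\d+)(\d)", "2007").groups()

print(leftmost)
print(greedy)
print(backtracked)
```

The last search shows the greedy quantifier settling for less than the maximum only because the overall match would otherwise fail.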

Regular expression engines have differences based on their type. There are two classes of engines: Deterministic Finite Automaton (DFA) and Nondeterministic Finite Automaton (NFA). DFAs are faster, but lack many of the features of an NFA, such as capturing, lookaround, and nongreedy quantifiers. In the NFA world, there are two types: traditional and POSIX.

DFA engines

DFAs compare each character of the input string to the regular expression, keeping track of all matches in progress. Since each character is examined at most once, the DFA engine is the fastest. One additional rule to remember with DFAs is that the alternation metasequence is greedy. When more than one option in an alternation (foo|foobar) matches, the longest one is selected. So, rule No. 1 can be amended to read “the longest leftmost match wins.” (See MRE 155–156.)

Traditional NFA engines

Traditional NFA engines compare each element of the regex to the input string, keeping track of positions where they chose between two options in the regex. If an option fails, the engine backtracks to the most recently saved position. For standard quantifiers, the engine chooses the greedy option of matching more text; however, if that option leads to the failure of the match, the engine returns to a saved position and tries a less greedy path. The traditional NFA engine uses ordered alternation, where each option in the alternation is tried sequentially. A longer match may be ignored if an earlier option leads to a successful match. So, here rule No. 1 can be amended to read “the first leftmost match after greedy quantifiers have had their fill wins.” (See MRE 153–154.)

POSIX NFA engines

POSIX NFA engines work similarly to traditional NFAs with one exception: a POSIX engine always picks the longest of the leftmost matches. For example, the alternation cat|category would match the full word “category” whenever possible, even if the first alternative (“cat”) matched and appeared earlier in the alternation.
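Ordered alternation is easy to observe in a backtracking engine. The following sketch assumes Python's re module as a stand-in for a traditional NFA; by the description above, a DFA or POSIX NFA engine would be expected to prefer the longer alternative in the first search.

```python
import re

# Ordered alternation: the first alternative that succeeds is kept,
# so the longer "foobar" is never tried.
ordered = re.search(r"foo|foobar", "foobar").group()

# Listing the longer alternative first changes the result.
reordered = re.search(r"foobar|foo", "foobar").group()

# Requiring the whole string to match makes "cat" fail at the end of the
# pattern, so the engine backtracks into the second alternative.
whole = re.fullmatch(r"cat|category", "category").group()

print(ordered, reordered, whole)
```

When the order of alternatives matters, putting the longest option first is one simple way to get longest-match behavior from this kind of engine.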

Regex Metacharacters, Modes, and Constructs

The metacharacters and metasequences shown here represent most available types of regular expression constructs and their most common syntax. However, syntax and availability vary by implementation.

Character representations

Many implementations provide shortcuts to represent characters that may be difficult to input. (See MRE 115–118.)

Character shorthands

Most implementations have specific shorthands for the alert, backspace, escape character, form feed, newline, carriage return, horizontal tab, and vertical tab characters. For example, \n is often a shorthand for the newline character, which is usually LF (012 octal), but can sometimes be CR (015 octal), depending on the operating system. Confusingly, many implementations use \b to mean both backspace and word boundary (position between a “word” character and a nonword character). For these implementations, \b means backspace in a character class (a set of possible characters to match in the string), and word boundary elsewhere.

Octal escape: \num

Represents a character corresponding to a two- or three-digit octal number. For example, \015\012 matches an ASCII CR/LF sequence.

Hex and Unicode escapes: \xnum, \x{num}, \unum, \Unum

Represent characters corresponding to hexadecimal numbers. Four-digit and larger hex numbers can represent the range of Unicode characters. For example, \x0D\x0A matches an ASCII CR/LF sequence.

Control characters: \cchar

Corresponds to ASCII control characters encoded with values less than 32. To be safe, always use an uppercase char; some implementations do not handle lowercase representations. For example, \cH matches Control-H, an ASCII backspace character.
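The escapes above can be tried directly. This sketch assumes Python's re module; the \cchar form is left out because it is assumed not to be available there, and the sample strings are illustrative.

```python
import re

crlf = "\r\n"

# Octal and hexadecimal escapes both describe the CR/LF pair.
octal_match = re.fullmatch(r"\015\012", crlf) is not None
hex_match = re.fullmatch(r"\x0D\x0A", crlf) is not None

# A four-digit Unicode escape for e with a grave accent (U+00E8).
unicode_match = re.fullmatch(r"\u00e8", "\u00e8") is not None

# Inside a character class, \b is a backspace (0x08)...
backspace_in_class = re.fullmatch(r"[\b]", "\x08") is not None

# ...while elsewhere it is a word boundary, so "concat" is skipped.
boundary_hits = re.findall(r"\bcat\b", "cat concat cat.")

print(octal_match, hex_match, unicode_match, backspace_in_class)
print(boundary_hits)
```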

Character classes and class-like constructs

Character classes are used to specify a set of characters. A character class matches a single character in the input string that is within the defined set of characters. (See MRE 118–128.)

Normal classes: [ ] and [^ ]

Character classes, [ ], and negated character classes, [^ ], allow you to list the characters that you do or do not want to match. A character class always matches one character. The - (dash) indicates a range of characters. For example, [a-z] matches any lowercase ASCII letter. To include the dash in the list of characters, either list it first, or escape it.

Almost any character: dot (.)

Usually matches any character except a newline. However, the match mode usually can be changed so that dot also matches newlines. Inside a character class, dot matches just a dot.

Class shorthands: \w, \d, \s, \W, \D, \S

Commonly provided shorthands for word character, digit, and space character classes. A word character is often all ASCII alphanumeric characters plus the underscore. However, the list of alphanumerics can include additional locale or Unicode alphanumerics, depending on the implementation. A lowercase shorthand (e.g., \s) matches a character from the class; uppercase (e.g., \S) matches a character not from the class. For example, \d matches a single digit character, and is usually equivalent to [0-9].

POSIX character class: [:alnum:]

POSIX defines several character classes that can be used only within regular expression character classes (see Table 1). Take, for example, [:lower:]. When written as

Unicode properties, scripts, and blocks: \p{prop}, \P{prop}

The Unicode standard defines classes of characters that have a particular property, belong to a script, or exist within a block. Properties are the character’s defining characteristics, such as being a letter or a number (see Table 2). Scripts are systems of writing, such as Hebrew, Latin, or Han. Blocks are ranges of characters on the Unicode character map. Some implementations require that Unicode properties be prefixed with Is or In. For example, \p{Ll} matches lowercase letters in any Unicode-supported language, such as a or α.

Unicode combining character sequence: \X

Matches a Unicode base character followed by any number of Unicode combining characters. This is a shorthand for \P{M}\p{M}*. For example, \X matches è, as well as the two characters e'.
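The basic class constructs described above can be sketched as follows. Python's standard re module is assumed; it is not expected to provide the POSIX bracket or \p{prop} forms, so only normal classes, dot, and the shorthands are shown. The sample string is made up.

```python
import re

sample = "Regex-2007"

lowercase = re.findall(r"[a-z]", sample)        # a range of characters
not_lowercase = re.findall(r"[^a-z]", sample)   # negated class
dash_or_digit = re.findall(r"[-0-9]", sample)   # dash listed first is literal
digit_runs = re.findall(r"\d+", sample)         # shorthand for digits
non_digit_runs = re.findall(r"\D+", sample)     # uppercase form negates it

# Dot skips newlines unless the match mode is changed.
dot_default = re.findall(r".", "a\nb")
dot_all = re.findall(r".", "a\nb", re.DOTALL)

# Inside a class, dot is just a dot.
dot_in_class = re.findall(r"[.]", "a.b")

print(lowercase, not_lowercase, dash_or_digit)
print(digit_runs, non_digit_runs)
print(dot_default, dot_all, dot_in_class)
```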

Table 1. POSIX character classes

Class     Meaning
Digit     Decimal digits
Graph     Printing characters, excluding space
Lower     Lowercase letters
Print     Printing characters, including space
Punct     Printing characters, excluding letters and digits
Space     Whitespace
Upper     Uppercase letters
Xdigit    Hexadecimal digits

Table 2. Standard Unicode properties

Property  Meaning
\p{Lu}    Uppercase letters
\p{C}     Control codes and characters not in other categories
\p{Cc}    ASCII and Latin-1 control characters
\p{Cf}    Nonvisible formatting characters
\p{Cn}    Unassigned code points
\p{Co}    Private use, such as company logos
\p{Cs}    Surrogates
\p{M}     Marks meant to combine with base characters, such as accent marks
\p{Mc}    Modification characters that take up their own space. Examples include “vowel signs.”
\p{Me}    Marks that enclose other characters, such as circles, squares, and diamonds
\p{Mn}    Characters that modify other characters, such as accents and umlauts
\p{N}     Numeric characters
\p{Nd}    Decimal digits in various scripts
\p{Nl}    Letters that represent numbers, such as Roman numerals
\p{No}    Superscripts, symbols, or nondigit characters representing numbers
\p{P}     Punctuation
\p{Pc}    Connecting punctuation, such as an underscore
\p{Pd}    Dashes and hyphens
\p{Pe}    Closing punctuation complementing \p{Ps}
\p{Pi}    Initial punctuation, such as opening quotes
\p{Pf}    Final punctuation, such as closing quotes
\p{Po}    Other punctuation marks
\p{Ps}    Opening punctuation, such as opening parentheses
\p{S}     Symbols
\p{Sc}    Currency
\p{Sk}    Combining characters represented as individual characters
\p{Sm}    Math symbols
\p{So}    Other symbols
\p{Z}     Separating characters with no visual representation

Anchors and zero-width assertions

Anchors and “zero-width assertions” match positions in the input string. (See MRE 128–134.)

Start of line/string: ^, \A

Matches at the beginning of the text being searched. In multiline mode, ^ matches after any newline. Some implementations support \A, which matches only at the beginning of the text.

End of line/string: $, \Z, \z

$ matches at the end of a string. In multiline mode, $ matches before any newline. When supported, \Z matches the end of string or the point before a string-ending newline, regardless of match mode. Some implementations also provide \z, which matches only the end of the string, regardless of newlines.

Start of match: \G

In iterative matching, \G matches the position where the previous match ended. Often, this spot is reset to the beginning of a string on a failed match.

Word boundary: \b, \B, \<, \>

Word boundary metacharacters match a location where a word character is next to a nonword character. \b often specifies a word boundary location, and \B often specifies a not-word-boundary location. Some implementations provide separate metasequences for start- and end-of-word boundaries, often \< and \>.

Lookahead: (?= ), (?! )

Lookbehind: (?<= ), (?<! )

Lookaround constructs match a location in the text where the subpattern would match (lookahead), would not match (negative lookahead), would have finished matching (lookbehind), or would not have finished matching (negative lookbehind). For example, foo(?=bar) matches foo in foobar, but not food. Implementations often limit lookbehind constructs to subpatterns with a predetermined length.
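The position-matching constructs above can be sketched in a few lines. Python's re module is assumed, with re.MULTILINE standing in for multiline mode; the \G, \<, and \> forms are not shown because they are assumed to be unavailable there. The sample text is invented.

```python
import re

text = "first line\nsecond line\n"

# ^ matches only at the start of the text unless multiline mode is on.
start_default = re.findall(r"^\w+", text)
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# \A ignores the mode and matches only at the very beginning.
start_A = re.findall(r"\A\w+", text, re.MULTILINE)

# $ matches before the string-ending newline, or before any newline
# when multiline mode is on.
end_default = re.findall(r"\w+$", text)
end_multiline = re.findall(r"\w+$", text, re.MULTILINE)

# Lookahead checks what follows without consuming it.
ahead = re.search(r"foo(?=bar)", "foobar").group()
ahead_miss = re.search(r"foo(?=bar)", "food")

# Lookbehind (fixed length here) checks what precedes the match.
behind = re.search(r"(?<=\$)\d+", "cost: $42").group()

print(start_default, start_multiline, start_A)
print(end_default, end_multiline)
print(ahead, ahead_miss, behind)
```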

Comments and mode modifiers

Mode modifiers change how the regular expression engine interprets a regular expression. (See MRE 110–113, 135–136.)

Mode modifiers: (?i), (?-i), (?mod: )

Usually, mode modifiers may be set within a regular expression with (?mod) to turn modes on for the rest of the current subexpression; (?-mod) to turn modes off for the rest of the current subexpression; and (?mod: ) to turn modes on or off between the colon and the closing parentheses. For example, use (?i:perl) matches use perl, use Perl, use PeRl, etc.

Literal-text span: \Q \E

Escapes metacharacters between \Q and \E. For example, \Q(.*)\E is the same as \(\.\*\).
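A scoped mode modifier and a literal-text span can be illustrated as follows. The sketch assumes Python 3.6 or later, where the (?i: ) form is accepted; Python has no \Q \E span, so re.escape is used here as the nearest equivalent, which is a substitution rather than the construct described above.

```python
import re

# Case-insensitivity applies only inside the (?i: ) group,
# so "use" must still be lowercase.
scoped = re.compile(r"use (?i:perl)")
results = [scoped.fullmatch(s) is not None
           for s in ("use perl", "use Perl", "use PeRl", "USE perl")]

# re.escape backslash-escapes the metacharacters, giving the same
# pattern text that a literal-text span would stand for.
escaped = re.escape("(.*)")
literal_hit = re.search(escaped, "call f(.*) here").group()

print(results)
print(escaped, literal_hit)
```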

Grouping, capturing, conditionals, and control

This section covers syntax for grouping subpatterns, capturing submatches, conditional submatches, and quantifying the number of times a subpattern matches. (See MRE 137–142.)

Capturing and grouping parentheses: ( ) and \1, \2, etc.

Parentheses perform two functions: grouping and capturing. Text matched by the subpattern within parentheses is captured for later use. Capturing parentheses are numbered by counting their opening parentheses from the left. If backreferences are available, the submatch can be referred to later in the same match with \1, \2, etc. The captured text is made available after a match by implementation-specific methods. For example, \b(\w+)\b\s+\1\b matches duplicate words, such as the the.

Grouping-only parentheses: (?: )

Groups a subexpression, possibly for alternation or quantifiers, but does not capture the submatch. This is useful for efficiency and reusability. For example, (?:foobar) matches foobar, but does not save the match to a capture group.

Named capture: (?<name> )

Performs capturing and grouping, with captured text later referenced by name. For example, Subject:(?<subject>.*) captures the text following Subject: to a capture group that can be referenced by the name subject.

matches the words foo or bar

Conditional: (?(if)then|else)

The if is implementation-dependent, but generally is a reference to a captured subexpression or a lookaround. The then and else parts are both regular expression patterns. If the if part is true, the then is applied. Otherwise, else is applied. For example, (<)?foo(?(1)>|bar) matches <foo> as well as foobar.

Greedy quantifiers: *, +, ?, {num,num}

The greedy quantifiers determine how many times a construct may be applied. They attempt to match as many times as possible, but will backtrack and give up matches if this is necessary for the success of the overall match.

Lazy quantifiers: *?, +?, ??, {num,num}?

Lazy quantifiers control how many times a construct may be applied. However, unlike greedy quantifiers, they attempt to match as few times as possible. For example, (an)+? matches only an of banana.

Possessive quantifiers: *+, ++, ?+, {num,num}+

Possessive quantifiers are like greedy quantifiers, except that they “lock in” their match, disallowing later backtracking to break up the submatch. For example, (ab)++ab will not match ababababab.
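Several of the constructs above can be exercised together. This sketch assumes Python's re module, which spells named capture as (?P<name> ) rather than (?<name> ); possessive quantifiers are left out because they are only accepted by newer Python releases. The input strings are illustrative.

```python
import re

# Backreference: \1 must repeat whatever the first group captured.
dup = re.search(r"\b(\w+)\b\s+\1\b", "Paris in the the spring")
duplicate = (dup.group(), dup.group(1))

# Named capture (Python spelling).
named = re.search(r"Subject: (?P<subject>.*)", "Subject: Hello")
subject = named.group("subject")

# Conditional: require ">" only if the optional "<" was captured.
cond = re.compile(r"(<)?foo(?(1)>|bar)")
cond_results = [cond.fullmatch(s) is not None
                for s in ("<foo>", "foobar", "<foobar", "foo>")]

# Greedy versus lazy repetition of the same group.
greedy = re.search(r"(an)+", "banana").group()
lazy = re.search(r"(an)+?", "banana").group()

print(duplicate, subject)
print(cond_results)
print(greedy, lazy)
```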

Unicode Support

The Unicode character set gives unique numbers to the characters in all the world’s languages. Because of the large number of possible characters, Unicode requires more than one byte to represent a character. Some regular expression implementations will not understand Unicode characters because they expect 1-byte ASCII characters. Basic support for Unicode characters starts with the ability to match a literal string of Unicode characters. Advanced support includes character classes and other constructs that incorporate characters from all Unicode-supported languages. For example, \w might match è as well as e.
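The difference between ASCII-only and Unicode-aware matching of \w can be seen in a small experiment. The sketch assumes Python 3, where string patterns are Unicode-aware by default and the re.ASCII flag restricts the shorthands to ASCII.

```python
import re

e_grave = "\u00e8"  # the character è

# More than one byte is needed to store this character in UTF-8.
utf8_length = len(e_grave.encode("utf-8"))

# Unicode-aware matching treats it as a word character...
unicode_word = re.fullmatch(r"\w", e_grave) is not None

# ...while ASCII-only matching does not.
ascii_word = re.fullmatch(r"\w", e_grave, re.ASCII) is not None

# A plain ASCII letter is a word character either way.
plain_word = re.fullmatch(r"\w", "e", re.ASCII) is not None

print(utf8_length, unicode_word, ascii_word, plain_word)
```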

Regular Expression Cookbook

This section contains simple versions of common regular expression patterns. You may need to adjust them to meet your needs.

Each expression is presented here with target strings that it matches, and target strings that it does not match, so you can get a sense of what adjustments you may need to make for your own use cases.

They are written in the Perl style:

/pattern/mode

s/pattern/replacement/mode

Removing leading and trailing whitespace

s/^\s+//

s/\s+$//

Matches: " foo bar ", "foo "

Nonmatches: "foo bar"
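The same two substitutions can be sketched outside Perl. The helper name trim below is hypothetical and Python's re.sub is assumed as the substitution function; the patterns themselves are the ones listed above.

```python
import re

def trim(s):
    """Remove leading, then trailing, whitespace with two substitutions."""
    s = re.sub(r"^\s+", "", s)
    return re.sub(r"\s+$", "", s)

trimmed_both = trim(" foo bar ")
trimmed_tail = trim("foo ")
untouched = trim("foo bar")

# Neither pattern finds anything to remove in the nonmatching string.
nonmatch = re.search(r"^\s+|\s+$", "foo bar")

print(repr(trimmed_both), repr(trimmed_tail), repr(untouched), nonmatch)
```

Interior whitespace is left alone because both patterns are anchored to an end of the string.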


Match date: MM/DD/YYYY HH:MM:SS

Matches: tony@example.com, tony@i-e.com, tony@mail.example.museum

Nonmatches: .@example.com, tony@i-.com, tony@example.a

(See MRE 70.)


HTTP URL

/(https?):\/\/([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+([a-zA-Z]{2,9})(:\d{1,4})?([-\w\/#~:.?+=&%@~]*)/

Matches: https://example.com, http://foo.com:8080/bar.html

Nonmatches: ftp://foo.com, ftp://foo.com/
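The HTTP URL pattern can be tried against the listed targets. This is a sketch under two assumptions: Python's re module is used in place of Perl, and the top-level-domain part is written as its own group, ([a-zA-Z]{2,9}), so that the parentheses balance. The variable names are made up for the example.

```python
import re

# The pattern is split across string literals for readability only.
http_url = re.compile(
    r"(https?)://"                        # scheme
    r"([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+"  # one or more host labels
    r"([a-zA-Z]{2,9})"                    # top-level domain (assumed group)
    r"(:\d{1,4})?"                        # optional port
    r"([-\w/#~:.?+=&%@~]*)"               # path and query characters
)

targets = ["https://example.com", "http://foo.com:8080/bar.html",
           "ftp://foo.com", "ftp://foo.com/"]
hits = [http_url.search(t) is not None for t in targets]

parts = http_url.search("http://foo.com:8080/bar.html").groups()

print(hits)
print(parts)
```

Because the host-label group repeats, only its last repetition is kept in the captured groups.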

Perl 5.8

Perl provides a rich set of regular-expression operators, constructs, and features, with more being added in each new release. Perl uses a Traditional NFA match engine. For an explanation of the rules behind an NFA engine, see “Introduction to Regexes and Pattern Matching.”

This reference covers Perl version 5.8. A number of new features will be introduced in Perl 5.10; these are covered in Table 8. Unicode features were introduced in 5.6, but did not stabilize until 5.8. Most other features work in versions 5.004 and later.

Supported Metacharacters

Perl supports the metacharacters and metasequences listed in Table 3 through Table 7. To learn more about expanded definitions of each metacharacter, see “Regex Metacharacters, Modes, and Constructs.”

Table 3. Perl character representations

Sequence   Meaning
\a         Alert (bell)
\b         Backspace; supported only in character class (outside of character class matches a word boundary)
\e         Esc character, x1B
\n         Newline; x0A on Unix and Windows, x0D on Mac OS 9
\r         Carriage return; x0D on Unix and Windows, x0A on Mac OS 9
\f         Form feed, x0C
\t         Horizontal tab, x09
\octal     Character specified by a two- or three-digit octal code
\xhex      Character specified by a one- or two-digit hexadecimal code
\x{hex}    Character specified by any hexadecimal code
\cchar     Named control character
\N{name}   A named character specified in the Unicode standard or listed in PATH_TO_PERLLIB/unicode/Names.txt; requires use charnames ':full'

Table 4. Perl character classes and class-like constructs

Class      Meaning
[ ]        A single character listed, or contained in a listed range
[^ ]       A single character not listed, and not contained within a listed range
[:class:]  POSIX-style character class valid only within a regex character class
.          Any character except newline (unless single-line mode, /s)
\C         One byte; however, this may corrupt a Unicode character stream
\X         Base character, followed by any number of Unicode combining characters
\w         Word character, \p{IsWord}
\W         Nonword character, \P{IsWord}
\d         Digit character, \p{IsDigit}
\D         Nondigit character, \P{IsDigit}
\s         Whitespace character, \p{IsSpace}
\S         Nonwhitespace character, \P{IsSpace}
\p{prop}   Character contained by given Unicode property, script, or block
\P{prop}   Character not contained by given Unicode property, script, or block

Table 5. Perl anchors and zero-width tests

Sequence   Meaning
^          Start of string, or, in multiline match mode (/m), the position after any newline
\A         Start of search string, in all match modes
$          End of search string or the point before a string-ending newline, or, in multiline match mode (/m), the position before any newline
\Z         End of string, or the point before a string-ending newline, in any match mode
\z         End of string, in any match mode
\G         Beginning of current search
\b         Word boundary
\B         Not-word-boundary
(?= )      Positive lookahead
(?! )      Negative lookahead
(?<= )     Positive lookbehind; fixed-length only
(?<! )     Negative lookbehind; fixed-length only

Table 6. Perl comments and mode modifiers

Modifier   Meaning
/i         Case-insensitive matching
/m         ^ and $ match next to embedded \n
/s         Dot (.) matches newline
/x         Ignore whitespace, and allow comments (#) in pattern
/o         Compile pattern only once
(?mode)    Turn listed modes (one or more of xsmi) on for the rest of the subexpression
(?# )      Treat substring as a comment
#          Treat rest of line as a comment in /x mode
\u         Force next character to uppercase
\l         Force next character to lowercase
\U         Force all following characters to uppercase
\L         Force all following characters to lowercase
\Q         Quote all following regex metacharacters
\E         End a span started with \U, \L, or \Q

Table 7. Perl grouping, capturing, conditional, and control

Sequence       Meaning
( )            Group subpattern and capture submatch into \1, \2, ... and $1, $2, ...
\n             Contains text matched by the nth capture group
(?: )          Groups subpattern, but does not capture submatch
(?> )          Atomic grouping
|              Try subpatterns in alternation
*              Match 0 or more times
+              Match 1 or more times
?              Match 1 or 0 times
{n}            Match exactly n times
{n,}           Match at least n times
{x,y}          Match at least x times, but no more than y times
*?             Match 0 or more times, but as few times as possible
+?             Match 1 or more times, but as few times as possible
??             Match 0 or 1 times, but as few times as possible
{n,}?          Match at least n times, but as few times as possible
{x,y}?         Match at least x times, and no more than y times, but as few times as possible
(?(COND) | )   Match with if-then-else pattern, where COND is an integer referring to a backreference, or a lookaround assertion
(?(COND) )     Match with if-then pattern
(?{CODE})      Execute embedded Perl code
(??{CODE})     Match regex from embedded Perl code

Table 8. New features in Perl 5.10

Sequence         Meaning
                 Backreference to named capture group
%+               Hash reference to the leftmost capture of a given name, $+{foo}
%-               Hash reference to an array of all captures of a given name, $-{foo}[0]
\g{n} or \gn     Back reference to the nth capture
\g{-n} or \g-n   Relative backreference to the nth previous capture
(?n)             Recurse into the nth capture buffer
(?&NAME)         Recurse into the named capture buffer
(?R)             Recursively call the entire expression
(?(DEFINE) )     Define a subexpression that can be recursed into
(*FAIL)          Fail submatch, and force the engine to backtrack
(*ACCEPT)        Force engine to accept the match, even if there is more pattern to check
(*PRUNE)         Cause the match to fail from the current starting position
(*MARK:name)     Marks and names the current position in the string. The position is available in $REGMARK
(*SKIP:name)     Reject all matches up to the point where the named MARK was executed
(*THEN)          When backtracked into, skip to the next alternation
(*COMMIT)        When backtracked into, cause the match to fail outright
/p               Mode modifier that enables the ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} variables
\K               Exclude previously matched text from the final match

Regular Expression Operators

Perl provides the built-in regular expression operators qr//, m//, and s///, as well as the split function. Each operator accepts a regular expression pattern string that is run through string and variable interpolation, and then compiled.

Regular expressions are often delimited with the forward slash, but you can pick any nonalphanumeric, non-whitespace character. Here are some examples:

Using the single quote as a delimiter suppresses interpolation of variables and the constructs \N{name}, \u, \l, \U, \L, \Q, and \E. Normally, these are interpolated before being passed to the regular expression engine.

qr// (Quote Regex)

qr/PATTERN/ismxo

Quote and compile PATTERN as a regular expression. The returned value may be used in a later pattern match or substitution. This saves time if the regular expression is going to be interpolated repeatedly. The match modes (or lack of), /ismxo, are locked in

m// (Matching)

m/PATTERN/imsxocg

Match PATTERN against input string. In list context, returns a list of substrings matched by capturing parentheses, or else (1) for a successful match or () for a failed match. In scalar context, returns 1 for success, or "" for failure. /imsxo are optional mode modifiers. /cg are optional match modifiers. /g in scalar context causes the match to start from the end of the previous match. In list context, a /g match returns all matches, or all captured substrings from all matches. A failed /g match will reset the match start to the beginning of the string, unless the match is in combined /cg mode.

s/// (Substitution)

s/PATTERN/REPLACEMENT/egimosx

Match PATTERN in the input string, and replace the match text with REPLACEMENT, returning the number of successes. /imosx are optional mode modifiers. /g substitutes all occurrences of PATTERN. Each /e causes an evaluation of REPLACEMENT as Perl code.

split

split /PATTERN/, EXPR, LIMIT
split /PATTERN/, EXPR
split /PATTERN/
split

Return a list of substrings surrounding matches of PATTERN in EXPR. If LIMIT is included, the list contains substrings surrounding the first LIMIT matches. The pattern argument is a match operator, so use m if you want alternate delimiters (e.g., split m{PATTERN}). The match permits the same modifiers as m{}. Table 9 lists the after-match variables.

Perl supports the standard Unicode properties (see Table 2) as well as Perl-specific composite properties (see Table 10). Scripts and properties may have an Is prefix, but do not require it. Blocks require an In prefix only if the block name conflicts with a script name.

Table 9. Perl after-match variables

Variable   Meaning
$+         Last parenthesized match
$`         Text before match. Causes all regular expressions to be slower
$^N        Text of most recently closed capturing parentheses
$*         If true, /m is assumed for all matches without a /s
$^R        The result value of the most recently executed code construct within a pattern match

Example 1 Simple match

# Find Spider-Man, Spiderman, SPIDER-MAN, etc

my $dailybugle = "Spider-Man Menaces City!";

if ($dailybugle =~ m/spider[- ]?man/i) { do_something( ); }

Example 2 Match, capture group, and qr

# Match dates formatted like MM/DD/YYYY, MM-DD-YY,


Other Resources

• Programming Perl, by Larry Wall et al. (O’Reilly), is the standard Perl reference.

• Mastering Regular Expressions, Third Edition, by Jeffrey E. F. Friedl (O’Reilly), covers the details of Perl regular expressions on pages 283–364.

• perlre is the perldoc documentation provided with most Perl distributions.

Example 3 Simple substitution

# Convert <br> to <br /> for XHTML compliance

my $text = "Hello World! <br>";

$text =~ s#<br>#<br />#ig;

Example 4 Harder substitution

# urlify - turn URLs into HTML links

$text = "Check the web site, http://www.oreilly.com/catalog/regexppr.";

# resource and colon

[\w/#~:.?+=&%@!\-] +? # one or more valid


Java (java.util.regex)

Java 1.4 introduced regular expressions with Sun’s java.util.regex package. Although there are competing packages available for previous versions of Java, Sun’s is now the standard. Sun’s package uses a Traditional NFA match engine. For an explanation of the rules behind a Traditional NFA engine, see “Introduction to Regexes and Pattern Matching.” This section covers regular expressions in Java 1.5 and 1.6.

Supported Metacharacters

java.util.regex supports the metacharacters and metasequences listed in Table 11 through Table 15. For expanded definitions of each metacharacter, see “Regex Metacharacters, Modes, and Constructs.”

Table 11. Java character representations

Sequence   Meaning
\xhex      Character specified by a two-digit hexadecimal code
\uhex      Unicode character specified by a four-digit hexadecimal code
\cchar     Named control character

Table 12. Java character classes and class-like constructs

Class      Meaning
[ ]        A single character listed or contained in a listed range
[^ ]       A single character not listed and not contained within a listed range
.          Any character, except a line terminator (unless DOTALL mode)
\w         Word character, [a-zA-Z0-9_]
\W         Nonword character, [^a-zA-Z0-9_]
\d         Digit, [0-9]
\D         Nondigit, [^0-9]
\s         Whitespace character, [ \t\n\f\r\x0B]
\S         Nonwhitespace character, [^ \t\n\f\r\x0B]
\p{prop}   Character contained by given POSIX character class, Unicode property, or Unicode block
\P{prop}   Character not contained by given POSIX character class, Unicode property, or Unicode block

Table 13 Java anchors and other zero-width tests

^ Start of string, or the point after any newline if in

MULTILINE mode

\A Beginning of string, in any match mode

$ End of string, or the point before any newline if in

Trang 37

(?= ) Positive lookahead.

(?! ) Negative lookahead

(?<= ) Positive lookbehind

(?<! ) Negative lookbehind

Table 14 Java comments and mode modifiers

Modifier/sequence     Mode character   Meaning
Pattern.UNIX_LINES    d                Treat \n as the only line terminator
Pattern.DOTALL        s                Dot (.) matches any character, including a line terminator
Pattern.MULTILINE     m                ^ and $ match next to embedded line terminators
Pattern.COMMENTS      x                Ignore whitespace, and allow embedded comments starting with #
Pattern.CANON_EQ                       Unicode “canonical equivalence” mode, where characters, or sequences of a base character and combining characters with identical visual representations, are treated as equals
(?mode)                                Turn listed modes (one or more of idmsux) on for the rest of the subexpression
(?-mode)                               Turn listed modes (one or more of idmsux) off for the rest of the subexpression
(?mode: )                              Turn listed modes (one or more of idmsux) on within parentheses


(?-mode: )                             Turn listed modes (one or more of idmsux) off within parentheses
#                                      Treat rest of line as a comment in COMMENTS (x) mode
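The sketch below shows how the modes in Table 14 might be switched on, either through Pattern constants passed to compile or through inline mode modifiers. The class ModeDemo, its helpers, and the DATE pattern string are all invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; helper names and sample data are illustrative.
class ModeDemo {
    // A pattern written for COMMENTS mode: whitespace is ignored and
    // everything from # to the end of the line is a comment.
    static final String DATE =
        "(\\d{4})  # year\n" +
        "-         # separator\n" +
        "(\\d{2})  # month";

    // True if regex, compiled with flags, matches the entire input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Number of matches of regex, compiled with flags, found in input.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // DOTALL as a flag, and the same mode turned on inline with (?s)
        System.out.println(matches("a.b", 0, "a\nb"));
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));
        System.out.println(matches("(?s)a.b", 0, "a\nb"));
        // MULTILINE lets ^ match after each embedded line terminator
        System.out.println(count("^\\w", 0, "x\ny\nz"));
        System.out.println(count("^\\w", Pattern.MULTILINE, "x\ny\nz"));
        // UNIX_LINES: only \n is a line terminator, so dot matches \r
        System.out.println(matches(".", Pattern.UNIX_LINES, "\r"));
        // Mode-modified span: only the "b" is case-insensitive
        System.out.println(matches("a(?i:b)c", 0, "aBc"));
        // COMMENTS mode with the annotated pattern above
        System.out.println(matches(DATE, Pattern.COMMENTS, "2007-07"));
    }
}
```

Several flags can be combined with the bitwise OR operator, for example Pattern.DOTALL | Pattern.MULTILINE.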

Table 15 Java grouping, capturing, conditional, and control

\n         Contains text matched by the nth capture group
$n         In a replacement string, contains text matched by the nth capture group
(?: )      Groups subpattern, but does not capture submatch
(?> )      Atomic grouping
|          Try subpatterns in alternation
*          Match 0 or more times
+          Match 1 or more times
?          Match 1 or 0 times
{n}        Match exactly n times
{n,}       Match at least n times
{x,y}      Match at least x times, but no more than y times
*?         Match 0 or more times, but as few times as possible
+?         Match 1 or more times, but as few times as possible
??         Match 0 or 1 times, but as few times as possible
{n,}?      Match at least n times, but as few times as possible
{x,y}?     Match at least x times, no more than y times, and as few times as possible
*+         Match 0 or more times, and never backtrack
++         Match 1 or more times, and never backtrack
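The difference between greedy, lazy, and possessive quantifiers is easiest to see side by side. The following sketch also touches on backreferences, $n in replacement strings, and atomic grouping. The class QuantifierDemo and the helper firstMatch are hypothetical names, and the sample strings are arbitrary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; firstMatch is an illustrative helper.
class QuantifierDemo {
    // Return the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Swap two hyphen-separated numbers using $n in the replacement string.
    static String swap(String input) {
        return input.replaceAll("(\\d+)-(\\d+)", "$2/$1");
    }

    public static void main(String[] args) {
        String html = "<b>bold</b> and <i>italic</i>";
        // Greedy: .+ runs to the last '>' in the string
        System.out.println(firstMatch("<.+>", html));
        // Lazy: .+? stops at the first '>' it can
        System.out.println(firstMatch("<.+?>", html));
        // Possessive: .*+ swallows the closing quote and never gives it
        // back, so the overall match fails
        System.out.println(firstMatch("\".*+\"", "\"abc\""));
        // Possessive on a negated class is safe and avoids backtracking
        System.out.println(firstMatch("\"[^\"]*+\"", "\"abc\""));
        // Atomic grouping behaves like a possessive quantifier on the group
        System.out.println(firstMatch("(?>a+)ab", "aaab"));
        System.out.println(firstMatch("(?:a+)ab", "aaab"));
        // Backreference \1 finds a doubled word
        System.out.println(firstMatch("\\b(\\w+) \\1\\b", "it was the the best"));
        System.out.println(swap("2007-07"));
    }
}
```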


Regular Expression Classes and Interfaces

Regular expression functions are contained in two main classes, java.util.regex.Pattern and java.util.regex.Matcher; an exception, java.util.regex.PatternSyntaxException; and an interface, CharSequence. Additionally, the String class implements the CharSequence interface and provides basic pattern-matching methods. Pattern objects are compiled regular expressions that can be applied to any CharSequence. A Matcher is a stateful object that scans for one or more occurrences of a Pattern applied in a string (or any object implementing CharSequence).

Backslashes in regular expression String literals need to be escaped. So, \n (newline) becomes \\n when used in a Java String literal that is to be used as a regular expression.
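A minimal sketch of how these pieces fit together follows: escaped backslashes in String literals, a Matcher scanning a CharSequence that is not a String, and PatternSyntaxException for a malformed pattern. The class EscapeDemo and its method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; the method names are illustrative only.
class EscapeDemo {
    // The regex \d+ is written "\\d+" in a Java String literal.
    static boolean isDigits(String s) {
        return Pattern.matches("\\d+", s);
    }

    // A literal backslash is \\ as a regex, so "\\\\" as a Java literal.
    static boolean hasBackslash(CharSequence s) {
        return Pattern.compile("\\\\").matcher(s).find();
    }

    // A Matcher keeps its position between find() calls, so a loop
    // walks through successive occurrences in any CharSequence.
    static int countMatches(Pattern p, CharSequence input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Pattern.compile throws PatternSyntaxException for malformed regexes.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isDigits("2007"));
        System.out.println(hasBackslash("C:\\temp"));
        System.out.println(Pattern.matches("\\n", "\n"));
        // StringBuilder implements CharSequence, so it can be scanned directly
        System.out.println(countMatches(Pattern.compile("\\d+"), new StringBuilder("a1b22c333")));
        System.out.println(isValidRegex("(unclosed"));
    }
}
```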

java.lang.String

Description

Methods for pattern matching

Methods

boolean matches(String regex)

Return true if regex matches the entire String

String[] split(String regex)

Return an array of the substrings surrounding matches of regex

Table 15 Java grouping, capturing, conditional, and control (continued)

?+         Match 0 or 1 times, and never backtrack
{n}+       Match exactly n times, and never backtrack
{n,}+      Match at least n times, and never backtrack
{x,y}+     Match at least x times, no more than y times, and never backtrack

String[] split(String regex, int limit)

Return an array of the substrings surrounding the first limit-1 matches of regex

String replaceFirst(String regex, String replacement)

Replace the first substring matched by regex with replacement

String replaceAll(String regex, String replacement)

Replace all substrings matched by regex with replacement
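The String convenience methods can be tried out with a sketch like the one below. The class StringRegexDemo and its wrapper methods are invented for this example; the calls inside them are the java.lang.String methods listed above.

```java
import java.util.Arrays;

// Hypothetical demo class wrapping the java.lang.String regex methods.
class StringRegexDemo {
    // matches() must cover the entire string, not just part of it.
    static boolean isYear(String s) {
        return s.matches("\\d{4}");
    }

    // Split on commas, discarding whitespace around each comma.
    static String[] splitCsv(String line) {
        return line.split("\\s*,\\s*");
    }

    // With a positive limit, at most limit-1 splits are made and the
    // last element holds the unsplit remainder.
    static String[] splitCsv(String line, int limit) {
        return line.split("\\s*,\\s*", limit);
    }

    public static void main(String[] args) {
        System.out.println(isYear("2007"));
        System.out.println(isYear("year 2007"));
        System.out.println(Arrays.toString(splitCsv("a, b ,c")));
        System.out.println(Arrays.toString(splitCsv("a,b,c,d", 2)));
        System.out.println("a-b-c".replaceFirst("-", "+"));
        System.out.println("a-b-c".replaceAll("-", "+"));
    }
}
```

Each of these methods compiles the regex on every call, so code that applies the same pattern repeatedly may prefer a precompiled Pattern.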

java.util.regex.Pattern

Description

Models a regular expression pattern

Methods

static Pattern compile(String regex)

Construct a Pattern object from regex

static Pattern compile(String regex, int flags)

Construct a new Pattern object out of regex and the OR’d mode-modifier constants flags

int flags( )

Return the Pattern’s mode modifiers

Matcher matcher(CharSequence input)

Construct a Matcher object that will match this Pattern against input

static boolean matches(String regex, CharSequence input)

Return true if regex matches the entire string input

String pattern( )

Return the regular expression used to create this Pattern

static String quote(String text)

Escape text so that regular expression operators will be matched literally

String[] split(CharSequence input)

Return an array of the substrings surrounding matches of this Pattern in input

String[] split(CharSequence input, int limit)

Return an array of the substrings surrounding the first limit-1 matches of this Pattern in input
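As a sketch of the Pattern methods listed above, the example below compiles a pattern once and reuses it. The class PatternDemo, the constants WORD and SEMICOLON, and the helper names are hypothetical. Pattern.CASE_INSENSITIVE is a standard java.util.regex constant (mode character i).

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; constant and helper names are illustrative.
class PatternDemo {
    // Compile once, reuse many times.
    static final Pattern WORD = Pattern.compile("[a-z]+", Pattern.CASE_INSENSITIVE);
    static final Pattern SEMICOLON = Pattern.compile("\\s*;\\s*");

    // quote() makes every character in literal match itself.
    static boolean matchesLiteral(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    // A limit of 0 places no bound on the number of splits.
    static String[] splitFields(String input, int limit) {
        return SEMICOLON.split(input, limit);
    }

    public static void main(String[] args) {
        System.out.println(WORD.pattern());
        System.out.println((WORD.flags() & Pattern.CASE_INSENSITIVE) != 0);
        System.out.println(WORD.matcher("Hello").matches());
        // Unquoted, "1+1=2" means "one or more 1s, then 1=2"
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));
        System.out.println(matchesLiteral("1+1=2", "1+1=2"));
        System.out.println(Arrays.toString(SEMICOLON.split("a ; b;c")));
        System.out.println(Arrays.toString(splitFields("a ; b;c", 2)));
    }
}
```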
