There are many regex packages for Java; the list that follows has a few words about those that I investigated while researching this book. (See this book's web page, regex.info/, for links.) The table below gives a superficial overview of some of the differences among their flavors.
Sun
java.util.regex Sun's own regex package, finally standard as of Java 1.4. It's a solid, actively maintained package that provides a rich Perl-like flavor. It has the best Unicode support of these packages. It provides all the basic functionality you might need, but has only minimal convenience functions. It matches against CharSequence objects, and so is extremely flexible in that respect. Its documentation is clear and complete. It is the all-around fastest of the engines listed here. This package is described
org.apache.xerces.utils.regex package, which I did not investigate). It's actively maintained, and provides a rich Perl-like flavor, although it is somewhat buggy in certain areas. It has very good Unicode support. It can match against char[], CharacterIterator, and String. Overall,
Possessive quantifiers Word boundaries \b \b \b \< \b \> \b \< \>
maintained, and solid, although its version of a Perl-like flavor is much less rich than either the Sun or the IBM packages. It has minimal Unicode support. Overall, the regex engine is notably slower than most other packages. Its \G is broken. It can match against char[] and String.
One of its strongest points is that it has a vast, modular structure that exposes almost all of the mechanics that surround the engine (the transmission, search-and-replace mechanics, etc.) so advanced users can tune it to suit their needs, but it also comes replete with a fantastic set of convenience functions that makes it one of the easiest packages to work with, particularly for those coming from a
Version Tested: 1.5.3
License: GNU LGPL (GNU Lesser General Public License)
GNU
gnu.regexp The more advanced of the two "GNU regex packages" for Java. (The other, gnu.rex, is a very small package providing only the most bare-bones regex flavor and support, and is not covered in this book.) It has some Perl-like features, and minimal Unicode support. It's very slow. It's the only package with a POSIX NFA (although its POSIXness is a bit buggy at times).
License: GNU LGPL (GNU Lesser General Public License)
Regexp
org.apache.regexp This is the other regex package under the umbrella of the Apache Jakarta project. It's somewhat popular, but quite buggy. It has the fewest features of the packages listed here. Its overall speed is on par with ORO. Not actively maintained. Minimal Unicode support.
is powerful, and its adoption by a wide variety of packages and languages has made it somewhat of a de facto standard. However, of the many packages, programs, and languages that claim to be "Perl5 compliant," none truly are. Even Perl itself differs from version to version as new features are added and bugs are fixed. Some of the innovations new with early 5.x versions of Perl were non-capturing parentheses, lazy quantifiers, lookahead, inline mode modifiers like (?i), and the /x free-spacing mode (all discussed in Chapter 3). Packages supporting only these features claim a "Perl5" flavor, but miss out on later innovations, such as lookbehind, atomic grouping, and conditionals.
"Perl5" enhancements. Sun's package, for example, supports possessive quantifiers, and both Sun and IBM support character class set operations. Pat offers an innovative way to do lookbehind, and a way to allow matching of simple arbitrarily nested constructs.
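To make the two java.util.regex extensions just mentioned a little more concrete, here is a small sketch. The class name, method names, and sample strings are invented for illustration; only the regex syntax itself (the possessive ++ and the && class intersection) comes from Sun's package.

```java
import java.util.regex.Pattern;

class BeyondPerl5 {
    // Possessive quantifier: \d++ never gives back digits once matched,
    // so the trailing \d has nothing left to match.
    static boolean possessiveMatches(String s) {
        return Pattern.matches("\\d++\\d", s);
    }

    // Greedy counterpart: \d+ backtracks one digit so the final \d can match.
    static boolean greedyMatches(String s) {
        return Pattern.matches("\\d+\\d", s);
    }

    // Character class set operation: lowercase ASCII letters, minus the vowels.
    static boolean isConsonant(String s) {
        return Pattern.matches("[a-z&&[^aeiou]]", s);
    }

    public static void main(String[] args) {
        System.out.println(greedyMatches("1234"));     // true
        System.out.println(possessiveMatches("1234")); // false
        System.out.println(isConsonant("t"));          // true
        System.out.println(isConsonant("e"));          // false
    }
}
```

The greedy and possessive versions differ only in whether the engine may backtrack into the digits, which is why the same input gives opposite answers.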
8.3.2 Lies, Damn Lies, and Benchmarks
It's probably a common twist on Sam Clemens' famous "lies, damn lies, and statistics" quote, but when I saw its use with "benchmarks" in a paper from Sun while doing research for this chapter, I knew it was an appropriate introduction for this section. In researching these seven packages, I've run literally thousands of benchmarks, but the only fact that's clearly emerged is that there are no clear conclusions.
There are several things that cloud regex benchmarking with Java. First, there are language issues. Recall the benchmarking discussion from Chapter 6 (see Section 6.3.2), and the special issues that make benchmarking Java a slippery science at best (primarily, the effects of the Just-In-Time or Better-Late-Than-Never compiler). In doing these benchmarks, I've made sure to use a server VM that was "warmed up" for the benchmark (see "BLTN," Section 6.3.2), to show the truest results.
Then there are regex issues. Due to the complex interactions of the myriad of optimizations like those discussed in Chapter 6, a seemingly inconsequential change while trying to test one feature might tickle the optimization of an unrelated feature, anonymously skewing the results one way or the other. I did many (many!) very specific tests, usually approaching an issue from multiple directions, and so I believe I've been able to get meaningful results, but one never truly knows.
8.3.2.1 Warning: Benchmark results can cause drowsiness!
Just to show how slippery this all can be, recall that I judged the two Jakarta packages (ORO and Regexp) to be roughly comparable in speed. Indeed, they finished equally in some of the many benchmarks I ran, but for the most part, one generally ran at least twice the speed of the other (sometimes 10x or 20x the speed). But which was "one" and which "the other" changed depending upon the test.
For example, I targeted the speed of greedy and lazy quantifiers by applying ^.*: and ^.*?: to a very long string like '···xxx:x'. I expected the greedy one to be faster than the lazy one with this type of string, and indeed, it's that way for every package, program, and language I tested, except one. For whatever reason, Jakarta's Regexp's ^.*: performed 70% slower than its ^.*?: did. I then applied the same expressions to a similarly long string, but this time one like 'x:xxx···' where the ':' is near the beginning. This should give the lazy quantifier an edge, and indeed, with Regexp, the expression with the lazy quantifier finished 670x faster than the greedy. To gain more insight, I applied ^[^:]*: to each string. This should be in the same ballpark, I thought, as the lazy version, but highly contingent upon certain optimizations that may or may not be included in the engine. With Regexp, it finished the test a bit slower than the lazy version, for both strings.
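For readers who want to try this kind of comparison themselves, here is a minimal sketch of a timing harness for the three expressions and the two string shapes. It uses java.util.regex rather than Jakarta's Regexp, so it will not reproduce the numbers above; the class name, string length, and iteration count are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

class QuantifierTiming {
    // Build a string of 'x' characters with a single ':' at colonIndex.
    static String buildTarget(int length, int colonIndex) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(i == colonIndex ? ':' : 'x');
        }
        return sb.toString();
    }

    // Apply the compiled regex `iterations` times; return elapsed nanoseconds.
    // All three expressions should match, so a failure is reported loudly.
    static long time(Pattern p, String target, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (!p.matcher(target).find()) {
                throw new IllegalStateException("no match for " + p);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int length = 10000; // assumed; the text says only "very long"
        String colonNearEnd = buildTarget(length, length - 2); // like "...xxx:x"
        String colonNearStart = buildTarget(length, 1);        // like "x:xxx..."
        String[] regexes = { "^.*:", "^.*?:", "^[^:]*:" };
        for (String regex : regexes) {
            Pattern p = Pattern.compile(regex); // compiled outside the timed loop
            System.out.println(regex + " colon near end:   "
                    + time(p, colonNearEnd, 1000) / 1000000.0 + " ms");
            System.out.println(regex + " colon near start: "
                    + time(p, colonNearStart, 1000) / 1000000.0 + " ms");
        }
    }
}
```

No expected timings are given, since they depend entirely on the VM, its warm-up state, and the machine.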
Does the previous paragraph make your eyes glaze over a bit? Well, it discusses just six tests, and for only one regex package; we haven't even started to compare these Regexp results against ORO or any of the other packages. When compared against ORO, it turns out that Regexp is about 10x slower with four of the tests, but about 20x faster with the other two! It's faster with ^.*?: and ^[^:]*: applied to the long string with
quickly
Are your eyes completely glazed over yet? Let's try the same set of six tests, but this time on short strings instead of very long ones. It turns out that Regexp is faster, three to ten times faster, than ORO for all of them. Okay, so what does this tell us? Perhaps that ORO has a lot of clunky overhead that overshadows the actual match time when the matches are found quickly. Or perhaps it means that Regexp is generally much faster, but has an inefficient mechanism for accessing the target string. Or perhaps it's something else altogether. I don't know.
Another test involved an "exponential match" (see Section 6.1.4) on a short string, which tests the basic churning of an engine as it tracks and backtracks. These tests took a long time, yet Regexp tended to finish in half the time of ORO. There just seems to be no rhyme nor reason to the results. Such is often the case when benchmarking something as complex as a regex engine.
8.3.2.2 And the winner is
The mind-numbing statistics just discussed take into account only a small fraction of the many, varied tests I did. In looking at them all for Regexp and ORO, one package does not stand out as being faster overall. Rather, the good points and bad points seem to be distributed fairly evenly between the two, so I (perhaps somewhat arbitrarily) judge them to be about equal.
Adding the benchmarks from the five other packages into the mix results in a lot of drowsiness for your author, and no obviously clear winner, but overall, Sun's package seems to be
somewhat behind are Pat, Jregex, Regexp, and ORO. The GNU package is clearly the slowest.
The overall difference between Sun and IBM is not so obviously clear that another equally comprehensive benchmark suite wouldn't show the opposite order if the suite happened to be tweaked slightly differently than mine. Or, for that matter, it's entirely possible that someone looking at all my benchmark data would reach a different conclusion. And, of course, the results could change drastically with the next release of any of the packages or virtual machines (and may well have, by the time you read this). It's a slippery science.
In general, Sun did most things very well, but it's missing a few key optimizations, and some constructs (such as character classes) are much slower than one would expect. Over time, these will likely be addressed by Sun (and in fact, the slowness of character classes is slated to be fixed in Java 1.4.2). The source code is available if you'd like to hack on it as well; I'm sure Sun would appreciate ideas and patches that improve it.
8.3.3 Recommendations
There are many reasons one might choose one package over another, but Sun's java.util.regex package, with its high quality, speed, good Unicode support, advanced features, and future ubiquity, is a good recommendation. It comes integrated as part of Java 1.4: String.matches(), for example, checks to see whether the string can be completely matched by a given regex.
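As a quick illustration of that "completely matched" behavior, here is a small sketch; the class and method names are made up for the example. String.matches() succeeds only when the regex covers the whole string, whereas Matcher.find() is satisfied by a match anywhere within it.

```java
import java.util.regex.Pattern;

class WholeStringMatch {
    // True only if the ENTIRE string consists of one or more digits.
    static boolean isAllDigits(String s) {
        return s.matches("\\d+");
    }

    // True if one or more digits appear ANYWHERE in the string.
    static boolean containsDigits(String s) {
        return Pattern.compile("\\d+").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("12345"));       // true
        System.out.println(isAllDigits("123 Main"));    // false
        System.out.println(containsDigits("123 Main")); // true
    }
}
```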
java.util.regex's strengths lie in its core engine, but it doesn't have a good set of "convenience functions," a layer that hides much of the drudgery of bit-shuffling behind the scenes. ORO, on the other hand, while its core engine isn't as strong,
java.util.regex with ORO sounds very appealing. I've talked to the ORO developer, and it seems likely that this will happen, so the rest of this chapter looks at Sun's java.util.regex and ORO's interface.
I'd like to start with the story about the evolution of some regular expression flavors and their associated programs. So, grab a hot cup (or frosty mug) of your favorite brewed beverage and relax as we look at the sometimes wacky history behind the regular expressions we have today. The idea is to add color to our regex understanding, and to develop a feeling as to why "the way things are" are the way things are. There are some footnotes for those that are interested, but for the most part, this should be read as a light story for enjoyment.
3.1.1 The Origins of Regular Expressions
The seeds of regular expressions were planted in the early 1940s by two neurophysiologists, Warren McCulloch and Walter Pitts, who developed models of how they believed the nervous system worked at the neuron level.[1] Regular expressions became a reality several years later when mathematician Stephen Kleene formally described these models in an algebra he called regular sets. He devised a simple notation to express these regular sets, and called them regular expressions.
[2] Robert L. Constable, "The Role of Finite Automata in the Development of Modern Computing Theory," in The Kleene Symposium, Eds. Barwise, Keisler, and Kunen (North-Holland Publishing Company, 1980), 61-83.
able to find is Ken Thompson's 1968 article Regular Expression Search Algorithm,[3] in which he describes a regular-expression compiler that produced IBM 7094 object code. This led to his
into its own utility, grep (after which egrep, extended grep, was later modeled).
3.1.1.1 Grep's metacharacters
The regular expressions supported by grep and other early tools were quite limited when compared to egrep's. The metacharacter * was supported, but + and ? were not (the latter's absence being a particularly strong drawback). grep's capturing metacharacters were \(···\), with unescaped parentheses representing literal text.[4] grep supported line anchors, but in a limited way. If ^ appeared at the beginning of the regex, it was a metacharacter matching the beginning of the line. Otherwise, it wasn't a metacharacter at all and just matched a literal circumflex (also called a "caret"). Similarly, $ was the end-of-line metacharacter only at the end of the regex. The upshot was that you couldn't do something like end$|^start. But that's okay, since alternation wasn't supported either!
parentheses as delimiters because Ken Thompson felt regular expressions would be used to work primarily with C code, where needing to match raw parentheses would be more common than
were + and ? added, but they could be applied to
Sometimes new bugs were introduced as features were added. Other times, added features were later removed. There was little to no documentation for the many subtle points that round out a tool's flavor, so new tools either made up their own style, or attempted to mimic "what seemed to work" with other tools. Multiply that by the passage of time and numerous programmers, and the result is general confusion (particularly when you try to deal with everything at once).[5]
[5] Such as when writing a book about regular expressions; ask me, I know!
POSIX, short for Portable Operating System Interface, is a wide-ranging standard put forth in 1986 to ensure portability across operating systems. Several parts of this standard deal with regular expressions and the traditional tools that use them, so it's of some interest to us. None of the flavors covered in this book, however, strictly adhere to all the relevant parts. In an effort to reorganize the mess that regular expressions had
to be internationalized. They are not a regex-specific concept, although they can affect regular-expression use. For example, when working with a locale that describes the Latin-1 (ISO-8859-1) encoding, à and À (characters with ordinal values 224 and 192, respectively) are considered "letters," and any application of a regex that ignores capitalization would know to treat them as identical.
alternation
Another example is \w, commonly provided as a shorthand for a "word-constituent character" (ostensibly, the same as [a-zA-Z0-9_] in many flavors). This feature is not required by POSIX, but it is allowed. If supported, \w would know to allow all letters and digits defined in the locale, not just those in ASCII.
Note, however, that the need for this aspect of locales is mostly alleviated when working with tools that support Unicode. Unicode is discussed further beginning in Section 3.3.2.2.
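To see the same two issues (what counts as a "word" character, and which characters differ only in case) in a Unicode-aware tool, here is a sketch using java.util.regex. The class and method names are invented, and the UNICODE_CHARACTER_CLASS flag is an assumption about your environment: it arrived in Java 7, well after the Java 1.4 discussed elsewhere in this book.

```java
import java.util.regex.Pattern;

class UnicodeWordChars {
    static final String A_GRAVE_LOWER = "\u00E0"; // à
    static final String A_GRAVE_UPPER = "\u00C0"; // À

    // Does \w match the whole one-character string under the given flags?
    static boolean isWordChar(String s, int flags) {
        return Pattern.compile("\\w", flags).matcher(s).matches();
    }

    // Does lowercase à, used as the regex, match uppercase À under the flags?
    static boolean lowerMatchesUpper(int flags) {
        return Pattern.compile(A_GRAVE_LOWER, flags).matcher(A_GRAVE_UPPER).matches();
    }

    public static void main(String[] args) {
        // By default, \w is ASCII-only in java.util.regex.
        System.out.println(isWordChar(A_GRAVE_LOWER, 0));                               // false
        System.out.println(isWordChar(A_GRAVE_LOWER, Pattern.UNICODE_CHARACTER_CLASS)); // true

        // Case-insensitive matching is also ASCII-only unless UNICODE_CASE is added.
        System.out.println(lowerMatchesUpper(Pattern.CASE_INSENSITIVE));                // false
        System.out.println(lowerMatchesUpper(
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));                      // true
    }
}
```

The characters are written as \u escapes so that the example does not depend on the encoding of the source file.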
3.1.1.6 Henry Spencer's regex package
Also first appearing in 1986, and perhaps of more importance, was the release by Henry Spencer of a regex package, written in C, which could be freely incorporated by others into their own programs, a first at the time. Every program that used Henry's package (and there were many) provided the same consistent regex flavor unless the program's author went to the explicit trouble to change it.
3.1.1.7 Perl evolves
At about the same time, Larry Wall started developing a tool that would later become the language Perl. He had already greatly enhanced distributed software development with his patch program, but Perl was destined to have a truly monumental impact.
Larry released Perl Version 1 in December 1987. Perl was an immediate hit because it blended so many useful features of other languages, and combined them with the explicit goal of being, in a day-to-day practical sense, useful.
expression operators in the tradition of the specialty tools sed and awk, a first for a general scripting language. For the regular expression engine, Larry borrowed code from an earlier project, his news reader rn (which based its regular expression code on that in James Gosling's Emacs).[6] The regex flavor was considered powerful by the day's standards, but was not nearly as full-featured as it is today. Its major drawbacks were that it supported at most nine sets of parentheses, and at most nine alternatives with |, and worst of all, | was not allowed within parentheses. It did not support case-insensitive matching, nor allow \w within a class (it didn't support \s or \d anywhere). It didn't support the {min,max} range quantifier.
since then it would match what characters were allowed in a Perl variable name. Furthermore, these metacharacters were now allowed inside classes. (Their opposites, \D, \W, and \S, were also newly supported, but weren't allowed within a class, and in any case sometimes didn't work correctly.) Importantly, the /i modifier was added, so you could now do case-insensitive matching.
Perl 3 came out more than a year later, in October 1989. It added the /e modifier, which greatly increased the power of the replacement operator, and fixed some backreference-related bugs from the previous version. It added the {min,max} range quantifiers, although unfortunately, they didn't always work
breakthrough wouldn't happen until 1994.
Perl 5 was officially released in October 1994. Overall, Perl had undergone a massive overhaul, and the result was a vastly superior language in every respect. On the regular-expression side, it had more internal optimizations, and a few metacharacters were added (including \G, which increased the power of iterative matches; see Section 3.4.3.3), non-capturing parentheses (see Section 2.2.3.1), lazy quantifiers (see Section 3.4.5.9), lookahead (see Section 2.3.5.1), and the /x modifier[7] (see Section 2.3.6.4).
[7] My claim to fame is that Larry added the /x modifier after seeing a note from me discussing a long and complex regex. In the note, I had "pretty printed" the regular expression for clarity. Upon seeing it, he thought that it would be convenient to do so in Perl code as well, so he added /x.
More important than just for their raw functionality, these "outside the box" modifications made it clear that regular expressions could really be a powerful programming language unto themselves, and were still ripe for further development.
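To give a feel for how those Perl 5 additions read in practice, here is a sketch that combines four of them in one expression. It is written for java.util.regex, where Perl's /x free-spacing mode corresponds to the Pattern.COMMENTS flag; the class name, the pattern, and the sample strings are all invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Perl5Features {
    // In COMMENTS mode, whitespace in the regex is ignored and '#' starts a comment.
    static final Pattern PRICE = Pattern.compile(
        "(?: USD | EUR )   # non-capturing parentheses around the currency code\n" +
        "\\s*              # optional whitespace\n" +
        "(\\d+?)           # group 1: digits, matched lazily\n" +
        "(?= \\.00 \\b )   # lookahead: .00 must follow, but is not consumed\n",
        Pattern.COMMENTS);

    // Return the whole-unit amount, or null if the text has no such price.
    static String wholeAmount(String text) {
        Matcher m = PRICE.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(wholeAmount("Total: USD 125.00 due")); // 125
        System.out.println(wholeAmount("Total: EUR 9.50 due"));   // null
    }
}
```

Because the currency code sits in non-capturing parentheses, the digits are group 1, and because the lookahead consumes nothing, the matched text ends before the decimal point.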
The newly-added non-capturing parentheses and lookahead constructs required a way to be expressed. None of the grouping pairs (···), [···], <···>, or {···} were available to be used for these new features, so Larry came up with the various '(?' notations we use today. He chose this unsightly
combination in a Perl regex, so he was free to give it meaning. One important consideration Larry had the foresight to recognize was that there would likely be additional functionality in the future, so by restricting what was allowed after the '(?' sequences, he was able to reserve them for future enhancements.
Subsequent versions of Perl grew more robust, with fewer bugs, more internal optimizations, and new features. I like to believe that the first edition of this book played some small part in this, for as I researched and tested regex-related features, I would send my results to Larry and the Perl Porters group, which helped give some direction as to where improvements might be made.
New regex features added over the years include limited lookbehind (see Section 2.3.5.1), "atomic" grouping (see Section 3.4.5.4), and Unicode support. Regular expressions were brought to the next level by the addition of conditional constructs (see Section 3.4.5.6), allowing you to make if-then-else decisions right there as part of the regular expression. And
building of web pages is just that, so Perl quickly became the language for web development. Perl became vastly more popular, and with it, its powerful regular expression flavor did as well.
compatible" to one extent or another were created. Among these were packages for Tcl, Python, Microsoft's .NET suite of languages, Ruby, PHP, C/C++, and many packages for Java.
3.1.1.9 Versions as of this book
Table 3-2 shows a few of the version numbers for programs and libraries that I talk about in the book. Older versions may well have fewer features and more bugs, while newer versions may have additional features and bug fixes (and new bugs of their own).
Because Java did not originally come with regex support, numerous regex libraries have been developed over the years, so anyone wishing to use regular expressions in Java needed to find them, evaluate them, and ultimately select one to use. Chapter 6 looks at seven such packages, and ways to evaluate them. For reasons discussed there, the regex package that Sun eventually came up with (their java.util.regex, now standard as of Java 1.4) is what I use for most of the Java examples in this book.
Perl 5.8
PHP (preg routines) 4.0.6
Procmail 3.22
Python 2.2.1
Ruby 1.6.7
GNU sed 3.02
Tcl 8.4
3.1.2 At a Glance
A chart showing just a few aspects of some common tools gives a good clue to how different things still are. Table 3-3 provides
Foremost is that programs change over time. For example, Tcl used to not support backreferences and word boundaries, but now does. It first supported word boundaries with the ungainly-looking [:<:] and [:>:], and still does, although such use is deprecated in favor of its more-recently supported \m, \M, and \y (start of word boundary, end of word boundary, or either).
Along the same lines, programs such as grep and egrep, which aren't from a single provider but rather can be provided by anyone who wants to create them, can have whatever flavor the individual author of the program wishes. Human nature being what it is, each tends to have its own features and peculiarities. (The GNU versions of many common tools, for example, are often more powerful and robust than other versions.)
And perhaps as important as the easily visible features are the
flavors. Looking at the table, one might think that regular expressions are exactly the same in Perl, .NET, and Java, which is certainly not true. Just a few of the questions one might ask when looking at something like Table 3-3 are:
Are star and friends allowed to quantify something wrapped in parentheses?
Does dot match a newline? Do negated character classes match it? Do either match the null character?
Are the line anchors really line anchors (i.e., do they recognize newlines that might be embedded within the target string)? Are they first-class metacharacters, or are they valid only in certain parts of the regex?
Are escapes recognized in character classes? What else is or isn't allowed within character classes?
Are parentheses allowed to be nested? If so, how deeply (and how many parentheses are even allowed in the first place)?
If backreferences are allowed, when a case-insensitive match is requested, do backreferences match appropriately? Do backreferences "behave" reasonably in fringe situations?
Are octal escapes such as \123 allowed? If so, how do they reconcile the syntactic conflict with backreferences? What about hexadecimal escapes? Is it really the regex engine that supports octal and hexadecimal escapes, or is it some other part of the utility?
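Questions like these are usually easiest to settle by asking the tool itself. As one example, the sketch below probes a few of them for java.util.regex; the class and method names are made up, and the answers apply only to that package, not to the other flavors in the table.

```java
import java.util.regex.Pattern;

class FlavorProbes {
    // Does dot match a newline under the given flags?
    static boolean dotMatchesNewline(int flags) {
        return Pattern.compile("a.b", flags).matcher("a\nb").matches();
    }

    // Does ^ recognize a newline embedded within the target string?
    static boolean caretMatchesAfterNewline(int flags) {
        return Pattern.compile("^second", flags).matcher("first\nsecond").find();
    }

    // Does a backreference honor a case-insensitive match request?
    static boolean backrefIgnoresCase() {
        return Pattern.compile("(a)\\1", Pattern.CASE_INSENSITIVE).matcher("aA").matches();
    }

    // Octal escapes need a leading zero here (\0101 is 'A'),
    // which keeps them from colliding with backreferences such as \1.
    static boolean octalEscapeMatchesA() {
        return Pattern.matches("\\0101", "A");
    }

    public static void main(String[] args) {
        System.out.println(dotMatchesNewline(0));                        // false
        System.out.println(dotMatchesNewline(Pattern.DOTALL));           // true
        System.out.println(caretMatchesAfterNewline(0));                 // false
        System.out.println(caretMatchesAfterNewline(Pattern.MULTILINE)); // true
        System.out.println(backrefIgnoresCase());                        // true
        System.out.println(octalEscapeMatchesA());                       // true
    }
}
```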
summary like Table 3-3 as a superficial guide. (As another example, peek ahead to Table 8-1 for a look at a chart showing some differences among Java packages.) If you realize that there's a lot of dirty laundry behind that nice façade, it's not too difficult to keep your wits about you and deal with it.
As mentioned at the start of the chapter, much of this is just superficial syntax, but many issues go deeper. For example, once you understand that something such as (Jul|July) in egrep needs to be written as \(Jul\|July\) for GNU Emacs, you might think that everything is the same from there, but that's not always the case. The differences in the semantics of how a match is attempted (or, at least, how it appears to be attempted) is an extremely important issue that is often overlooked, yet it explains why these two apparently identical examples would actually end up matching differently: one always matches 'Jul', even when applied to 'July'. Those very same semantics also explain why the opposite, (July|Jul) and \(July\|Jul\), do match the same text. Again, the entire next chapter is devoted to understanding this.
Of course, what a tool can do with a regular expression is often more important than the flavor of its regular expressions. For example, even if Perl's expressions were less powerful than egrep's, Perl's flexible use of regexes provides for more raw usefulness. We'll look at a lot of individual features in this chapter, and in depth at a few languages in later chapters.
Expression Features and Flavors
Now that you have a feel for regular expressions and a few diverse tools that use them, you might think we're ready to dive into using them wherever they're found. But even a simple comparison among the egrep versions of the first chapter and the Perl and Java in the previous chapter shows that regular expressions and the way they're used can vary wildly from tool to tool.
When looking at regular expressions in the context of their host language or tool, there are three broad issues to consider:
What metacharacters are supported, and their meaning. Often called the regex "flavor."
How regular expressions "interface" with the language or tool, such as how to specify regular-expression operations, what operations are allowed, and what text they operate on.
How the regular-expression engine actually goes about applying a regular expression to some text. The method that the language or tool designer uses to implement the regular-expression engine has a strong influence on the results one might expect from any given regular expression.
Regular Expressions and Cars
The considerations just listed parallel the way one might think while shopping for a car. With regular expressions, the metacharacters are the first thing you notice, just as with a car it's the body shape, shine, and nifty features like a CD player
splashed across the pages of a glossy brochure, and a list of metacharacters like the one in Section 1.5.6 is the regular-expression equivalent. It's important information, but only part of the story.
How regular expressions interface with their host program is also important. The interface is partly cosmetic, as in the syntax of how to actually provide a regular expression to the program. Other parts of the interface are more functional, defining what operations are supported, and how convenient they are to use. In our car comparison, this would be how the car "interfaces" with us and our lives. Some issues might be cosmetic, such as what side of the car you put gas in, or whether the windows are powered. Others might be a bit more important, such as if it has an automatic or manual transmission. Still others deal with functionality: can you fit the thing in your garage? Can you transport a king-size mattress? Skis? Five adults? (And how easy is it for those five adults to get in and out of the car? Easier with four doors than with two.) Many of these issues are also mentioned in the glossy brochure, although you might have to read the small print in the back to get all the details.
The final concern is about the engine, and how it goes about its work to turn the wheels. Here is where the analogy ends, because with cars, people tend to understand at least the minimum required about an engine to use it well: if it's a gasoline engine, they won't put diesel fuel into it. And if it has a manual transmission, they won't forget to use the clutch. But, in the regular-expression world, even the most minute details about how the regex engine goes about its work, and how that influences how expressions should be crafted and used, are usually absent from the documentation. However, these details are so important to the practical use of regular expressions that the entire next chapter is devoted to them.
In This Chapter
differently. Since that's not the case, knowing something about your utility's computational pedigree adds interesting and valuable insight.
$Count = $TimesToDo;
the timing starts? ($TestString is initialized with Perl's
These tests are about 5.3 and 4.1 seconds slower than the first tests. Most of the extra time is probably the overhead of working with $Count. The fact that ^(a|b|c|d|e|f|g)+$ is hit relatively harder (5.3 seconds slower than the first time, rather than 4.1 seconds slower) may reflect additional pre-match (or early-match) setup by the regex engine before getting into the main part of the match.
In any case, the point of this change is to illustrate that the results are strongly influenced by how much real work vs. non-work overtime is part of the timing.
6.3.2 Benchmarking with Java
Benchmarking Java can be a slippery science, for a number of reasons. Let's first look at a somewhat naïve example, and then look at why it's naïve, and at what can be done to make it less so. The listing below shows the benchmark example with Java, using Sun's java.util.regex.
Notice how the regular expressions are compiled in the initialization part of the program? We want to benchmark the matching speed, not the compile speed.
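A benchmark with that shape might look something like the following. This is only a sketch under stated assumptions, not the listing referred to above: the class name, the sample test string, the repeat counts, the character-class expression ^[a-g]+$ used for comparison, and the single untimed warm-up pass are all my own choices for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexBenchmarkSketch {
    // Assumed test data: a short sample repeated to build a longer target.
    static String buildTestString(int repeats) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repeats; i++) {
            sb.append("abababdedfg");
        }
        return sb.toString();
    }

    // Apply an already-compiled regex timesToDo times; return elapsed milliseconds.
    static long timeMatches(Pattern regex, String testString, int timesToDo) {
        Matcher m = regex.matcher(testString);
        long start = System.currentTimeMillis();
        for (int i = 0; i < timesToDo; i++) {
            m.reset(); // reuse the Matcher so object creation is not timed
            m.find();
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        // Compile during initialization, outside any timed region.
        Pattern alternation = Pattern.compile("^(a|b|c|d|e|f|g)+$");
        Pattern charClass = Pattern.compile("^[a-g]+$");
        String testString = buildTestString(100);
        int timesToDo = 1000;

        // One untimed pass each, as a modest warm-up for the VM.
        timeMatches(alternation, testString, timesToDo);
        timeMatches(charClass, testString, timesToDo);

        System.out.println("Alternation takes "
                + timeMatches(alternation, testString, timesToDo) / 1000.0 + " seconds");
        System.out.println("Character class takes "
                + timeMatches(charClass, testString, timesToDo) / 1000.0 + " seconds");
    }
}
```

The test string is kept fairly short on purpose, since a backtracking engine may use deep recursion for a repeated group applied to a very long target.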
second argument to each Regex constructor (see Section
for i in 1 TimesToDo
Alternation takes 0.362 seconds
Character class takes 0.352 seconds
Wow, they're both about the same speed! Well, recall from the table in Section 4.1.3 that Tcl has a hybrid NFA/DFA engine, and these regular expressions are exactly the same to a DFA engine. Most of what this chapter talks about simply does not apply to Tcl. See the sidebar in Section 6.4.4.1.1 for more