However, in PowerGrep, the regular expression pattern [t-r]ight won’t compile and produces the error shown in Figure 5-14.
Figure 5-14
There is, typically, no advantage in attempting to use reverse ranges in character classes, and I suggest that you avoid using these.
A Potential Range Trap
Suppose that you want to allow for different separators in dates occurring in a document or set of documents. Among the issues this problem throws up is a possible trap in expressing character ranges.
As a first test document, we will use Dates.txt, shown here:
2004-12-31
2001/09/11
2003.11.19
2002/04/29
2000/10/19
2005/08/28
2006/09/18
As you can see, in this file the dates are in YYYY/MM/DD format, but sometimes the dates use the hyphen as a separator, sometimes the forward slash, and sometimes the period. Your task is to select all occurrences of sequences of characters that represent dates (assume for this example that dates are expressed only using digits and separators and are not expressed using names of months, for example).
So if you wanted to select all dates, whether they use hyphens, forward slashes, or periods as separators, you might try a regular expression pattern like this:
(20|19)[0-9]{2}[.-/][01][0-9][.-/][0123][0-9]
In the character class [.-/], which you attempt to use to match the separator, the sequence of characters (period followed by hyphen followed by forward slash) is interpreted as the range from the period to the forward slash. However, as you can see in the top row of Figure 5-15, the hyphen is U+002D, and the period (U+002E) is the character immediately before the forward slash (U+002F). So, undesirably, the pattern .-/ specifies a range that contains only the period and forward-slash characters, and the hyphen is not matched.
Figure 5-15
Characters can be expressed using Unicode numeric references. The period is U+002E; uppercase A is U+0041. The Windows Character Map shows this syntax for characters if you hover the mouse over characters of interest.
To use the hyphen without creating a range, the hyphen should be the first character in the character class:
[-./]
This gives a pattern that will match each of the sample dates in the file Dates.txt:
(20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9]
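The range trap can also be checked outside PowerGrep. The following is a minimal sketch in Java, assuming Java 8 or later; the class name DateSeparatorSketch and its helper method are illustrative and are not part of the chapter's download files. It counts how many of the sample dates each version of the pattern matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateSeparatorSketch {
    // Trap: ".-/" is read as the range U+002E..U+002F, so the hyphen is excluded.
    static final String RANGE_TRAP = "(20|19)[0-9]{2}[.-/][01][0-9][.-/][0123][0-9]";
    // Fix: a hyphen placed first in the class is a literal hyphen.
    static final String HYPHEN_FIRST = "(20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9]";

    // Counts non-overlapping matches of regex in text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The seven dates from Dates.txt, one per line.
        String dates = String.join("\n",
                "2004-12-31", "2001/09/11", "2003.11.19", "2002/04/29",
                "2000/10/19", "2005/08/28", "2006/09/18");
        System.out.println("[.-/] matches: " + countMatches(RANGE_TRAP, dates));
        System.out.println("[-./] matches: " + countMatches(HYPHEN_FIRST, dates));
    }
}
```

With the range version, the date that uses hyphens is missed; with the hyphen placed first, all seven sample dates match.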
Try It Out Matching Dates
1. Open PowerGrep, and enter the regular expression pattern (20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9] in the Search text box.
2. Enter C:\BRegExp\Ch05 in the Folder text box, assuming that you have saved the Chapter 5 files from the download in that directory.
3. Enter Dates.txt in the File Mask text box.
4. Click the Search button, and inspect the results shown in Figure 5-16. Notice particularly that the first match, 2004-12-31, includes a hyphen, confirming that the regular expression pattern works as desired.
The next component of the pattern, [01], matches the numeric digits 0 or 1, because months always have 0 or 1 as the first digit in this date format. Similarly, the next component, the character class [0-9], matches any digit from 0 through 9. This would allow numbers for the month such as 14 or 18, which are obviously undesirable. One of the exercises at the end of this chapter will ask you to provide a more specific pattern that would allow only values from 01 to 12 inclusive.
Next, the character class pattern [-./] matches a single character that is a hyphen, a period, or a forward slash.
Finally, the pattern [0123][0-9] matches days of the month beginning with 0, 1, 2, or 3. As written, the pattern would allow values for the day of the month such as 00, 34, or 38. A later exercise will ask you to create a more specific pattern to constrain values to 01 through 31.
Finding HTML Heading Elements
One potential use for character classes is in finding HTML/XHTML heading elements. As you probably know, HTML and XHTML 1.0 have six heading elements: h1, h2, h3, h4, h5, and h6. In XHTML the h must be lowercase. In HTML it is permitted to be h or H.
First, assume that all the elements are written using a lowercase h. So it would be possible to match the start tag of all six elements, assuming that there are no attributes, using a fairly cumbersome regular expression with parentheses:
<(h1|h2|h3|h4|h5|h6)>
In this case the < character is the literal left angled bracket, which is the first character in the start tag. Then there is a choice of six two-character sequences representing the element type of each HTML/XHTML heading element. Finally, a > is the final literal character of the start tag.
However, because there is a sequence of numbers from 1 to 6, you can use a character class to match the same start tags, either by listing each number literally:
<h[123456]>
or by using a range in the character class:
<h[1-6]>
The sample file, HTMLHeaders.txt, is shown here:
<h1>Some sample header text.</h1>
<h3>Some text.</h3>
<h6>Some header text.</h6>
<h4></h4>
<h5>Some text.</h5>
<h2>Some fairly meaningless text.</h2>
There is an example of each of the six headers.
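To confirm that the alternation and the character class select the same start tags, the two patterns can be compared against the sample text. This is a minimal sketch in Java, assuming Java 8 or later; the class name HeadingTagSketch and its helper are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingTagSketch {
    // The contents of HTMLHeaders.txt, one heading element per line.
    static final String SAMPLE = String.join("\n",
            "<h1>Some sample header text.</h1>",
            "<h3>Some text.</h3>",
            "<h6>Some header text.</h6>",
            "<h4></h4>",
            "<h5>Some text.</h5>",
            "<h2>Some fairly meaningless text.</h2>");

    // Counts non-overlapping matches of regex in text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Both patterns match the six start tags; end tags such as </h1> are not matched.
        System.out.println(countMatches("<(h1|h2|h3|h4|h5|h6)>", SAMPLE));
        System.out.println(countMatches("<h[1-6]>", SAMPLE));
    }
}
```

The end tags are skipped because the character after < is a forward slash rather than h.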
Try It Out Matching HTML Headers
1. Open PowerGrep, and enter the regular expression pattern <h[1-6]> in the Search: text box.
2. Enter C:\BRegExp\Ch05 in the Folder text box, assuming that you have saved the Chapter 5 files from the download in that directory.
3. Enter HTMLHeaders.txt in the File Mask text box.
4. Click the Search button, and inspect the results, as shown in Figure 5-17.
If the ^ metacharacter occurs in any position inside square brackets other than the character that immediately follows the left square bracket, the ^ metacharacter has its literal meaning; that is, it matches the ^ character.
A test file, Carets.txt, is shown here:
14^2 expresses the idea of 14 to the power 2
The ^ character is called a caret
The _ character is called an underscore or underline character
3^2 = 9
Eating ^s helps you see in the dark At least that’s what I think he said
The problem definition can be expressed as follows:
Match any occurrence of the following characters: the underscore, the caret, or the numeric digit 3
The character class to satisfy that problem definition is as follows:
[_^3]
Try It Out Using the ^ Inside a Character Class
This example matches the three characters mentioned in the preceding problem definition:
1. Open OpenOffice.org Writer, and open the test file Carets.txt.
2. Use the Ctrl+F keyboard shortcut to open the Find & Replace dialog box.
3. Check the Regular Expressions and Match Case check boxes, and enter the pattern [_^3] in the Search For text box.
4. Click the Find All button, and inspect the results, as shown in Figure 5-18.
5. Modify the regular expression pattern so that it reads [^_3].
6. Click the Find All button, and compare the results shown in Figure 5-19 with the previous results.
How It Works
When the pattern is [_^3], the meaning is simply a character class that matches three characters: the underscore, the caret, and the numeric digit 3.
When the ^ immediately follows the left square bracket, [, that creates a negated character class, which in this case has the meaning “Match any character except an underscore or the numeric digit 3.”
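The difference between the two positions of the caret can be seen side by side in code. The following is a minimal sketch in Java, assuming Java 8 or later; the class name CaretPositionSketch and its helper are illustrative. It collects every matched character from one line of Carets.txt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaretPositionSketch {
    // Concatenates every match of regex found in text.
    static String matchedChars(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = "3^2 = 9";
        // Caret not in first position: a literal caret, so 3 and ^ are matched.
        System.out.println(matchedChars("[_^3]", line));
        // Caret in first position: negation, so everything except _ and 3 is matched.
        System.out.println(matchedChars("[^_3]", line));
    }
}
```

Moving the caret by one position turns a class of three literal characters into a class that excludes two of them.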
Figure 5-18
How to Use the - Metacharacter
You have already seen how the hyphen can be used to indicate a range inside a character class. The question therefore arises as to how you can specify a literal hyphen inside a character class.
The safest way is to use the hyphen as the first character after the left square bracket. In some tools, such as the Komodo Regular Expressions Toolkit, you can also use the hyphen as the character immediately before the right square bracket to match a hyphen. In OpenOffice.org Writer, for example, that doesn’t work.
Figure 5-19
Negated Character Classes
Negated character classes always attempt to match a character. So the following negated character class means “Match a character that is not in the range uppercase A through F.”
[^A-F]
Combining Positive and Negative Character Classes
Some languages, such as Java, allow you to combine positive and negative character classes.
The following example shows how combined character classes can be used. The problem definition is as follows:
Match characters A and E through Z
An alternative way to express that notion is as follows:
Match characters A through Z but not B through D
You can express that in Java by combining character classes, as follows:
public class CombinedClass2{
public static void main(String args[])throws Exception{
String TestString = args[0];
String regex = "[A-Z&&[^B-D]]";
Try It Out Combined Character Classes
These instructions assume that you have Java 1.4 correctly installed and configured. This example demonstrates how to use combined character classes in Java:
1. Open a command prompt window, and at the command line, type javac CombinedClass2.java to compile the source code.
2. Type java CombinedClass2 "A C E G" to run the program and supply a test string.
A regular expression is assigned to the variable regex:
String regex = "[A-Z&&[^B-D]]";
The regular expression is the combined character class described earlier.
The static compile() method of the Pattern class is executed with the regex variable as its argument:
Pattern p = Pattern.compile(regex);
Next, the matcher() method of the Pattern object, p, is executed with the TestString variable as its argument:
Matcher m = p.matcher(TestString);
A new variable, match, is assigned the value null:
String match = null;
The simple output shows the test string that was supplied on the command line; the regular expression pattern that was used; and, if there are one or more matches, a list of each match or, if there was no match, a message indicating that no matches were found:
System.out.println("INPUT: " + TestString);
System.out.println("REGEX: " + regex);
while (m.find()){
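Because the listing is shown only in fragments here, the following is a minimal self-contained sketch of a program along the same lines. It is not the author's CombinedClass2 source: the class name CombinedClassSketch, the findMatches() helper, the "MATCH:" label, and the default test string are assumptions made for illustration, and the code uses generics, so it assumes Java 8 or later rather than Java 1.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CombinedClassSketch {
    // A through Z, intersected with "not B through D": A and E through Z.
    static final String REGEX = "[A-Z&&[^B-D]]";

    // Returns every match of REGEX in testString, in order of occurrence.
    static List<String> findMatches(String testString) {
        Pattern p = Pattern.compile(REGEX);
        Matcher m = p.matcher(testString);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Fall back to the chapter's test string if no argument is supplied.
        String testString = args.length > 0 ? args[0] : "A C E G";
        List<String> matches = findMatches(testString);
        System.out.println("INPUT: " + testString);
        System.out.println("REGEX: " + REGEX);
        if (matches.isEmpty()) {
            System.out.println("No matches found.");
        } else {
            for (String match : matches) {
                System.out.println("MATCH: " + match);
            }
        }
    }
}
```

For the test string A C E G, the characters A, E, and G match, and C does not because it falls in the excluded range B through D.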
POSIX Character Classes
Some regular expression implementations support a very different character class notation: the POSIX character class notation. The POSIX approach uses a naming convention for a number of potentially useful character classes instead of specifying character classes in the way you saw earlier in this chapter. For example, instead of the character class [A-Za-z0-9], where the characters are listed, the POSIX character class uses [:alnum:], where alnum is an abbreviation for alphanumeric. Personally, I prefer the syntax used earlier in this chapter. However, because you may see code that uses POSIX character classes, this section gives brief information about them.
As an example, the [:alnum:] character class is shown.
The POSIX syntax is dependent on locale. The syntax described in this section relates to English-language locales.
The [:alnum:] Character Class
The [:alnum:] character class varies in how it is implemented in various tools. Broadly speaking, the [:alnum:] class is equivalent to the following character class:
[A-Za-z0-9]
However, there are different interpretations of [:alnum:].
Try It Out The [:alnum:] Class in OpenOffice.org Writer
In OpenOffice.org Writer it is necessary to add a ? quantifier (or other quantifier) to successfully use the [:alnum:] character class:
1. Open OpenOffice.org Writer, and open the sample file AlnumTest.txt.
2. Use the Ctrl+F keyboard shortcut to open the Find & Replace dialog box.
3. Check the Regular Expressions and Match Case check boxes, and enter the pattern [:alnum:]? in the Search For text box.
4. Click the Find All button, and inspect the highlighted text, as shown in Figure 5-21, to identify matches for the pattern [:alnum:]?.
Notice that the underscore character, which occurs twice in the final line of text in the sample file, is not matched by the [:alnum:]? pattern.
Figure 5-21
If Step 4 is replaced by clicking the Find button, assuming that the cursor is at the beginning of the test file, the initial uppercase A will be matched, because that is the first matching character.
How It Works
If the regular expression engine starts at the position immediately before the A of the first line of the test file, the A is tested against the pattern [:alnum:]?. There is a match because uppercase A is an alphabetic character. The matched text is highlighted in reverse video.
When the Find All button is used, after that first successful match the regular expression engine moves to the position between A and B and attempts to match against the following character, B. That matches, and so it, too, is highlighted in reverse video. The regular expression engine moves to the next position and then matches the C, and so on. When the newline character is reached, there is no match against the pattern [:alnum:]?, and the regular expression engine moves on to the position after the newline character and attempts to match the next character.
When the regular expression engine reaches the position before the underscore character and attempts to match that character, there is no match, because the underscore character is neither an alphabetic character nor a numeric digit.
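Java's java.util.regex package does not use the [:alnum:] spelling; its equivalent POSIX class is written \p{Alnum}, which by default covers ASCII letters and digits. The following minimal sketch, assuming Java 8 or later, shows that the underscore is skipped in the same way. The class name AlnumSketch and the test string are illustrative; the contents of AlnumTest.txt are not reproduced here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlnumSketch {
    // Concatenates every match of regex found in text.
    static String matchedChars(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "ABC_123 abc"; // hypothetical test text
        // POSIX-style class in Java's notation.
        System.out.println(matchedChars("\\p{Alnum}", text));
        // The explicit character class gives the same result for this text.
        System.out.println(matchedChars("[A-Za-z0-9]", text));
    }
}
```

Both patterns pick out the letters and digits and leave the underscore and the space unmatched.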
Exercises
1. You have a document that contains American English and British English. State a problem definition to locate occurrences of license (U.S. English) and licence (British English). Specify a regular expression pattern using a character class to find both sequences of characters.
2. The pattern (20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9] was used earlier in this chapter to match dates. As written, this pattern would allow months such as 00, 13, or 19 and allow days such as 00, 32, and 39. Modify the relevant components of the pattern so that only months 01 through 12 and days 01 through 31 are allowed.
String, Line, and Word Boundaries
This chapter looks at metacharacters that match positions before, between, or after characters rather than selecting matching characters. These positional metacharacters complement the metacharacters that were described in Chapter 4, each of which signified characters to be matched.
For example, you will see how to match characters, or sequences of characters, that immediately follow the position at the beginning of a line. In normal English you might, for example, say that you want to match a specified sequence of characters only when they immediately follow the beginning of a line or the beginning of the whole test text. The implication is that you don’t want to match the specified sequence of characters if they occur anywhere else in the text. So using a positional character in this way can significantly change the sequences of characters that match or fail to match.
Equally, you might want to look for whole words rather than sequences of characters, or sequences of characters when they occur in relation to the beginning or end of a word. Many regular expression implementations have positional metacharacters that allow you to do that.
This chapter provides you with the information needed to make matches based on the position of a sequence of characters.
The term anchor is sometimes used to refer to the metacharacters that match a position rather than a character.
In some documentation (for example, the documentation for .NET regular expression functionality), these same positional metacharacters are termed atomic zero-width assertions.
This chapter looks at how to do the following:
❑ Use the ^ metacharacter, which matches the position at the beginning of a string or a line
❑ Use the $ metacharacter, which matches the position at the end of a string or a line
❑ Use the \< and \> metacharacters to match the beginning and end of a word, respectively
❑ Use the \b metacharacter, which matches a word boundary (which can occur at the beginning of a word or at the end of a word)
String, Line, and Word Boundaries
Metacharacters that allow you to create patterns that match sequences of characters that occur at specific positions can be very useful.
For example, suppose that you wanted to find all lines that begin with the word The. With the techniques you have seen and used in earlier chapters, you can readily create a literal pattern to match the sequence of characters The, but with those techniques you haven’t been able to specify where the sequence of characters occurs in the text, nor whether it is a whole word or forms part of a longer word. The relevant pattern, written as The, would match sequences of characters such as There, Then, and so on at the beginning of a sentence in addition to the word The and would also match parts of personal or business names such as Theodore or Theatre.
Similarly, assuming that you used the pattern The in a case-insensitive mode, you would also (possibly as an undesired side effect) match sequences of characters such as the in the word lathe. At other times, you might want to find a sequence of characters only when they occur at the end of a word (again for example, the the in lathe).
The ^ and $ metacharacters, which are used to specify a position in relation to the beginning and end of a line or string, are discussed and demonstrated first.
The pattern The, when applied to the test text
The Thespian Theatre opens at 19:00
would match the sequence of characters The in the words The, Thespian, and Theatre.
However, the same pattern preceded by the ^ metacharacter
^The
when applied to the same test text would match only the sequence of characters The in the word The because that sequence of characters occurs immediately after the start of the string.
Try It Out Theatre Example
Use the very simple test text in the file Theatre.txt:
The Thespian Theatre opens at 19:00
1. Open PowerGrep, and check the Regular Expression check box.
2. Enter the pattern The in the Search text box.
3. Enter C:\BRegExp\Ch06 in the Folder text box.
4. Enter Theatre.txt in the File Mask text box.
5. Click the Search button, and inspect the results in the Results area, as shown in Figure 6-1. Notice that the information in the Results area indicates three matches for the pattern The.
Figure 6-1
6. Edit the regular expression pattern so that it reads ^The.
7. Click the Search button, and inspect the results in the Results area, as shown in Figure 6-2. Notice that there is now only one match, in contrast to the three matches before you edited the regular expression pattern.
The ^ metacharacter, when used outside a character class, does not have the negation meaning that it has when used as the first character inside a character class.
Figure 6-2
How It Works
The regular expression engine starts at the position before the first character in the test file. The first metacharacter in the pattern, the ^ metacharacter, is matched against the regular expression engine’s current position. Because the regular expression engine is at the beginning of the file, the condition specified by the ^ metacharacter is satisfied, so the regular expression engine can proceed to attempt to match the other characters in the regular expression pattern. The next character in the pattern, the literal uppercase T, is matched against the first character in the test file, which is uppercase T. There is a match, so the regular expression engine attempts to match the next character in the pattern, lowercase h, against the second character in the test text, which is also lowercase h. The literal h in the pattern matches the literal h in the test text. Then the regular expression engine attempts to match the literal e in the pattern against the third character in the test text, lowercase e. There is a match. Because all components of the regular expression match, the entire regular expression matches.
If the regular expression attempts a match when the current position is anything other than the position before the first character of the test text, matching fails on that first metacharacter, ^. Therefore, the pattern as a whole cannot match. Matching fails except at the beginning of the test text.
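The same comparison can be reproduced in code. The following is a minimal sketch in Java, assuming Java 8 or later; the class name StartAnchorSketch and its helper are illustrative. It counts the matches for the pattern with and without the ^ metacharacter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartAnchorSketch {
    // Counts non-overlapping matches of regex in text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "The Thespian Theatre opens at 19:00";
        // Unanchored: The, Thespian, and Theatre each contain a match.
        System.out.println(countMatches("The", text));
        // Anchored: only the occurrence at the start of the string matches.
        System.out.println(countMatches("^The", text));
    }
}
```

The counts, three and one, correspond to the results described for Figures 6-1 and 6-2.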
The ^ Metacharacter and Multiline Mode
In the preceding example, the test text is a single line, so you were able to examine the use of the ^ metacharacter without bothering about whether the ^ metacharacter would match the beginning of the test text or the beginning of each line, because the two concepts were the same. However, in several tools and languages, it is possible to modify the behavior of the ^ metacharacter so that it matches the position before the first character of each line or only at the beginning of the first line of the test file.
When using the Komodo Regular Expression Toolkit, for example, the following test text
This
Then
will fail to find a match when the pattern is as follows:
^The
Figure 6-3 shows the failure to match.
Figure 6-3
However, if you check the Multi-Line Mode check box, the sequence of characters The on the second line is highlighted and, in the gray area below, the message Match succeeded: 0 groups is displayed, as you can see in Figure 6-4.
Figure 6-4
When multiline mode is used, the position after a Unicode newline character is treated in the same way as the position that comes at the beginning of the test file. A Unicode newline character matches any of the characters or character combinations that can be used to express the notion of a newline.
Not all programming languages support multiline mode. How individual programming languages treat this issue is discussed and, where appropriate, demonstrated in later chapters that deal with individual programming languages.
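As one illustration of a language that does support it, Java switches multiline mode on with the Pattern.MULTILINE flag. The following minimal sketch, assuming Java 8 or later, applies ^The to the two-line text This/Then with and without the flag; the class name MultilineCaretSketch and its helper are illustrative.

```java
import java.util.regex.Pattern;

class MultilineCaretSketch {
    // Reports whether regex, compiled with the given flags, is found anywhere in text.
    static boolean isFound(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "This\nThen";
        // Default mode: ^ matches only at the start of the whole text, so no match.
        System.out.println(isFound("^The", 0, text));
        // Multiline mode: ^ also matches after the newline, so Then supplies a match.
        System.out.println(isFound("^The", Pattern.MULTILINE, text));
    }
}
```

The first call reports no match and the second reports a match, mirroring the behavior shown in Figures 6-3 and 6-4.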
Try It Out The ^ Metacharacter and Multiline Mode
This exercise uses the test file TheatreMultiline.txt:
The Thespian Theatre opens at 19:00
Then theatrical people enter the building
They greatly enjoy the performance
The interval is the time for liquid refreshment
Notice that each line begins with the sequence of characters The.
Some tools, such as PowerGrep, are in multiline mode by default, as shown here.
1. Open PowerGrep, and check the Regular Expressions check box.
2. Enter the regular expression pattern ^The in the Search text box.
3. Enter C:\BRegExp\Ch06 in the Folder text box. Adjust this if you chose to put the download files in a different folder.
4. Enter TheatreMultiline.txt in the File Mask text box.
5. Click the Search button, and inspect the results in the Results area, as shown in Figure 6-5. Notice the character sequence The at the beginning of each line is highlighted as a match, indicating the default behavior of multiline mode.
Figure 6-5
The $ Metacharacter
The ^ metacharacter allows you to be specific about where a matching sequence of characters occurs at the beginning of a file or the beginning of a line. The $ metacharacter provides complementary functionality in that it specifies matches in a sequence of characters that immediately precede the end of a line or the end of a file.
the$
Try It Out The $ Metacharacter
This example demonstrates the use of the pattern the$:
1. Open PowerGrep, and check the Regular Expressions check box
2. Enter the pattern the$ in the Search text box.
3. Enter C:\BRegExp\Ch06 in the Folder text box.
4. Enter Lathe.txt in the File Mask text box.
5. Click the Search button, and inspect the results displayed in the Results area, as shown in Figure 6-6.
Figure 6-6
Notice that there is only one match and that the sequence of characters The at the beginning of the line does not match, nor does the word the, which precedes the word lathe.
6. Delete the $ metacharacter in the Search text box.
7. Click the Search button, and inspect the revised results in the Results area.
Notice that with the $ metacharacter deleted the pattern now has three matches (not illustrated). The first is the The at the beginning of the test text. That matches because the default behavior in PowerGrep is a case-insensitive match. The second is the word the before the word lathe. The third is the character sequence the, which is contained in the word lathe.
How It Works
The default behavior of PowerGrep is case-insensitive matching. When the regular expression engine starts to match the pattern the$, it starts at the position before the initial The. The regular expression engine attempts to match The and succeeds. Finally, the regular expression engine attempts to match the $ metacharacter against the position that follows the lowercase e in the test text. That position is not the end of the test string; therefore, the match fails. Because one component of the pattern fails to match, the whole pattern fails to match.
Attempted matching progresses through the test text. The first three characters of the pattern match when the regular expression engine is at the position immediately before the word the. However, as described earlier, the $ metacharacter fails to match; therefore, there is no match for the whole pattern.
However, when the regular expression engine reaches the position after the a of lathe and attempts to match, there is a match. The first character of the pattern, lowercase t, matches the next character, the lowercase t of lathe. The second character of the pattern, lowercase h, matches the h of lathe. The third character of the pattern, lowercase e, matches the lowercase e of lathe. The $ metacharacter of the pattern does match, because the e of lathe is the final character of the test string. Because all components of the pattern match, the whole pattern matches, and the character sequence the of lathe is highlighted as a match in Figure 6-6.
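The effect of the $ metacharacter under case-insensitive matching can be sketched in Java using the Pattern.CASE_INSENSITIVE flag. The contents of Lathe.txt are not reproduced in this section, so the test string below is a hypothetical stand-in that begins with The, contains the word the, and ends with lathe. The class name EndAnchorSketch is also illustrative, and the code assumes Java 8 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndAnchorSketch {
    // Counts case-insensitive, non-overlapping matches of regex in text.
    static int countIgnoringCase(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "The woodworker used the lathe"; // hypothetical stand-in for Lathe.txt
        // Anchored to the end: only the "the" inside lathe qualifies.
        System.out.println(countIgnoringCase("the$", text));
        // Unanchored: The, the, and the "the" in lathe all match.
        System.out.println(countIgnoringCase("the", text));
    }
}
```

With the anchor there is a single match at the end of the string; without it there are three.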
The $ Metacharacter in Multiline Mode
Like the ^ metacharacter, the $ metacharacter can have its behavior modified when it is used in multiline mode. However, not all tools or languages support multiline mode for the $ metacharacter.
Tools or languages that support the $ metacharacter in multiline mode use the $ metacharacter to match the position immediately before a Unicode newline character. Some also match the position immediately before the end of the test string, but not all do, as you will see later.
The sample file, ArtMultiple.txt, is shown here:
A part for his car
Wisdom which he wants to impart
Leonardo da Vinci was a star of medieval art
At the start of the race there was a false start
Notice that to make the example a test of the $ metacharacter, the period that might be expected at the end of each sentence has been omitted.
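For comparison with a programming language, the following minimal Java sketch (assuming Java 8 or later) counts matches for art and art$ in the four lines of sample text shown above, joined with newline characters and with no trailing newline. The class name EndAnchorMultilineSketch and its helper are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndAnchorMultilineSketch {
    // The four lines of sample text, separated by newlines.
    static final String SAMPLE = String.join("\n",
            "A part for his car",
            "Wisdom which he wants to impart",
            "Leonardo da Vinci was a star of medieval art",
            "At the start of the race there was a false start");

    // Counts non-overlapping matches of regex, compiled with the given flags.
    static int countMatches(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Every occurrence of art, wherever it appears.
        System.out.println(countMatches("art", 0, SAMPLE));
        // Multiline mode: art at the end of any line.
        System.out.println(countMatches("art$", Pattern.MULTILINE, SAMPLE));
        // Default mode in Java: art at the end of the whole text only.
        System.out.println(countMatches("art$", 0, SAMPLE));
    }
}
```

In Java, $ matches before each line terminator only when Pattern.MULTILINE is set; by default it matches only at the end of the input, so the third count covers just the final start.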
Try It Out The $ Metacharacter in Multiline Mode
This example demonstrates the use of the $ metacharacter with multiline mode:
1. Open PowerGrep, and check the Regular Expressions check box
2. Enter the pattern art in the Search text box.
3. Enter the text C:\BRegExp\Ch06 in the Folder text box.
4. Enter the text ArtMultiple.txt in the File Mask text box.
5. Click the Search button, and inspect the results in the Results area, as shown in Figure 6-7. Notice that occurrences of the sequence of characters art are matched when they occur at the end of a line and at other positions: in this example, part in Line 1 and the first occurrence of start in Line 7.
Figure 6-7
6. Edit the regular expression pattern to add the $ metacharacter at the end, giving art$.
7. Click the Search button, and inspect the results in the Results area, as shown in Figure 6-8. Notice that the matches for the pattern art that were previously present in the words part in Line 1 and the first occurrence of start in Line 7 are no longer present, because they do not occur at the end of a line. The $ metacharacter means that matches must occur at the end of a line.
When an attempt is made to match art in part in the first line, the first three characters of the regular expression pattern match; however, the final $ metacharacter of the pattern art$ fails to match. Because a component of the pattern has failed to match, the entire pattern fails to match.
When the regular expression engine has reached a position immediately before the a of impart, it can match the first three characters of the pattern art$ successfully against, respectively, the a, r, and t of impart. Finally, an attempt is made to match the $ metacharacter against the position immediately following the t of impart. Because that position immediately precedes a Unicode newline character (that is, it is the final position on that line), there is a match. Because all the components of the pattern match, the entire pattern matches.
When the regular expression engine has reached a position immediately before the a of the second start on the final line, it can match the first three characters of the pattern art$ successfully against, respectively, the a, r, and t of start. Finally, an attempt is made to match the $ metacharacter against the position immediately following the t of start. Because that position immediately precedes the end of the test string (that is, it is the final position of the test file), there is a match. Because all the components of the pattern match, the entire pattern matches.
Using the ^ and $ Metacharacters Together
Using the ^ and $ metacharacters together can be useful to identify lines that consist entirely of desired characters. This can be very useful when validating user input, for example.
The sample text, ABCPartNumbers.txt, is shown here:
ABC123
There is a part number ABC123
ABC234
A purchase order for 400 of ABC345 was received yesterday
ABC789
Notice that some lines consist only of a part number, whereas other lines include the part number as part of some surrounding text.
The intention is to match lines that consist only of a part number The problem definition is as follows:
Match a beginning of line position, followed by the literal sequence of characters A, B, and C, followed by three numeric digits, followed by a position that is either the end-of-line position or an end-of-string position.
Try It Out Matching Part Numbers
This example demonstrates using the ^ and $ metacharacters in the same pattern:
1. Open OpenOffice.org Writer, and open the test file ABCPartNumbers.txt
2. Open the Find & Replace dialog box, using the Ctrl+F keyboard shortcut, and check the Regular Expressions and Match Case check boxes.
3. Enter the pattern ^ABC[0-9]{3}$ in the Search For text box.
4. Click the Find All button, and inspect the highlighted text, as shown in Figure 6-9. Notice how three occurrences of a sequence of characters representing a part number are highlighted as matches, while two occurrences of a part number are not highlighted because they are not matches.
Figure 6-9
How It Works
The regular expression engine begins the matching process at the start of the test file. It attempts to match the ^ metacharacter against the current position. There is a match. It next attempts to match the literal character A in the pattern against the first character in the line, which is uppercase A. There is a match. The matching process is repeated successfully for the literal characters B and C. Then the regular expression engine attempts to match the pattern [0-9]{3}. It attempts to match the character class [0-9] against the character 1 in the test text. That matches. It then proceeds to match the character class [0-9] a second time, this time against the character 2. That also matches. It next proceeds to match the character class [0-9] for a third time, as indicated by the {3} quantifier, against the character 3. That too matches. Finally, it attempts to match the $ metacharacter against the position following the character 3. That matches because it immediately precedes a Unicode newline character. Each component of the pattern matches; therefore, the entire pattern matches.
At the beginning of the second line, the regular expression successfully matches the ^ metacharacter. It next attempts to match the literal character A in the pattern against the first character on the line, an uppercase T. The attempt at matching fails. Any subsequent attempt to match on that line fails when the attempt is made to match the ^ metacharacter because the position is not at the beginning of the line.
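The whole-line match can be sketched in Java as well, assuming Java 8 or later and using Pattern.MULTILINE so that ^ and $ apply to each line of the sample text. The class name PartNumberSketch and the wholeLineMatches() helper are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartNumberSketch {
    // The five lines of ABCPartNumbers.txt, separated by newlines.
    static final String SAMPLE = String.join("\n",
            "ABC123",
            "There is a part number ABC123",
            "ABC234",
            "A purchase order for 400 of ABC345 was received yesterday",
            "ABC789");

    // Returns each line that consists only of ABC followed by three digits.
    static List<String> wholeLineMatches(String text) {
        Pattern p = Pattern.compile("^ABC[0-9]{3}$", Pattern.MULTILINE);
        Matcher m = p.matcher(text);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Part numbers embedded in longer lines are not reported.
        System.out.println(wholeLineMatches(SAMPLE));
    }
}
```

Only the three lines that consist entirely of a part number are returned; ABC123 on the second line and ABC345 on the fourth are rejected by the anchors.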
Matching Blank Lines
One of the potential uses of the ^ and $ metacharacters together is to match blank lines. The following pattern should match a blank line, because the ^ metacharacter signifies the beginning of the line and the $ metacharacter signifies the position immediately before either a Unicode newline character or the end of the test string.
^$
However, not all tools support this pattern.
The test file, WithBlankLines.txt, is shown here:
Line 1

Line 3 which follows a blank line

Line 5 which follows a second blank line

Line 7 which follows a third blank line
After Line 7, there are two further blank lines to end the test file.
Try It Out Replacing Blank Lines
1. Open OpenOffice.org Writer, and open the test file WithBlankLines.txt.
2. Open the Find & Replace dialog box using the Ctrl+F keyboard shortcut, and check the Regular Expressions and Match Case check boxes.
3. Enter the pattern ^$ in the Search For text box.
4. Click the Find All button, and inspect the results, as shown in Figure 6-10.
Figure 6-10
Each blank line, except the last two, is highlighted as a match. If you try to scroll down, you will find that OpenOffice.org Writer has lost one of the blank lines that is present if you open the WithBlankLines.txt file in Notepad. If you manually reenter one of the blank lines that OpenOffice.org Writer strips out, an additional blank line will match. A blank line at the end of a file seems not to match in OpenOffice.org Writer.
5. Click the Replace All button, and inspect the results, as shown in Figure 6-11. Notice that the three previously highlighted blank lines have been deleted.
Figure 6-11
How It Works
The second line of the original test file is a blank line. When the regular expression engine is at the position at the beginning of that blank line, matching is attempted against the ^ metacharacter. There is a match. Without moving its position, the regular expression engine then attempts to match the $ metacharacter against the same position. Because that position immediately precedes a Unicode newline character, there is a match for the $ metacharacter, too. Therefore, the entire pattern matches. In OpenOffice.org Writer, the matching of the blank line leads to the entire width of the text area on that line being highlighted.
When the regular expression engine is at the beginning of the third line of the original file, it first attempts to match the ^ metacharacter. That matches. It next attempts to match the $ metacharacter against the same position. Because the position is followed by the character uppercase L, it is not the position that precedes a Unicode newline character. Therefore, the attempt at matching fails.
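If you prefer to experiment in code rather than in OpenOffice.org Writer, the same idea can be sketched in Python. This is an illustration under stated assumptions, not a description of how Writer works internally: the string below reproduces the four text lines of WithBlankLines.txt separated by single blank lines, the two trailing blank lines are deliberately left out, and re.MULTILINE is needed so that ^ and $ apply to each line.

```python
import re

# The four text lines of WithBlankLines.txt, separated by blank lines.
# The two trailing blank lines are omitted to keep the count easy to follow.
text = (
    "Line 1\n"
    "\n"
    "Line 3 which follows a blank line\n"
    "\n"
    "Line 5 which follows a second blank line\n"
    "\n"
    "Line 7 which follows a third blank line"
)

blank_line = re.compile(r"^$", re.MULTILINE)

# Each match is an empty string at the position where a blank line starts.
blank_positions = [m.start() for m in blank_line.finditer(text)]

# Deleting a blank line also means consuming its newline character,
# so the replacement pattern adds \n after the two anchors.
without_blanks = re.sub(r"^$\n", "", text, flags=re.MULTILINE)

print(len(blank_positions))
print(without_blanks)
```

Note that ^$ on its own matches a zero-width position; it is the added \n in the replacement pattern that actually removes the line.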
Working with Dollar Amounts
Because the $ metacharacter in a regular expression pattern indicates the end-of-line (or end-of-string) position, you cannot use that metacharacter to match the dollar currency symbol in a document. To match the dollar sign in a string, you must use the \$ escape sequence.
The sample file, DollarUsage.txt, is used to explore how to use the \$ escape sequence:
The pound, £, and US dollar, $, are major global currencies
As you can see, the $ sign may occur in situations other than simply being at the beginning of a sequence of numeric digits. For example, the first line indicates how the dollar sign might appear in a piece of narrative text. The third line, 99,00$, indicates how a dollar amount might be written in a non-English locale or, perhaps, how it might be written by someone who is not a native speaker of English.
Matching a literal $ sign is straightforward; you can simply use the following regular expression pattern, which will match all occurrences of the dollar sign in text:
\$
Figure 6-12 shows the application of that simple pattern in PowerGrep.
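The difference between the escaped and unescaped dollar sign is easy to check in code. The following Python sketch uses the first line of DollarUsage.txt as shown above; the variable names are arbitrary.

```python
import re

line = "The pound, £, and US dollar, $, are major global currencies"

# Escaped: \$ matches the literal dollar character.
literal_hits = re.findall(r"\$", line)

# Unescaped: $ is an anchor, so it matches the empty end-of-string position.
anchor_hit = re.search(r"$", line)

print(literal_hits)
print(anchor_hit.start() == len(line), repr(anchor_hit.group(0)))
```

The unescaped pattern still reports a match, but the match is a zero-length position at the end of the string, not the currency symbol.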
Suppose that you want to detect a dollar sign only when the dollar sign is followed by numeric digits. Even something seemingly this simple may not be entirely straightforward. For example, the third-to-last line has a space character following the $ sign, which you need to take into account if you want to match all relevant occurrences of the $ sign:
$ 0.99
Depending on the regular expression implementation, you can express a pattern for numeric digits in several ways: \d, [0-9], and [:digit:].
First, try to match situations where a dollar sign is followed by one or more numeric digits, followed by a period, followed by zero or more numeric digits. The following pattern expresses that:
\$[0-9]+\.[0-9]*
The \$ matches a literal dollar sign. The character class [0-9] matches a numeric digit, and the + quantifier indicates that there is at least one numeric digit. Following that is a literal period character, indicated by the escape sequence \. (a backslash followed by a period). Finally, the pattern [0-9]* indicates that zero or more numeric digits can occur after the period.
Figure 6-13 shows this pattern applied against DollarUsage.txt when using OpenOffice.org Writer. Notice that only three of the examples in DollarUsage.txt are matched. Can you see, for example, why the examples $1,000,000 and $1000 have not been matched?
\$ *[0-9]+\.?[0-9]*
By using the * quantifier after the space character in the preceding pattern, you allow for zero, one, or several space characters after the dollar sign.
Figure 6-14
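To compare the two dollar-amount patterns side by side, you could run something like the following Python sketch. Four of the sample strings are quoted in the text above; "$0.99" is a hypothetical extra added for contrast, and the names strict, relaxed, and first_match are invented for this illustration.

```python
import re

# Four of these strings are quoted in the text; "$0.99" is a hypothetical extra.
samples = ["$0.99", "$ 0.99", "$1000", "$1,000,000", "99,00$"]

strict = re.compile(r"\$[0-9]+\.[0-9]*")      # requires a period after the digits
relaxed = re.compile(r"\$ *[0-9]+\.?[0-9]*")  # optional spaces, optional period


def first_match(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    found = pattern.search(text)
    return found.group(0) if found else None


strict_results = {s: first_match(strict, s) for s in samples}
relaxed_results = {s: first_match(relaxed, s) for s in samples}

for s in samples:
    print(repr(s), strict_results[s], relaxed_results[s])
```

The relaxed pattern matches $1000 and $ 0.99, but for $1,000,000 it matches only $1, because neither pattern allows for a comma. The trailing dollar sign in 99,00$ is matched by neither pattern, since no digits follow it.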
Revisiting the IP Address Example
In Chapter 5, we spent some time looking at how you could use character classes to match IP addresses, using the following sample file, IPLike.txt:
12.12.12.12
255.255.256.255
12.255.12.255
256.123.256.123
8.234.88.55
196.83.83.191
8.234.88,55
88.173.71.66
241.92.88.103
Now that you have looked at the meaning and use of the ^ and $ metacharacters, you are in a position to take that example to a successful conclusion.
Try It Out Matching IP Addresses
These instructions assume that you have closed OpenOffice.org Writer.
1. Open OpenOffice.org Writer, and open the test file IPLike.txt.
2. Open the Find & Replace dialog box using the Ctrl+F keyboard shortcut, and check the Regular Expressions and Match Case check boxes.
3. Enter the regular expression pattern ^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$ in the Search For text box.
4. Click the Find All button, and inspect the results, as shown in Figure 6-15. Notice that the lines containing a value of 256 are not matched, which is what you wanted.
The regular expression pattern that works is shown here:
^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$
Figure 6-15
How It Works
First, let’s break the regular expression down into its component parts
The initial ^ metacharacter indicates that there is a match only when matching is being attempted from a position at the beginning of a line.
The following component indicates several options for numeric values, each of which is followed by a literal period in the test text:
((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}
Remember that the escape sequence that matches a literal period is \. (a backslash followed by a period). If you had used an unescaped period in the pattern, the test text would have matched, but so would any other character in that position, including any alphanumeric character. This would have led to undesired matches such as the following, which is clearly not an IP address:
12G255F12H255
Using the . metacharacter would have lost much of the specificity that you obtain by using the \. escape sequence.
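You can confirm the effect of the missing backslash with a short Python sketch. The octet variable is simply a convenience for building the two patterns and is not part of the original example; apart from the separator, both patterns are the one used in the exercise.

```python
import re

# One numeric value from 0 to 255, exactly as in the pattern above.
octet = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"

escaped = re.compile(r"^(" + octet + r"\.){3}" + octet + r"$")   # literal period
unescaped = re.compile(r"^(" + octet + r".){3}" + octet + r"$")  # any character

not_an_ip = "12G255F12H255"
real_ip = "12.255.12.255"

escaped_accepts_junk = bool(escaped.search(not_an_ip))
unescaped_accepts_junk = bool(unescaped.search(not_an_ip))

print(escaped_accepts_junk, unescaped_accepts_junk)
```

Both patterns accept the genuine address; only the version with the unescaped period also accepts the string containing G, F, and H as separators.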
The first time the following pattern is processed, it immediately follows the position that indicates the start of a line:
((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}
That means a match succeeds if any of the options inside the nested parentheses are found between the start-of-line position and a literal period character. Given the way the options are constructed, only numeric values from 0 to 255 are matched.
The second time the following pattern is matched, you know that it is preceded by a literal period character (because the second attempt at matching follows the first attempt, which you know ends with a literal period):
((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}
Because the final component of the pattern ends in a $ metacharacter, you know that it matches a numeric value from 0 to 255 only when that value follows a literal period character (which was the final character to match the third attempt to use the earlier component of the pattern) and is followed by a Unicode newline character.
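The whole exercise can also be tried in Python. The sketch below is a minimal illustration: the pattern is the one from step 3, the line breaks in the sample data are an assumption about how IPLike.txt is laid out, and re.MULTILINE is needed so that ^ and $ apply to each line of the multi-line string.

```python
import re

# Assumed layout of IPLike.txt: one candidate address per line.
ip_like = """12.12.12.12
255.255.256.255
12.255.12.255
256.123.256.123
8.234.88.55
196.83.83.191
8.234.88,55
88.173.71.66
241.92.88.103"""

ip_pattern = re.compile(
    r"^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}"
    r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$",
    re.MULTILINE,
)

# finditer is used instead of findall because the pattern contains
# capturing groups, and only the whole match is of interest here.
valid = [m.group(0) for m in ip_pattern.finditer(ip_like)]
print(valid)
```

The two lines containing 256 are rejected, as is the line in which a comma replaces the final period.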
What Is a Word?
The notion of what constitutes a word might seem, at first sight, to be obvious. But if you were asked to say which of the sequences of characters on the following lines were words, what would your answer be? And what criteria would you use to arrive at your opinion?
Clearly, it isn’t realistic to expect a text processor to have knowledge about what is or isn’t a word in English, French, German, or any of a host of other languages. Similarly, you can’t expect a text processor to have knowledge in all technical areas. So you need another technique, a more mechanistic one, to allow identification of word boundaries.
Identifying Word Boundaries
A word boundary can be viewed as two positions: one at the beginning of a sequence of characters that form a word and one at the end of a sequence of characters that form a word.
Depending on which tools or languages you use, there are metacharacters that match a word-boundary position occurring at the beginning of a word, a word-boundary position occurring at the end of a word, or both.
The \< Syntax
The \< metacharacter identifies a word-boundary position occurring at the beginning of a word. That position is preceded either by a character that is not an alphabetic character (for example, a space character) or by a beginning-of-line position.
A simple sample file, BoundaryTest.txt, is shown here:
ABC DEF GHI
GHI ABC DEF
ABC DEF GHI
CAB CBA AAA
The problem definition is as follows:
Match an uppercase A when it occurs immediately following a word boundary.
In other words, match an uppercase A when it is preceded by a nonword character or by a start-of-string or start-of-line position.
Try It Out Matching a Beginning-of-Word Word Boundary
1. Open OpenOffice.org Writer, and open the file BoundaryTest.txt.
2. Open the Find & Replace dialog box using the Ctrl+F keyboard shortcut, and check the Regular Expressions and Match Case check boxes.
3. Enter the pattern \<A in the Search For text box.
4. Click the Find All button, and inspect the results, as shown in Figure 6-16.
Figure 6-16
How It Works
On the first line, the A of ABC follows the start-of-text position, so there is a match.
On the second line, the A of ABC follows a space character (which is a nonword character), so there is a match.
On the third line, the A of ABC follows a start-of-line position, so there is a match for the pattern \<A.
On the final line, the A of CAB has an alphabetic character before it, so the pattern \<A does not match. The A of CBA is followed by a nonword character but is preceded by an alphabetic character, so the pattern \<A does not match.
The first A of AAA is preceded by a nonword character, so the pattern \<A matches. However, the second and third A of AAA are each preceded by an alphabetic character and do not match.
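Not every regular expression implementation supports the \< metacharacter. Python's re module, for example, does not, so the following sketch substitutes the \b word-boundary metacharacter. That substitution is an assumption made for illustration: because A is itself a word character, \bA can match only where a word begins, which gives the same results on BoundaryTest.txt as \<A.

```python
import re

lines = ["ABC DEF GHI", "GHI ABC DEF", "ABC DEF GHI", "CAB CBA AAA"]

# Python's re has no \< metacharacter, so this sketch substitutes \b.
# Because A is a word character, \bA can match only at the start of a word.
start_of_word_a = re.compile(r"\bA")

# For each line, record the column at which each match begins.
columns = [[m.start() for m in start_of_word_a.finditer(line)] for line in lines]
print(columns)
```

Each line yields exactly one match: the A of ABC on the first three lines, and the first A of AAA on the last.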
The \> Syntax
The \> metacharacter signifies a word boundary that occurs at the end of a sequence of word characters. In other words, it matches a word boundary that occurs at the end of a word.
The test file, EndBoundary.txt, is shown here:
Theodore said “This is a lathe
I shaved today and my new shaving cream made a good lather
A lathe is a tool for turning wood or metal
The Thespian Theatre is something I am loathe to attend
The quick brown fox jumped over the lazy dog
The task is to match the sequence of characters the when it occurs before a word boundary at the end of a word.
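The task just stated can be sketched in Python as well. As with \<, Python's re module has no \> metacharacter, so this illustration substitutes \b after the literal characters; since e is a word character, the\b can match only where a word ends. The re.IGNORECASE flag makes the search case-insensitive, so that The at the start of a sentence is found too.

```python
import re

lines = [
    "Theodore said “This is a lathe",
    "I shaved today and my new shaving cream made a good lather",
    "A lathe is a tool for turning wood or metal",
    "The Thespian Theatre is something I am loathe to attend",
    "The quick brown fox jumped over the lazy dog",
]

# Python's re has no \> metacharacter, so this sketch substitutes \b.
end_of_word_the = re.compile(r"the\b", re.IGNORECASE)

# Count the matches on each line of the test file.
counts = [len(end_of_word_the.findall(line)) for line in lines]
print(counts)
```

Words such as Theodore, Thespian, Theatre, and lather are skipped because another word character follows the, while lathe, loathe, and the standalone word the all end at a word boundary.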
Try It Out The \> Metacharacter
1. Open OpenOffice.org Writer, and open the file EndBoundary.txt.
2. Open the Find & Replace dialog box using the Ctrl+F keyboard shortcut, and check the Regular Expressions check box, but do not check the Match Case check box, because you want a case-insensitive search on this occasion.
3. Enter the pattern the\> in the Search For text box.
4. Click the Find All button, and inspect the results, as shown in Figure 6-17.