Beginning Regular Expressions (2005), Part 3



However, in PowerGrep, the regular expression pattern [t-r]ight won’t compile and produces the error shown in Figure 5-14.

Figure 5-14

There is, typically, no advantage in attempting to use reverse ranges in character classes, and I suggest that you avoid using these.

A Potential Range Trap

Suppose that you want to allow for different separators in dates occurring in a document or set of documents. Among the issues this problem throws up is a possible trap in expressing character ranges.

As a first test document, we will use Dates.txt, shown here:

2004-12-31
2001/09/11
2003.11.19
2002/04/29
2000/10/19
2005/08/28
2006/09/18

As you can see, in this file the dates are in YYYY/MM/DD format, but sometimes the dates use the hyphen as a separator, sometimes the forward slash, and sometimes the period. Your task is to select all occurrences of sequences of characters that represent dates (assume for this example that dates are expressed only using digits and separators and are not expressed using names of months, for example).

So if you wanted to select all dates, whether they use hyphens, forward slashes, or periods as separators, you might try a regular expression pattern like this:

(20|19)[0-9]{2}[.-/][01][0-9][.-/][0123][0-9]

In the character class [.-/], which you attempt to use to match the separator, the sequence of characters (period followed by hyphen followed by forward slash) is interpreted as the range from the period to the forward slash. However, as you can see in the top row of Figure 5-15, the hyphen is U+002D, and the period (U+002E) is the character immediately before the forward slash (U+002F). So, undesirably, the pattern .-/ specifies a range that contains only the period and forward-slash characters.

Figure 5-15

Characters can be expressed using Unicode numeric references. The period is U+002E; uppercase A is U+0041. The Windows Character Map shows this syntax for characters if you hover the mouse over characters of interest.

To use the hyphen without creating a range, the hyphen should be the first character in the character class:

[-./]

This gives a pattern that will match each of the sample dates in the file Dates.txt:

(20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9]
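The range trap can also be reproduced outside PowerGrep. The following is a minimal sketch in Java, assuming the java.util.regex engine (which is not the tool used in this section, so details may differ from PowerGrep); the class name DateSeparatorDemo and the helper findAll are illustrative names, not part of the chapter's download files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateSeparatorDemo {
    // Returns every substring of text that the pattern matches, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // A few of the dates from Dates.txt, one per line.
        String dates = "2004-12-31\n2001/09/11\n2003.11.19\n2002/04/29";
        // [.-/] is read as the range U+002E..U+002F, so the hyphen is not in the class.
        String trap = "(20|19)[0-9]{2}[.-/][01][0-9][.-/][0123][0-9]";
        // Putting the hyphen first makes it a literal member of the class.
        String fixed = "(20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9]";
        System.out.println("trap:  " + findAll(trap, dates));
        System.out.println("fixed: " + findAll(fixed, dates));
    }
}
```

With the first pattern the hyphen-separated date is skipped; with the second, every date in the sample string is returned.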

Try It Out Matching Dates

1. Open PowerGrep, and enter the regular expression pattern (20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9] in the Search text box.

2. Enter C:\BRegExp\Ch05 in the Folder text box, assuming that you have saved the Chapter 5 files from the download in that directory.

3. Enter Dates.txt in the File Mask text box.

4. Click the Search button, and inspect the results shown in Figure 5-16. Notice particularly that the first match, 2004-12-31, includes a hyphen, confirming that the regular expression pattern works as desired.

The next component of the pattern, [01], matches the numeric digits 0 or 1, because months always have 0 or 1 as the first digit in this date format. Similarly, the next component, the character class [0-9], matches any number from 0 through 9. This would allow numbers for the month such as 14 or 18, which are obviously undesirable. One of the exercises at the end of this chapter will ask you to provide a more specific pattern that would allow only values from 01 to 12 inclusive.

Next, the character class pattern [-./] matches a single character that is a hyphen, a period, or a forward slash.

Finally, the pattern [0123][0-9] matches days of the month beginning with 0, 1, 2, or 3. As written, the pattern would allow values for the day of the month such as 00, 34, or 38. A later exercise will ask you to create a more specific pattern to constrain values to 01 through 31.
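To see how loose the month and day components are, the pattern can be tried against a few impossible dates. This is a small sketch using Java's java.util.regex rather than PowerGrep; the class name LooseDateDemo and the method looksLikeDate are hypothetical names chosen for the illustration.

```java
import java.util.regex.Pattern;

class LooseDateDemo {
    static final Pattern DATE =
        Pattern.compile("(20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9]");

    // True when the whole input has the shape the pattern describes.
    static boolean looksLikeDate(String s) {
        return DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDate("2004-12-31")); // a real date
        System.out.println(looksLikeDate("2004-14-38")); // month 14, day 38
        System.out.println(looksLikeDate("2004-00-00")); // month 00, day 00
        System.out.println(looksLikeDate("2004-20-01")); // first month digit is 2
    }
}
```

The first three inputs are all accepted, which is exactly the weakness the exercises ask you to remove; only the last one is rejected, because [01] cannot match the digit 2.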

Finding HTML Heading Elements

One potential use for character classes is in finding HTML/XHTML heading elements. As you probably know, HTML and XHTML 1.0 have six heading elements: h1, h2, h3, h4, h5, and h6. In XHTML the h must be lowercase. In HTML it is permitted to be h or H.

First, assume that all the elements are written using a lowercase h. So it would be possible to match the start tag of all six elements, assuming that there are no attributes, using a fairly cumbersome regular expression with parentheses:

<(h1|h2|h3|h4|h5|h6)>

In this case the < character is the literal left angled bracket, which is the first character in the start tag. Then there is a choice of six two-character sequences representing the element type of each HTML/XHTML heading element. Finally, a > is the final literal character of the start tag.

However, because there is a sequence of numbers from 1 to 6, you can use a character class to match the same start tags, either by listing each number literally:

<h[123456]>

or by using a range in the character class:

<h[1-6]>

The sample file, HTMLHeaders.txt, is shown here:

<h1>Some sample header text.</h1>

<h6>Some header text.</h6>

<h2>Some fairly meaningless text.</h2>

There is an example of each of the six headers


Try It Out Matching HTML Headers

1. Open PowerGrep, and enter the regular expression pattern <h[1-6]> in the Search: text box.

2. Enter C:\BRegExp\Ch05 in the Folder text box, assuming that you have saved the Chapter 5 files from the download in that directory.

3. Enter HTMLHeaders.txt in the File Mask text box.

4. Click the Search button, and inspect the results, as shown in Figure 5-17.
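The three equivalent heading patterns can be compared side by side. The sketch below assumes Java's java.util.regex engine instead of PowerGrep; HeadingTagDemo and countMatches are illustrative names, and the test string reuses the three sample headers shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingTagDemo {
    // Counts how many times the pattern matches in the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<h1>Some sample header text.</h1>\n"
                    + "<h6>Some header text.</h6>\n"
                    + "<h2>Some fairly meaningless text.</h2>";
        // Alternation, explicit list, and range all describe the same start tags.
        System.out.println(countMatches("<(h1|h2|h3|h4|h5|h6)>", html));
        System.out.println(countMatches("<h[123456]>", html));
        System.out.println(countMatches("<h[1-6]>", html));
    }
}
```

Each pattern finds the same three start tags. End tags such as </h1> are not counted, because the character after < is a slash rather than h.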

If the ^ metacharacter occurs in any position inside square brackets other than the character that immediately follows the left square bracket, the ^ metacharacter has its literal meaning; that is, it matches the ^ character.

A test file, Carets.txt, is shown here:

14^2 expresses the idea of 14 to the power 2

The ^ character is called a caret

The _ character is called an underscore or underline character

3^2 = 9

Eating ^s helps you see in the dark At least that’s what I think he said

The problem definition can be expressed as follows:

Match any occurrence of the following characters: the underscore, the caret, or the numeric digit 3

The character class to satisfy that problem definition is as follows:

[_^3]

Try It Out Using the ^ Inside a Character Class

This example matches the three characters mentioned in the preceding problem definition:

1. Open OpenOffice.org Writer, and open the test file Carets.txt

2. Use the Ctrl+F keyboard shortcut to open the Find & Replace dialog box

3. Check the Regular Expressions and Match Case check boxes, and enter the pattern [_^3] in the Search For text box.

4. Click the Find All button, and inspect the results, as shown in Figure 5-18.

5. Modify the regular expression pattern so that it reads [^_3].

6. Click the Find All button, and compare the results shown in Figure 5-19 with the previous results.

How It Works

When the pattern is [_^3], the meaning is simply a character class that matches three characters: the underscore, the caret, and the numeric digit 3.

When the ^ immediately follows the left square bracket, [, that creates a negated character class, which in this case has the meaning “Match any character except an underscore or the numeric digit 3.”
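The difference between the two positions of the caret can be checked with a short program. This is a sketch in Java (java.util.regex), not OpenOffice.org Writer, so it only illustrates the general rule; CaretClassDemo and findAll are hypothetical names, and the test string is the "3^2 = 9" line from Carets.txt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaretClassDemo {
    // Returns every substring of text that the pattern matches, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "3^2 = 9";
        // Caret not in first position: a literal member of the class.
        System.out.println(findAll("[_^3]", line));
        // Caret in first position: negates the class, so everything
        // except the underscore and the digit 3 is matched.
        System.out.println(findAll("[^_3]", line));
    }
}
```

The first pattern picks out only the 3 and the caret; the second matches every other character on the line, including the caret and the spaces.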


Figure 5-18

How to Use the - Metacharacter

You have already seen how the hyphen can be used to indicate a range inside a character class. The question therefore arises as to how you can specify a literal hyphen inside a character class.

The safest way is to use the hyphen as the first character after the left square bracket. In some tools, such as the Komodo Regular Expressions Toolkit, you can also use the hyphen as the character immediately before the right square bracket to match a hyphen. In OpenOffice.org Writer, for example, that doesn’t work.
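Because tools differ on this point, it is worth testing the hyphen's position in whichever engine you use. The sketch below does so for Java's java.util.regex, which is an assumption of this example and not one of the tools named above; HyphenPlacementDemo and accepts are illustrative names.

```java
import java.util.regex.Pattern;

class HyphenPlacementDemo {
    // True when the single-character class accepts the given character.
    static boolean accepts(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        String[] classes = {
            "[-./]",    // hyphen first
            "[./-]",    // hyphen last
            "[.\\-/]"   // hyphen escaped with a backslash
        };
        for (String cls : classes) {
            System.out.println(cls + " accepts '-': " + accepts(cls, '-'));
        }
        // Unescaped in the middle, the hyphen forms the range U+002E..U+002F instead.
        System.out.println("[.-/] accepts '-': " + accepts("[.-/]", '-'));
    }
}
```

In Java all three placements in the array treat the hyphen as a literal, while the unescaped middle position does not. Other engines may behave differently, which is why the first-position form remains the safest choice.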


Figure 5-19

Negated Character Classes

Negated character classes always attempt to match a character. So the negated character class [^A-F] means “Match a character that is not in the range uppercase A through F.”

Combining Positive and Negative Character Classes

Some languages, such as Java, allow you to combine positive and negative character classes.

The following example shows how combined character classes can be used. The problem definition is as follows:

Match characters A and E through Z

An alternative way to express that notion is as follows:

Match characters A through Z but not B through D

You can express that in Java by combining character classes, as follows:

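import java.util.regex.*;

public class CombinedClass2{
  public static void main(String args[]) throws Exception{
    String TestString = args[0];
    String regex = "[A-Z&&[^B-D]]";
    // The remaining statements are walked through in the explanation that follows.
  }
}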

Try It Out Combined Character Classes

These instructions assume that you have Java 1.4 correctly installed and configured. This example demonstrates how to use combined character classes in Java:

1. Open a command prompt window, and at the command line, type javac CombinedClass2.java to compile the source code.

2. Type java CombinedClass2 "A C E G" to run the program and supply a test string.

A regular expression is assigned to the variable regex:

String regex = "[A-Z&&[^B-D]]";

The regular expression is the combined character class described earlier.

The compile() method of the Pattern class is executed with the regex variable as its argument:

Pattern p = Pattern.compile(regex);

Next, the matcher() method of the Pattern object, p, is executed with the TestString variable as its argument:

Matcher m = p.matcher(TestString);

A new variable, match, is assigned the value null:

String match = null;

The simple output shows the test string that was supplied on the command line; the regular expression pattern that was used; and, if there are one or more matches, a list of each match or, if there was no match, a message indicating that no matches were found:

System.out.println("INPUT: " + TestString);

System.out.println("REGEX: " + regex);

while (m.find()){
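Because the listing above is only shown in fragments, here is a self-contained sketch that follows the same steps: compile the combined class, create a matcher, and print each match. It is a reconstruction under stated assumptions, not the book's CombinedClass2 file: the class name CombinedClassSketch, the helper findMatches, the default test string, and the wording of the output labels for matches are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CombinedClassSketch {
    static final String REGEX = "[A-Z&&[^B-D]]";

    // Collects every character matched by A-Z intersected with "not B-D".
    static List<String> findMatches(String testString) {
        Pattern p = Pattern.compile(REGEX);
        Matcher m = p.matcher(testString);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Fall back to the sample string when no argument is supplied (an assumption).
        String testString = args.length > 0 ? args[0] : "A C E G";
        System.out.println("INPUT: " + testString);
        System.out.println("REGEX: " + REGEX);
        List<String> matches = findMatches(testString);
        if (matches.isEmpty()) {
            System.out.println("There were no matches.");
        } else {
            for (String match : matches) {
                System.out.println("MATCH: " + match);
            }
        }
    }
}
```

For the test string "A C E G", the matches are A, E, and G; the C is excluded by the nested negated class [^B-D].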

POSIX Character Classes

Some regular expression implementations support a very different character class notation: the POSIX character class notation. The POSIX approach uses a naming convention for a number of potentially useful character classes instead of specifying character classes in the way you saw earlier in this chapter. For example, instead of the character class [A-Za-z0-9], where the characters are listed, the POSIX character class uses [:alnum:], where alnum is an abbreviation for alphanumeric. Personally, I prefer the syntax used earlier in this chapter. However, because you may see code that uses POSIX character classes, this section gives brief information about them.

As an example, the [:alnum:] character class is shown.

The POSIX syntax is dependent on locale. The syntax described in this section relates to English-language locales.

The [:alnum:] Character Class

The [:alnum:] character class varies in how it is implemented in various tools. Broadly speaking, the [:alnum:] class is equivalent to the following character class:

[A-Za-z0-9]

However, there are different interpretations of [:alnum:]
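As one illustration of those differing interpretations, Java does not use the [:alnum:] spelling at all; its java.util.regex package writes the POSIX class as \p{Alnum}, which by default covers only US-ASCII letters and digits. The sketch below compares it with the explicit class; AlnumDemo and findAll are hypothetical names, and the test string is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlnumDemo {
    // Returns every substring of text that the pattern matches, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "A_1 b-2";
        // Java's spelling of the POSIX alphanumeric class.
        System.out.println(findAll("\\p{Alnum}", text));
        // The explicit class it is broadly equivalent to.
        System.out.println(findAll("[A-Za-z0-9]", text));
    }
}
```

Both patterns return the same four characters and skip the underscore, the space, and the hyphen.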


Try It Out The [:alnum:] Class in OpenOffice.org Writer

In OpenOffice.org Writer it is necessary to add a ? quantifier (or other quantifier) to successfully use the [:alnum:] character class:

1. Open OpenOffice.org Writer, and open the sample file AlnumTest.txt.

2. Use the Ctrl+F keyboard shortcut to open the Find & Replace dialog box.

3. Check the Regular Expressions and Match Case check boxes, and enter the pattern [:alnum:]? in the Search For text box.

4. Click the Find All button, and inspect the highlighted text, as shown in Figure 5-21, to identify matches for the pattern [:alnum:]?.

Notice that the underscore character, which occurs twice in the final line of text in the sample file, is not matched by the [:alnum:]? pattern.

Figure 5-21

If Step 4 is replaced by clicking the Find button, assuming that the cursor is at the beginning of the test file, the initial uppercase A will be matched, because that is the first matching character.

How It Works

If the regular expression engine starts at the position immediately before the A of the first line of the test file, the A is tested against the pattern [:alnum:]?. There is a match because uppercase A is an alphabetic character. The matched text is highlighted in reverse video.

When the Find All button is used, after that first successful match the regular expression engine moves to the position between A and B and attempts to match against the following character, B. That matches, and so it, too, is highlighted in reverse video. The regular expression engine moves to the next position and then matches the C, and so on. When the newline character is reached, there is no match against the pattern [:alnum:]?, and the regular expression engine moves on to the position after the newline character and attempts to match the next character.

When the regular expression engine reaches the position before the underscore character and attempts to match that character, there is no match, because the underscore character is neither an alphabetic character nor a numeric digit.

Exercises

1. You have a document that contains American English and British English. State a problem definition to locate occurrences of license (U.S. English) and licence (British English). Specify a regular expression pattern using a character class to find both sequences of characters.

2. The pattern (20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9] was used earlier in this chapter to match dates. As written, this pattern would allow months such as 00, 13, or 19 and allow days such as 00, 32, and 39. Modify the relevant components of the pattern so that only months 01 through 12 and days 01 through 31 are allowed.

String, Line, and Word Boundaries

This chapter looks at metacharacters that match positions before, between, or after characters rather than selecting matching characters. These positional metacharacters complement the metacharacters that were described in Chapter 4, each of which signified characters to be matched.

For example, you will see how to match characters, or sequences of characters, that immediately follow the position at the beginning of a line. In normal English you might, for example, say that you want to match a specified sequence of characters only when they immediately follow the beginning of a line or the beginning of the whole test text. The implication is that you don’t want to match the specified sequence of characters if they occur anywhere else in the text. So using a positional character in this way can significantly change the sequences of characters that match or fail to match.

Equally, you might want to look for whole words rather than sequences of characters, or for sequences of characters when they occur in relation to the beginning or end of a word. Many regular expression implementations have positional metacharacters that allow you to do that.

This chapter provides you with the information needed to make matches based on the position of a sequence of characters.

The term anchor is sometimes used to refer to the metacharacters that match a position rather than a character.

In some documentation (for example, the documentation for .NET regular expression functionality), these same positional metacharacters are termed atomic zero-width assertions.

This chapter looks at how to do the following:

❑ Use the ^ metacharacter, which matches the position at the beginning of a string or a line

❑ Use the $ metacharacter, which matches the position at the end of a string or a line

❑ Use the \< and \> metacharacters to match the beginning and end of a word, respectively

❑ Use the \b metacharacter, which matches a word boundary (which can occur at the beginning of a word or at the end of a word)

String, Line, and Word Boundaries

Metacharacters that allow you to create patterns that match sequences of characters that occur at specific positions can be very useful.

For example, suppose that you wanted to find all lines that begin with the word The. With the techniques you have seen and used in earlier chapters, you can readily create a literal pattern to match the sequence of characters The, but with those techniques you haven’t been able to specify where the sequence of characters occurs in the text, nor whether it is a whole word or forms part of a longer word. The relevant pattern, written as The, would match sequences of characters such as There, Then, and so on at the beginning of a sentence in addition to the word The and would also match parts of personal or business names such as Theodore or Theatre.

Similarly, assuming that you used the pattern The in a case-insensitive mode, you would also (possibly as an undesired side effect) match sequences of characters such as the in the word lathe. At other times, you might want to find a sequence of characters only when they occur at the end of a word (again for example, the the in lathe).

The ^ and $ metacharacters, which are used to specify a position in relation to the beginning and end of a line or string, are discussed and demonstrated first.

The pattern The, when applied to the test text

The Thespian Theatre opens at 19:00

would match the sequence of characters The in the words The, Thespian, and Theatre.

However, the same pattern preceded by the ^ metacharacter

^The

when applied to the same test text would match only the sequence of characters The in the word The, because that sequence of characters occurs immediately after the start of the string.
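The effect of adding the ^ metacharacter can be confirmed by counting matches for both patterns. This is a minimal sketch in Java's java.util.regex, assumed here only for illustration; TheatreDemo and countMatches are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TheatreDemo {
    // Counts how many times the pattern matches in the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "The Thespian Theatre opens at 19:00";
        // Unanchored: matches in The, Thespian, and Theatre.
        System.out.println(countMatches("The", text));
        // Anchored: only the occurrence at the start of the string.
        System.out.println(countMatches("^The", text));
    }
}
```

The unanchored pattern is found three times, the anchored one once.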

Try It Out Theatre Example

Use the very simple test text in the file Theatre.txt:

The Thespian Theatre opens at 19:00

1. Open PowerGrep, and check the Regular Expression check box.

2. Enter the pattern The in the Search text box.

3. Enter C:\BRegExp\Ch06 in the Folder text box.

4. Enter Theatre.txt in the File Mask text box.

5. Click the Search button, and inspect the results in the Results area, as shown in Figure 6-1. Notice that the information in the Results area indicates three matches for the pattern The.

Figure 6-1

6. Edit the regular expression pattern so that it reads ^The.

7. Click the Search button, and inspect the results in the Results area, as shown in Figure 6-2. Notice that there is now only one match, in contrast to the three matches before you edited the regular expression pattern.

The ^ metacharacter, when used outside a character class, does not have the negation meaning that it has when used as the first character inside a character class.

Figure 6-2

How It Works

The regular expression engine starts at the position before the first character in the test file. The first metacharacter in the pattern, the ^ metacharacter, is matched against the regular expression engine’s current position. Because the regular expression engine is at the beginning of the file, the condition specified by the ^ metacharacter is satisfied, so the regular expression engine can proceed to attempt to match the other characters in the regular expression pattern. The next character in the pattern, the literal uppercase T, is matched against the first character in the test file, which is uppercase T. There is a match, so the regular expression engine attempts to match the next character in the pattern, lowercase h, against the second character in the test text, which is also lowercase h. The literal h in the pattern matches the literal h in the test text. Then the regular expression engine attempts to match the literal e in the pattern against the third character in the test text, lowercase e. There is a match. Because all components of the regular expression match, the entire regular expression matches.

If the regular expression attempts a match when the current position is anything other than the position before the first character of the test text, matching fails on that first metacharacter, ^. Therefore, the pattern as a whole cannot match. Matching fails except at the beginning of the test text.

The ^ Metacharacter and Multiline Mode

In the preceding example, the test text is a single line, so you were able to examine the use of the ^ metacharacter without bothering about whether the ^ metacharacter would match the beginning of the test text or the beginning of each line, because the two concepts were the same. However, in several tools and languages, it is possible to modify the behavior of the ^ metacharacter so that it matches the position before the first character of each line or only at the beginning of the first line of the test file.

When using the Komodo Regular Expression Toolkit, for example, the following test text

This

Then

will fail to find a match when the pattern is as follows:

^The


Figure 6-3 shows the failure to match.

Figure 6-3

However, if you check the Multi-Line Mode check box, the sequence of characters The on the second line is highlighted and, in the gray area below, the message Match succeeded: 0 groups is displayed, as you can see in Figure 6-4.

Figure 6-4


When multiline mode is used, the position after a Unicode newline character is treated in the same way as the position that comes at the beginning of the test file. A Unicode newline character matches any of the characters or character combinations that can be used to express the notion of a newline.

Not all programming languages support multiline mode. How individual programming languages treat this issue is discussed and, where appropriate, demonstrated in later chapters that deal with individual programming languages.
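As one example of a language that does support it, Java exposes multiline mode through the Pattern.MULTILINE flag. The sketch below repeats the This/Then test from the Komodo example with and without the flag; MultilineCaretDemo and firstMatchStart are illustrative names, and Java's behavior should not be assumed to match Komodo's in every detail.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineCaretDemo {
    // Returns the index where the first match starts, or -1 when there is no match.
    static int firstMatchStart(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String text = "This\nThen";
        // Default mode: ^ matches only at the start of the whole input.
        System.out.println(firstMatchStart("^The", 0, text));
        // Multiline mode: ^ also matches just after a line terminator.
        System.out.println(firstMatchStart("^The", Pattern.MULTILINE, text));
    }
}
```

Without the flag there is no match; with it, the match begins at the first character of the second line.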

Try It Out The ^ Metacharacter and Multiline Mode

This exercise uses the test file TheatreMultiline.txt:

The Thespian Theatre opens at 19:00

Then theatrical people enter the building

They greatly enjoy the performance

The interval is the time for liquid refreshment

Notice that each line begins with the sequence of characters The.

Some tools, such as PowerGrep, are in multiline mode by default, as shown here.

1. Open PowerGrep, and check the Regular Expressions check box.

2. Enter the regular expression pattern ^The in the Search text box.

3. Enter C:\BRegExp\Ch06 in the Folder text box. Adjust this if you chose to put the download files in a different folder.

4. Enter TheatreMultiline.txt in the File Mask text box.

5. Click the Search button, and inspect the results in the Results area, as shown in Figure 6-5. Notice the character sequence The at the beginning of each line is highlighted as a match, indicating the default behavior of multiline mode.

Figure 6-5


The $ Metacharacter

The ^ metacharacter allows you to be specific about where a matching sequence of characters occurs at the beginning of a file or the beginning of a line. The $ metacharacter provides complementary functionality in that it specifies matches in a sequence of characters that immediately precede the end of a line or the end of a file.

the$

Try It Out The $ Metacharacter

This example demonstrates the use of the pattern the$:

1. Open PowerGrep, and check the Regular Expressions check box.

2. Enter the pattern the$ in the Search text box.

3. Enter C:\BRegExp\Ch06 in the Folder text box.

4. Enter Lathe.txt in the File Mask text box.

5. Click the Search button, and inspect the results displayed in the Results area, as shown in Figure 6-6.

Figure 6-6

Notice that there is only one match and that the sequence of characters The at the beginning of the line does not match, nor does the word the, which precedes the word lathe.

6. Delete the $ metacharacter in the Search text box.

7. Click the Search button, and inspect the revised results in the Results area.

Notice that with the $ metacharacter deleted the pattern now has three matches (not illustrated). The first is the The at the beginning of the test text. That matches because the default behavior in PowerGrep is a case-insensitive match. The second is the word the before the word lathe. The third is the character sequence the, which is contained in the word lathe.

How It Works

The default behavior of PowerGrep is case-insensitive matching. When the regular expression engine starts to match after Step 6, it starts at the position before the initial The. The regular expression engine attempts to match The and succeeds. Finally, the regular expression engine attempts to match the $ metacharacter against the position that follows the lowercase e in the test text. That position is not the end of the test string; therefore, the match fails. Because one component of the pattern fails to match, the whole pattern fails to match.

Attempted matching progresses through the test text. The first three characters of the pattern match when the regular expression engine is at the position immediately before the word the. However, as described earlier, the $ metacharacter fails to match; therefore, there is no match for the whole pattern.

However, when the regular expression engine reaches the position after the a of lathe and attempts to match, there is a match. The first character of the pattern, lowercase t, matches the next character, the lowercase t of lathe. The second character of the pattern, lowercase h, matches the h of lathe. The third character of the pattern, lowercase e, matches the lowercase e of lathe. The $ metacharacter of the pattern does match, because the e of lathe is the final character of the test string. Because all components of the pattern match, the whole pattern matches, and the character sequence the of lathe is highlighted as a match in Figure 6-6.
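The walk-through above can be reproduced in code by recording where each match starts. The sketch uses Java's java.util.regex with the CASE_INSENSITIVE flag to mimic PowerGrep's default; because the contents of Lathe.txt are not reproduced here, the test string below is a made-up stand-in with the same three occurrences, and DollarAnchorDemo and matchStarts are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DollarAnchorDemo {
    // Returns the start index of every match of the pattern in the text.
    static List<Integer> matchStarts(String regex, int flags, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for Lathe.txt: "The" at the start,
        // the word "the", and "the" inside "lathe" at the very end.
        String text = "The tool is the lathe";
        int flags = Pattern.CASE_INSENSITIVE;
        System.out.println(matchStarts("the", flags, text));
        System.out.println(matchStarts("the$", flags, text));
    }
}
```

Without the $ metacharacter there are three matches; with it, only the occurrence that ends the string survives.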

The $ Metacharacter in Multiline Mode

Like the ^ metacharacter, the $ metacharacter can have its behavior modified when it is used in multiline mode. However, not all tools or languages support multiline mode for the $ metacharacter.

Tools or languages that support the $ metacharacter in multiline mode use the $ metacharacter to match the position immediately before a Unicode newline character. Some also match the position immediately before the end of the test string, but not all do, as you will see later.

The sample file, ArtMultiple.txt, is shown here:

A part for his car

Wisdom which he wants to impart

Leonardo da Vinci was a star of medieval art

At the start of the race there was a false start

Notice that to make the example a test of the $ metacharacter, the period that might be expected at the end of each sentence has been omitted.
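The behavior of $ with and without multiline mode can be sketched against this sample text. The example assumes Java's java.util.regex, where multiline mode is the Pattern.MULTILINE flag; other tools may treat the end of the test string differently, as noted above. ArtMultilineDemo and countMatches are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArtMultilineDemo {
    static final String TEXT =
          "A part for his car\n"
        + "Wisdom which he wants to impart\n"
        + "Leonardo da Vinci was a star of medieval art\n"
        + "At the start of the race there was a false start";

    // Counts how many times the pattern matches in the text.
    static int countMatches(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Every occurrence of art, wherever it appears.
        System.out.println(countMatches("art", 0, TEXT));
        // Default mode: $ matches only at the end of the whole input.
        System.out.println(countMatches("art$", 0, TEXT));
        // Multiline mode: $ also matches just before each line terminator.
        System.out.println(countMatches("art$", Pattern.MULTILINE, TEXT));
    }
}
```

The unanchored pattern matches five times, art$ in default mode matches once (the final start), and art$ in multiline mode matches three times (impart, art, and the final start).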


Try It Out The $ Metacharacter in Multiline Mode

This example demonstrates the use of the $ metacharacter with multiline mode:

1. Open PowerGrep, and check the Regular Expressions check box.

2. Enter the pattern art in the Search text box.

3. Enter the text C:\BRegExp\Ch06 in the Folder text box.

4. Enter the text ArtMultiple.txt in the File Mask text box.

5. Click the Search button, and inspect the results in the Results area, as shown in Figure 6-7. Notice that occurrences of the sequence of characters art are matched when they occur at the end of a line and at other positions: in this example, part in Line 1 and the first occurrence of start in Line 7.

Figure 6-7

6. Edit the regular expression pattern to add the $ metacharacter at the end, giving art$.

7. Click the Search button, and inspect the results in the Results area, as shown in Figure 6-8. Notice that the matches for the pattern art that were previously present in the words part in Line 1 and the first occurrence of start in Line 7 are no longer present, because they do not occur at the end of a line. The $ metacharacter means that matches must occur at the end of a line.

When an attempt is made to match art in part in the first line, the first three characters of the regular expression pattern match; however, the final $ metacharacter of the pattern art$ fails to match. Because a component of the pattern has failed to match, the entire pattern fails to match.

When the regular expression engine has reached a position immediately before the a of impart, it can match the first three characters of the pattern art$ successfully against, respectively, the a, r, and t of impart. Finally, an attempt is made to match the $ metacharacter against the position immediately following the t of impart. Because that position immediately precedes a Unicode newline character (that is, it is the final position on that line), there is a match. Because all the components of the pattern match, the entire pattern matches.

When the regular expression engine has reached a position immediately before the a of the second start on the final line, it can match the first three characters of the pattern art$ successfully against, respectively, the a, r, and t of start. Finally, an attempt is made to match the $ metacharacter against the position immediately following the t of start. Because that position immediately precedes the end of the test string (that is, it is the final position of the test file), there is a match. Because all the components of the pattern match, the entire pattern matches.

Using the ^ and $ Metacharacters Together

Using the ^ and $ metacharacters together can be useful to identify lines that consist entirely of desired characters. This can be very useful when validating user input, for example.

The sample text, ABCPartNumbers.txt, is shown here:

ABC123
There is a part number ABC123
ABC234
A purchase order for 400 of ABC345 was received yesterday
ABC789

Notice that some lines consist only of a part number, whereas other lines include the part number as part of some surrounding text.

The intention is to match lines that consist only of a part number. The problem definition is as follows:

Match a beginning of line position, followed by the literal sequence of characters A, B, and C, followed by three numeric digits, followed by a position that is either the end-of-line position or an end-of-string position.

Try It Out Matching Part Numbers

This example demonstrates using the ^ and $ metacharacters in the same pattern:

1. Open OpenOffice.org Writer, and open the test file ABCPartNumbers.txt

2. Open the Find & Replace dialog box, using the Ctrl+F keyboard shortcut, and check the Regular Expressions and Match Case check boxes.

3. Enter the pattern ^ABC[0-9]{3}$ in the Search For text box.

4. Click the Find All button, and inspect the highlighted text, as shown in Figure 6-9. Notice how three occurrences of a sequence of characters representing a part number are highlighted as matches, while two occurrences of a part number are not highlighted because they are not matches.
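The same selection of whole-line part numbers can be sketched in code. The example below assumes Java's java.util.regex with Pattern.MULTILINE so that ^ and $ apply to each line, which is an assumption about how to mirror Writer's line-based behavior; PartNumberDemo and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartNumberDemo {
    static final String TEXT =
          "ABC123\n"
        + "There is a part number ABC123\n"
        + "ABC234\n"
        + "A purchase order for 400 of ABC345 was received yesterday\n"
        + "ABC789";

    // Returns every match of the pattern in the text, in order.
    static List<String> findAll(String regex, int flags, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Anchored at both ends: only lines that are nothing but a part number.
        System.out.println(findAll("^ABC[0-9]{3}$", Pattern.MULTILINE, TEXT));
        // Unanchored: part numbers embedded in sentences are found as well.
        System.out.println(findAll("ABC[0-9]{3}", 0, TEXT));
    }
}
```

The anchored pattern returns the three stand-alone part numbers; the unanchored pattern also returns the two that sit inside longer lines.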


Figure 6-9

How It Works

The regular expression engine begins the matching process at the start of the test file. It attempts to match the ^ metacharacter against the current position. There is a match. It next attempts to match the literal character A in the pattern against the first character in the line, which is uppercase A. There is a match. The matching process is repeated successfully for the literal characters B and C. Then the regular expression engine attempts to match the pattern [0-9]{3}. It attempts to match the character class [0-9] against the character 1 in the test text. That matches. It then proceeds to match the character class [0-9] a second time, this time against the character 2. That also matches. It next proceeds to match the character class [0-9] for a third time, as indicated by the {3} quantifier, against the character 3. That too matches. Finally, it attempts to match the $ metacharacter against the position following the character 3. That matches because it immediately precedes a Unicode newline character. Each component of the pattern matches; therefore, the entire pattern matches.

At the beginning of the second line, the regular expression engine successfully matches the ^ metacharacter. It next attempts to match the literal character A in the pattern against the first character on the line, an uppercase T. The attempt at matching fails. Any subsequent attempt to match on that line fails when the attempt is made to match the ^ metacharacter because the position is not at the beginning of the line.

Matching Blank Lines

One of the potential uses of the ^ and $ metacharacters together is to match blank lines. The following pattern should match a blank line, because the ^ metacharacter signifies the beginning of the line and the $ metacharacter signifies the position immediately before either a Unicode newline character or the end of the test string.

^$

However, not all tools support this pattern.

The test file, WithBlankLines.txt, is shown here:

Line 1

Line 3 which follows a blank line

Line 5 which follows a second blank line

Line 7 which follows a third blank line

After Line 7, there are two further blank lines to end the test file.
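As a rough illustration of the ^$ pattern outside a GUI tool, the sketch below uses Python's re module. The string is an assumed reconstruction of WithBlankLines.txt from the description above, and the variable names are only illustrative. Each line is tested separately, so ^ and $ refer to the start and end of that line.

```python
import re

# WithBlankLines.txt as described: blank lines 2, 4, and 6, plus two at the end.
text = (
    "Line 1\n"
    "\n"
    "Line 3 which follows a blank line\n"
    "\n"
    "Line 5 which follows a second blank line\n"
    "\n"
    "Line 7 which follows a third blank line\n"
    "\n"
    "\n"
)

# Test each line on its own; only an empty line satisfies ^ and $ together.
blank_line_numbers = [
    number
    for number, line in enumerate(text.splitlines(), start=1)
    if re.search(r"^$", line)
]
print(blank_line_numbers)  # [2, 4, 6, 8, 9]

# Deleting blank lines: remove any newline that sits at the start of a line.
without_blanks = re.sub(r"^\n", "", text, flags=re.MULTILINE)
print(without_blanks.count("\n"))  # 4
```

Other tools may treat trailing blank lines differently, so the line numbers reported here should be read as the behavior of this sketch only.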

Try It Out Replacing Blank Lines

1. Open OpenOffice.org Writer, and open the test file WithBlankLines.txt.

2. Open the Find & Replace dialog box using the Ctrl+F keyboard shortcut, and check the Regular Expressions and Match Case check boxes.

3. Enter the pattern ^$ in the Search For text box.

4. Click the Find All button, and inspect the results, as shown in Figure 6-10.


Figure 6-10

Each blank line, except the last two, is highlighted as a match. If you try to scroll down, you will find that OpenOffice.org Writer has lost one of the blank lines that is present if you open the WithBlankLines.txt file in Notepad. If you manually reenter one of the blank lines that OpenOffice.org Writer strips out, an additional blank line will match. A blank line at the end of a file seems not to match in OpenOffice.org Writer.

5. Click the Replace All button, and inspect the results, as shown in Figure 6-11. Notice that the three previously highlighted blank lines have been deleted.


Figure 6-11

How It Works

The second line of the original test file is a blank line. When the regular expression engine is at the position at the beginning of that blank line, matching is attempted against the ^ metacharacter. There is a match. Without moving its position, the regular expression engine then attempts to match the $ metacharacter against the same position. Because that position immediately precedes a Unicode newline character, there is a match for the $ metacharacter, too. Therefore, the entire pattern matches. In OpenOffice.org Writer, the matching of the blank line leads to the entire width of the text area on that line being highlighted.

When the regular expression engine is at the beginning of the third line of the original file, it first attempts to match the ^ metacharacter. That matches. It next attempts to match the $ metacharacter against the same position. Because the position is followed by the character uppercase L, it is not the position that precedes a Unicode newline character. Therefore, the attempt at matching fails.


Working with Dollar Amounts

Because the $ metacharacter in a regular expression pattern indicates the end-of-line (or end-of-string) position, you cannot use that metacharacter to match the dollar currency symbol in a document. To match the dollar sign in a string, you must use the \$ escape sequence.

The sample file, DollarUsage.txt, is used to explore how to use the \$ escape sequence:

The pound, £, and US dollar, $, are major global currencies

As you can see, the $ sign may occur in situations other than simply being at the beginning of a sequence of numeric digits. For example, the first line indicates how the dollar sign might appear in a piece of narrative text. The third line, 99,00$, indicates how a dollar amount might be written in a non-English locale or, perhaps, how it might be written by someone who is not a native speaker of English.

Matching a literal $ sign is straightforward; you can simply use the following regular expression pattern, which will match all occurrences of the dollar sign in text:

\$

Figure 6-12 shows the application of that simple pattern in PowerGrep.
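The difference between the $ anchor and the \$ escape sequence can also be seen in a few lines of Python. This is only a sketch using the first line of DollarUsage.txt quoted above; the variable names are illustrative.

```python
import re

line = "The pound, £, and US dollar, $, are major global currencies"

# Unescaped $ is an anchor: it matches the empty string at the end of the line.
anchor = re.search(r"$", line)
print(anchor.start() == len(line), repr(anchor.group()))  # True ''

# Escaped \$ matches the literal dollar character itself.
literal = re.search(r"\$", line)
print(literal.group())  # $
```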

Suppose that you want to detect a dollar sign only when the dollar sign is followed by numeric digits. Even something seemingly this simple may not be entirely straightforward. For example, the third-to-last line has a space character following the $ sign, which you need to take into account if you want to match all relevant occurrences of the $ sign:

$ 0.99


Depending on the regular expression implementation, you can express a pattern for numeric digits in several ways: \d, [0-9], and [:digit:].

First, try to match situations where a dollar sign is followed by one or more numeric digits, followed by a period, followed by zero or more numeric digits. The following pattern expresses that:

\$[0-9]+\.[0-9]*

The \$ matches a literal dollar sign. The character class [0-9] matches a numeric digit, and the + quantifier indicates that there is at least one numeric digit. Following that is a literal period character, indicated by the escape sequence \. (a backslash followed by a period). Finally, the pattern [0-9]* indicates that zero or more numeric digits can occur after the period.

Figure 6-13 shows this pattern applied against DollarUsage.txt when using OpenOffice.org Writer. Notice that only three of the examples in DollarUsage.txt are matched. Can you see, for example, why the examples $1,000,000 and $1000 have not been matched?


\$ *[0-9]+\.?[0-9]*

By using the * quantifier after the space character in the preceding pattern, you can allow for situations where there is no space, a single space, or more than a single space character after the dollar sign.
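To compare the two dollar-amount patterns side by side, here is a small Python sketch. The sample strings $1000, $1,000,000, $ 0.99, and 99,00$ are quoted in the text; $9.99 is a hypothetical addition, and the names strict, relaxed, and first_match are only illustrative.

```python
import re

strict = re.compile(r"\$[0-9]+\.[0-9]*")      # requires a period after the digits
relaxed = re.compile(r"\$ *[0-9]+\.?[0-9]*")  # optional spaces, optional period

# "$9.99" is a hypothetical sample; the others are quoted in the text.
samples = ["$9.99", "$1000", "$1,000,000", "$ 0.99", "99,00$"]


def first_match(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    found = pattern.search(text)
    return found.group() if found else None


results = {s: (first_match(strict, s), first_match(relaxed, s)) for s in samples}
for sample, (strict_hit, relaxed_hit) in results.items():
    print(f"{sample!r}: strict={strict_hit!r} relaxed={relaxed_hit!r}")
```

In this sketch the relaxed pattern matches only the $1 portion of $1,000,000, because neither pattern allows a comma; handling thousands separators would need a further refinement.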


Figure 6-14

Revisiting the IP Address Example

In Chapter 5, we spent some time looking at how you could use character classes to match IP addresses, using the following sample file, IPLike.txt:

12.12.12.12
255.255.256.255
12.255.12.255
256.123.256.123
8.234.88.55
196.83.83.191
8.234.88,55
88.173.71.66
241.92.88.103


Now that you have looked at the meaning and use of the ^ and $ metacharacters, you are in a position to take that example to a successful conclusion.

Try It Out Matching IP Addresses

These instructions assume that you have closed OpenOffice.org Writer.

1. Open OpenOffice.org Writer, and open the test file IPLike.txt.

3. Enter the regular expression pattern ^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$ in the Search For text box.

4. Click the Find All button, and inspect the results, as shown in Figure 6-15. Notice that the lines containing a value of 256 are not matched, which is what you wanted.

The regular expression pattern that works is shown here:

^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$

Figure 6-15


How It Works

First, let’s break the regular expression down into its component parts.

The initial ^ metacharacter indicates that there is a match only when matching is being attempted from a position at the beginning of a line.

The following component indicates several options for numeric values, each of which is followed by a literal period in the test text:

((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}

Remember that the escape sequence that matches a literal period is \. (a backslash followed by a period). If you had used an unescaped period in the pattern, the test text would have matched, but so would any alphanumeric character in place of each period. This would have led to undesired matches such as the following, which is clearly not an IP address:

12G255F12H255

Using the . metacharacter would have lost much of the specificity that you obtain by using the \. escape sequence.

The first time the following pattern is processed, it immediately follows the position that indicates the start of a line:

((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}

That means a match succeeds if any of the options inside the nested parentheses are found between the start-of-line position and a literal period character. Given the way the options are constructed, only numeric values from 0 to 255 are matched.

The second time that pattern is matched, you know that it is preceded by a literal period character (because the second attempt at matching follows the first attempt, which you know ends with a literal period).

Because the whole pattern ends in a $ metacharacter, you know that its final component matches a numeric value from 0 to 255 only when that value follows a literal period character (which was the final character to match the third attempt to use the earlier component of the pattern) and when it is followed by a Unicode newline character.
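The breakdown above can be checked with a short Python sketch. The pattern is the one given in the text, assembled from a helper string named octet (an illustrative name); the test lines are those of IPLike.txt. The second pattern, with an unescaped period, is included only to show the loss of specificity described earlier.

```python
import re

# One numeric value from 0 to 255, as constructed in the text.
octet = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"

# Three "value plus literal period" groups, then a final value, anchored per line.
ip_address = re.compile("^(" + octet + r"\.){3}" + octet + "$", re.MULTILINE)

# Same pattern with an unescaped period, which matches any single character.
too_loose = re.compile("^(" + octet + "." + "){3}" + octet + "$", re.MULTILINE)

text = "\n".join([
    "12.12.12.12", "255.255.256.255", "12.255.12.255",
    "256.123.256.123", "8.234.88.55", "196.83.83.191",
    "8.234.88,55", "88.173.71.66", "241.92.88.103",
])

matched = [m.group(0) for m in ip_address.finditer(text)]
print(matched)

# The loose pattern accepts a string that is clearly not an IP address.
print(bool(too_loose.search("12G255F12H255")))   # True
print(bool(ip_address.search("12G255F12H255")))  # False
```

The lines containing 256 and the line containing a comma are absent from the printed list, which is the intended outcome.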


What Is a Word?

The notion of what constitutes a word might seem, at first sight, to be obvious. But if you were asked to say which of the sequences of characters on the following lines were words, what would your answer be? And what criteria would you use to arrive at your opinion?

Clearly, it isn’t realistic to expect a text processor to have knowledge about what is or isn’t a word in English, French, German, or any of a host of other languages. Similarly, you can’t expect a text processor to have knowledge in all technical areas. So you need another technique, a more mechanistic one, to allow identification of word boundaries.

Identifying Word Boundaries

A word boundary can be viewed as two positions: one at the beginning of a sequence of characters that form a word and one at the end of a sequence of characters that form a word.

Depending on which tools or languages you use, there are metacharacters that match a word-boundary position occurring at the beginning of a word, a word-boundary position occurring at the end of a word, or both.

The \< Syntax

The \< metacharacter identifies a word-boundary position occurring at the beginning of a word. That position either is preceded by a character that is not an alphabetic character (for example, a space character) or is a beginning-of-line position.

A simple sample file, BoundaryTest.txt, is shown here:

ABC DEF GHI

GHI ABC DEF

ABC DEF GHI

CAB CBA AAA


The problem definition is as follows:

Match an uppercase A when it occurs immediately following a word boundary.

In other words, match an uppercase A when it is preceded by a nonword character or by a start-of-string or start-of-line position.

Try It Out Matching a Beginning-of-Word Word Boundary

1. Open OpenOffice.org Writer, and open the file BoundaryTest.txt.

3. Enter the pattern \<A in the Search For text box.

4. Click the Find All button, and inspect the results, as shown in Figure 6-16.

Figure 6-16


How It Works

On the first line, the A of ABC follows the start-of-text position, so there is a match.

On the second line, the A of ABC follows a space character (which is a nonword character), so there is a match.

On the third line, the A of ABC follows a start-of-line position, so there is a match for the pattern \<A.

On the final line, the A of CAB has an alphabetic character before it, so the pattern \<A does not match. The A of CBA is followed by a nonword character but is preceded by an alphabetic character, so the pattern \<A does not match.

The first A of AAA is preceded by a nonword character, so the pattern \<A matches. However, the second and third A of AAA are each preceded by an alphabetic character and do not match.
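Not every tool supports the \< syntax. Python's re module, for example, does not; there the \b metacharacter matches a word boundary at either end of a word. Because A is itself a word character, the pattern \bA behaves like \<A for this test file. The sketch below assumes the four lines of BoundaryTest.txt shown earlier and reports each match as a (line number, column) pair, with columns counted from 0.

```python
import re

# \bA: an uppercase A with a word boundary immediately before it.
word_initial_a = re.compile(r"\bA")

lines = ["ABC DEF GHI", "GHI ABC DEF", "ABC DEF GHI", "CAB CBA AAA"]

# Record (line number, column) for every match on every line.
hits = [
    (number, match.start())
    for number, line in enumerate(lines, start=1)
    for match in word_initial_a.finditer(line)
]
print(hits)  # [(1, 0), (2, 4), (3, 0), (4, 8)]
```

The single hit on the final line is the first A of AAA, in line with the explanation above.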

The \> Syntax

The \> metacharacter signifies a word boundary that occurs at the end of a sequence of word characters. In other words, it matches a word boundary that occurs at the end of a word.

The test file, EndBoundary.txt, is shown here:

Theodore said “This is a lathe

I shaved today and my new shaving cream made a good lather

A lathe is a tool for turning wood or metal

The Thespian Theatre is something I am loathe to attend

The quick brown fox jumped over the lazy dog

The task is to match the sequence of characters the when they occur before a word boundary at the end of a word.

Try It Out The \> Metacharacter

1. Open OpenOffice.org Writer, and open the file EndBoundary.txt.

2. Open the Find & Replace dialog box using the Ctrl+F keyboard shortcut, and check the Regular Expressions check box, but do not check the Match Case check box, because you want a case-insensitive search on this occasion.

3. Enter the pattern the\> in the Search For text box.

4. Click the Find All button, and inspect the results, as shown in Figure 6-17.
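For readers without OpenOffice.org Writer, a rough equivalent of this search can be sketched in Python. Python's re module has no \> syntax, so the sketch substitutes \b, which matches at either end of a word; after the word characters the, it can only be an end-of-word boundary. The re.IGNORECASE flag stands in for leaving Match Case unchecked, and the lines are those of EndBoundary.txt as shown above.

```python
import re

# the\b: the letters "the" with a word boundary immediately after them.
word_final_the = re.compile(r"the\b", re.IGNORECASE)

lines = [
    "Theodore said “This is a lathe",
    "I shaved today and my new shaving cream made a good lather",
    "A lathe is a tool for turning wood or metal",
    "The Thespian Theatre is something I am loathe to attend",
    "The quick brown fox jumped over the lazy dog",
]

# Count the matches on each line.
counts = [len(word_final_the.findall(line)) for line in lines]
print(counts)  # [1, 0, 1, 2, 2]
```

Words such as Theodore, Thespian, Theatre, and lather contain the letters but do not end with them, so they contribute no matches.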
