Beginning Regular Expressions (2005), part 4


6. Click the Find First icon and then click the Find Next icon twice, observing each time what character sequence is or is not matched.

Figure 8-7 shows the appearance after the Find Next icon has been clicked twice. With the modification of the regular expression, all three occurrences of the character sequence Andrew now match.

Figure 8-7

7. Now that you know that each occurrence of the relevant string matches, you can modify the regular expression to create two groups between which you can insert the desired apostrophe to make Andrew’s possessive.

Modify the regular expression in the Match tab to (Andrew)(s)(?=\b)

8. Using the Find First and Find Next icons, confirm that all three occurrences of the slightly modified desired character sequence Andrews match.

9. Click the Replace tab. In the lower pane on the Replace tab, type $1’$2.

10. On the Test tab, click the Replace All icon, and inspect the results in the lower pane on the Test tab. (You may need to adjust the window size to see all the results.)

Figure 8-8 shows the appearance after this step.

In the second line, the character sequence Andrew is followed by s and a period character. The second lookahead constraint is satisfied.

On the third line, the character sequence Andrew is followed by an s and then a question mark. Because the question mark is in neither lookahead, the lookahead constraint is not satisfied.

When the regular expression pattern is changed to Andrew(?=s\b), when Andrew is matched, the lookahead constraint is an s followed by a word boundary. There is a word boundary between each Andrews and the following character on all three lines. In Line 1, there is a word boundary before the space character. In Line 2, there is a word boundary before the period character. In Line 3, there is a word boundary before the question mark. So each occurrence of Andrew matches.

When the regular expression is modified to (Andrew)(s)(?=\b), you capture the character sequence Andrew in $1 and capture the s in $2. The lookahead does not capture any characters. So to insert an apostrophe, you want $1 (Andrew) to be followed by an apostrophe to be followed by $2 (a lowercase s).
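The same replacement can be sketched in Python, where \1 and \2 play the role of $1 and $2. The three sample lines are assumptions made for illustration, since the original test file is not reproduced here, and a straight apostrophe stands in for the typographic one.

```python
import re

# Hypothetical sample lines; the book's test file is not reproduced here.
text = "This is Andrews book.\nThe book is Andrews.\nIs this book Andrews?"

# Group 1 captures "Andrew", group 2 captures "s".
# The lookahead (?=\b) checks for a word boundary but consumes nothing.
pattern = re.compile(r"(Andrew)(s)(?=\b)")

# \1 and \2 are Python's spelling of $1 and $2.
result = pattern.sub(r"\1'\2", text)
print(result)
```

Because the lookahead is zero-width, the character after the s (space, period, or question mark) is left untouched by the replacement.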

(?<=Dr\. )Jekyll

The component (?<=Dr\. ) indicates the sequence of characters that is tested for as a lookbehind, and the component Jekyll matches literally.

Positive Lookbehind

A positive lookbehind is a constraint on matching. Matching occurs only if the pattern to be matched is preceded by the pattern contained in the lookbehind assertion.

Try It Out Positive Lookbehind

1. Open the Komodo Regular Expression Toolkit, and delete any residual regular expression and sample text.

2. In the Enter a String to Match Against area, enter the test text, Mr. Hyde and Dr. Jekyll are characters in a famous novel.

3. In the Enter a Regular Expression area, enter the pattern (?<=Dr\. )Jekyll.

4. Inspect the highlighted text in the String to Match Against area and the description of the results in the gray area below, Match succeeded: 0 groups. Figure 8-9 shows the appearance. Notice that the sequence of characters Jekyll is highlighted.

5. Edit the regular expression pattern to read (?<=Mr\. )Jekyll.

6. Inspect the description of the results in the gray area, No matches found.

7. Edit the regular expression pattern to read ((?<=Mr\. )|(?<=Mister ))Hyde. Ensure that there is a space character after the r of Mister. If that is omitted, there will be no match.

Figure 8-9

8. Inspect the description of the results in the gray area, Match succeeded: 1 group. Also notice that the character sequence Hyde is highlighted.

9. Edit the Mr. in the test text to read Mister.

10. Inspect the gray area again. Again, the description is Match succeeded: 1 group. Figure 8-10 shows the appearance.

Figure 8-10

How It Works

The following description of how the regular expression engine operates is a conceptual one and may not reflect the approach taken by any individual regular expression engine. The text matched is the sequence of characters Jekyll.

Matching starts at the beginning of the test text. The character following the regular expression’s position is checked to see whether it is an uppercase J. If so, that is matched, and an attempt is made to match the other characters making up the sequence of characters Jekyll. If any attempt to match fails, the whole pattern fails, and the regular expression engine moves forward through the text attempting to match the character sequence Jekyll.

If a match is found for the character sequence Jekyll, the regular expression engine is at the position immediately before the J of Jekyll. It checks that the immediately preceding character is a space character. If so, it then tests if the character before that is a period character. If so, it tests if the character before that is a lowercase r. Finally, it tests if the character before that is an uppercase D. Because matching of Jekyll was successful, and the constraint that the character sequence Jekyll be preceded by the character sequence Dr. (including a space character) was satisfied, the whole regular expression succeeds.

When you edit the pattern to read (?<=Mr\. )Jekyll, the character sequence Jekyll is successfully matched as before. However, when the regular expression engine checks the characters that precede that character sequence, the constraint fails, because despite the fact (reading backward) that the space character, the period character, and the lowercase r are all present, there is no preceding uppercase M. Because the lookbehind constraint is not satisfied, there is no match.

It is possible to express alternatives in lookbehind. The problem definition might read as follows:

Match the character sequence Hyde if it is preceded by EITHER the character sequence Mr. (including a final space character) OR by the character sequence Mister (including a final space character).

After changing the pattern to read ((?<=Mr\. )|(?<=Mister ))Hyde, the regular expression engine attempts to match the character sequence Hyde. When it reaches the position immediately before the H of Hyde it will successfully match that character sequence. It then must also satisfy the constraint on the sequence of characters that precedes Hyde.

The pattern ((?<=Mr\. )|(?<=Mister ))Hyde uses parentheses to group two alternative patterns that must precede Hyde. The first option, specified by the pattern (?<=Mr\. ), requires that the sequence of four characters M, r, a period, and a space character must precede Hyde. At Step 8, that four-character sequence matches.

After the edit has been made to the test text, replacing Mr. with Mister, the other alternative comes into play. The pattern (?<=Mister ) requires that a seven-character sequence (Mister plus a space character) precedes Hyde.
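The lookbehind behavior described above can be tried outside Komodo with a short Python sketch. It assumes that the titles in the test text are written with a trailing period, as the explanation of the period check implies. Python's re module requires each lookbehind to have a fixed width, which is why the two alternatives are written as separate lookbehinds here as well.

```python
import re

text_mr = "Mr. Hyde and Dr. Jekyll are characters in a famous novel"
text_mister = text_mr.replace("Mr.", "Mister")

# Jekyll matches only when "Dr. " immediately precedes it.
jekyll_after_dr = re.search(r"(?<=Dr\. )Jekyll", text_mr)
jekyll_after_mr = re.search(r"(?<=Mr\. )Jekyll", text_mr)

# Two fixed-width lookbehinds offered as alternatives.
hyde = re.compile(r"((?<=Mr\. )|(?<=Mister ))Hyde")
hyde_mr = hyde.search(text_mr)
hyde_mister = hyde.search(text_mister)

print(jekyll_after_dr is not None)  # lookbehind satisfied
print(jekyll_after_mr is None)      # lookbehind not satisfied
print(hyde_mr.group(), hyde_mister.group())
```

In both Hyde matches the capturing group is empty, because a lookbehind tests the preceding text without consuming it.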

The positioning of the lookbehind assertion is important, as you will see in the next example.

Try It Out Positioning of Positive Lookbehind

1. Open RegexBuddy, click the Match tab, and enter the regular expression (?<=like )SQL Server.

2. Click the Test tab, click the Open File icon, and open the Databases.txt file.

3. Click the Find First icon, and inspect the highlighted text in the pane in the Test tab, as shown in Figure 8-11.

Figure 8-11

4. Edit the regular expression in the Match tab so that it reads SQL Server(?<=like ).

5. Click the Find First icon in the Test tab. Confirm that there is now no highlighted text.

6. Edit the regular expression in the Match tab so that it reads SQL Server(?<=like SQL Server).

7. Click the Find First icon in the Test tab. Confirm that there is again a match in the test text, as shown in Figure 8-12.

How It Works

When the pattern is (?<=like )SQL Server, the lookbehind looks behind, starting from the position immediately before the S of SQL. Because the character sequence like SQL Server exists in the test text, there is a match. When the pattern is SQL Server(?<=like ), the lookbehind starts from the position after the r of Server. Because that position is preceded by Server, not by like and a space character, the lookbehind constraint is not satisfied and there is no match. When the pattern is SQL Server(?<=like SQL Server), the lookbehind again starts after the r of Server, and this time the preceding characters satisfy it.
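The effect of where the lookbehind is placed can be checked with a small Python sketch. The sample sentence is an assumption standing in for Databases.txt, which is not reproduced here.

```python
import re

# Hypothetical stand-in for a line of Databases.txt.
text = "Some people like SQL Server a great deal."

# Lookbehind tested from the position just before the S of SQL.
before = re.search(r"(?<=like )SQL Server", text)

# Lookbehind tested from the position just after the r of Server.
after_short = re.search(r"SQL Server(?<=like )", text)

# The lookbehind now describes everything that precedes that later position.
after_full = re.search(r"SQL Server(?<=like SQL Server)", text)

print(before is not None, after_short is None, after_full is not None)
```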


Figure 8-12

Negative Lookbehind

Negative lookbehind is a constraint on matching. Matching occurs only if the pattern to be matched is not preceded by the pattern contained in the lookbehind assertion.

Try It Out Negative Lookbehind

Find occurrences of the character sequence SQL Server that are not preceded by the character sequence like followed by a space character.

1. Open RegexBuddy, click the Match tab, and enter the regular expression (?<!like )SQL Server.

2. Click the Test tab, click the Open File icon, and open the Databases.txt file.

3. Click the Find First icon, and inspect the highlighted text in the pane in the Test tab, as shown in Figure 8-13.

4. Look for other matches by clicking the Find Next icon several times. Note which occurrences of SQL Server match or don’t match.
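A rough Python equivalent of clicking Find Next repeatedly is to iterate over all matches. The sample text below is invented for illustration and is not the content of Databases.txt.

```python
import re

# Hypothetical text with one occurrence preceded by "like " and two that are not.
text = "I like SQL Server. SQL Server is a database. We use SQL Server daily."

# Negative lookbehind: reject any occurrence directly preceded by "like ".
matches = list(re.finditer(r"(?<!like )SQL Server", text))
starts = [m.start() for m in matches]

for start in starts:
    # Show a little context before each surviving match.
    print(repr(text[max(0, start - 5):start]), "-> match at", start)
```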


How to Match Positions

By combining lookahead and lookbehind, it is possible to match positions between characters. For example, suppose that you wanted to match a position immediately before the Andrew of the following sample text:

This is Andrews book.

You could state the problem definition as follows:

Match a position that is preceded by the character sequence is, followed by a space character, and is followed by the character sequence Andrew.

You could match that position using the following pattern:

(?<=is )(?=Andrew)

Try It Out Matching a Position

1. Open RegexBuddy. On the Match tab, type the regular expression pattern (?<=is )(?=Andrew). If you used RegexBuddy for the replace example earlier in this chapter, delete the replacement text on the Replace tab.

2. On the Test tab, enter the sample text This is Andrews book.

3. Click the Find First icon, and inspect the information in the lower pane of the Test tab, as shown in Figure 8-14. On-screen, you can see the cursor blinking at the position immediately before the initial A of Andrews.

Figure 8-14

How It Works

The regular expression engine starts at the beginning of the document and tests each position to see whether both the lookbehind and lookahead constraints are satisfied. In the test text, only the position immediately before the initial A of Andrews satisfies both constraints. It is, therefore, the only position that matches.
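In Python, a match on a position shows up as a zero-length match whose start and end offsets are equal. This minimal sketch uses the sample text from the exercise.

```python
import re

text = "This is Andrews book."

# Both assertions are zero-width, so the match consumes no characters.
m = re.search(r"(?<=is )(?=Andrew)", text)

print(m.span())         # start and end offsets are the same
print(repr(m.group()))  # the matched text is the empty string
print(text[m.start():])
```

The position after the "is " inside This is rejected, because the lookahead finds is there rather than Andrew.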

Adding Commas to Large Numbers

One of the useful ways to apply a combination of lookbehind and lookahead is adding commas to large numbers.

Assume that the sales for the fictional Star Training Company are $1,234,567. The data would likely be stored as an integer without any commas. However, for readability, commas are usual in many situations where financial or other numerical data is presented.

The process of adding commas to a large numeric value is essentially to match the position between the appropriate numeric digits and replace that position by a comma.

In some European languages, the thousands separator, which is a comma in English, is a period character. Such periods can be added to a numeric value by slightly modifying the technique presented below.

First, let’s look at a numeric value of 1234 and how you can add a comma in the appropriate place. You want to insert the comma at the position between the 1 and the 2. The reason to insert a comma in that position is that there are three numeric digits between the desired position and the end of the string.

Try It Out Adding a Comma Separator to a Four-Digit Number

1. Open RegexBuddy. On the Replace tab, enter the pattern (?<=\d)(?=\d\d\d) in the upper pane and a single comma character in the lower pane.

2. On the Test tab, click the Find First icon. Confirm that there is a match, as described in the lower pane on the Test tab.

3. Click the Replace All icon, and check the replacement text shown in the lower pane on the Test tab (see Figure 8-15). The replacement text is 1,234, which is what you want. The regular expression pattern works for four-digit numbers.

Figure 8-15

4. Edit the test text in the upper pane of the Test tab to read 1234567.

5. Click the Replace All icon, and inspect the replacement text in the lower pane of the Test tab. The replacement text is 1,2,3,4,567, which is not what you want. All the positions that have at least three numeric digits to the right have had a comma inserted, as shown in Figure 8-16.

6. Edit the pattern to (?<=\d)(?=(\d\d\d)+).

7. Click the Replace All icon, and inspect the replacement text in the lower pane of the Test tab. The undesired commas are still there.

Figure 8-16

8. Edit the pattern to (?<=\d)(?=(\d\d\d)+$).

9. Click the Replace All icon, and inspect the replacement text in the lower pane of the Test tab (see Figure 8-17). This is 1,234,567, which is what you want.

10. Depending on your data source, the pattern (?<=\d)(?=(\d\d\d)+$) may not work. Imagine if a single character (for example, a period character) follows the last digit of the number to which you wish to add commas. Edit the test text to read Monthly sales figures are 1234567.

11. Edit the regular expression on the Replace tab to read (?<=\d)(?=(\d\d\d)+\W).

Figure 8-17

How It Works

The pattern (?<=\d)(?=\d\d\d) looks for a position that follows a single numeric digit and precedes three numeric digits. In the sample text 1234, there is only one position that satisfies both the lookbehind and lookahead constraints: the position after the numeric digit 1.

When the test text is changed to 1234567, the pattern (?<=\d)(?=\d\d\d) matches several times. For example, the position following the numeric digit 2 is preceded by a numeric digit and is followed by three numeric digits. That position therefore satisfies both the lookbehind and lookahead constraints.

You need to group the numeric digits into groups of three to attempt to get rid of the undesired comma replacements. The pattern (?<=\d)(?=(\d\d\d)+) groups the numeric digits in the lookahead into threes but fails, as you saw in Figure 8-16, to prevent the unwanted commas. At the position following the numeric digit 2, there is still a sequence of three digits following that position, so the position matches. A comma is therefore inserted (although that is not appropriate to formatting norms for numbers).

When the pattern is edited to (?<=\d)(?=(\d\d\d)+$), you get the results you want. The position following the numeric digit 2 now fails to satisfy the lookahead constraint. It is followed by five numeric digits, which cannot be divided into groups of three that run to the end of the string, so the pattern (\d\d\d)+$ does not match.

However, the position after the numeric digit 1 still matches. It is followed by six numeric digits, which matches the pattern (\d\d\d)+$. Similarly, the position after the numeric digit 4 is matched, because it is followed by three numeric digits, which matches the pattern (\d\d\d)+$. In both those positions that match, a comma is inserted.
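The progression of patterns in this exercise can be replayed with Python's re.sub, which replaces each zero-width match with a comma. The helper name add_commas is an assumption introduced for this sketch, not something from RegexBuddy.

```python
import re

def add_commas(pattern: str, text: str) -> str:
    """Replace every position matched by `pattern` with a comma."""
    return re.sub(pattern, ",", text)

# Works for a four-digit number.
four_digit = add_commas(r"(?<=\d)(?=\d\d\d)", "1234")

# Too many positions qualify in a longer number.
too_many = add_commas(r"(?<=\d)(?=\d\d\d)", "1234567")

# Grouping into threes alone does not help, because nothing anchors the groups.
still_too_many = add_commas(r"(?<=\d)(?=(\d\d\d)+)", "1234567")

# Anchoring the groups to the end of the string gives the desired result.
anchored = add_commas(r"(?<=\d)(?=(\d\d\d)+$)", "1234567")

# When the number is followed by other text, \W can stand in for the anchor.
sentence = "Monthly sales figures are 1234567."
in_sentence = add_commas(r"(?<=\d)(?=(\d\d\d)+\W)", sentence)

print(four_digit, too_many, still_too_many, anchored, in_sentence, sep="\n")
```

Applying the $-anchored pattern to the sentence leaves it unchanged, since the trailing period means no run of three-digit groups reaches the end of the string.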


Conversely, matching and manipulating undesired data may well corrupt parts of your data. Whether that data corruption leads to minor typos or more serious problems depends on your data, what its intended use is, and the extent and severity of the undesired changes you unintentionally make to it. Again, the undesired effects can impact adversely on customer satisfaction. So sensitivity and specificity are issues to take seriously.

In this chapter, you will learn the following:

❑ What sensitivity and specificity are

❑ How to work out how far you should go in investing time and effort in maximizing sensitivity and/or specificity

❑ How to use regular expression techniques to give an optimal balance of sensitivity and specificity

❑ How the detail of the data source can affect sensitivity and specificity

❑ How to gain a better balance of sensitivity and specificity in the Star Training Company example

What Are Sensitivity and Specificity?

Sensitivity is the capacity to match the pattern that you want to match. Specificity is the capacity to limit the character sequences selected by a pattern to those character sequences that you want to detect.

The definitions given may feel a little abstract, so the following examples are provided to develop a clearer understanding of the ideas of sensitivity and specificity.

Extreme Sensitivity, Awful Specificity

Suppose that you want to match the character sequence ABC. It is very easy to achieve 100 percent sensitivity using the following pattern:

.*

It selects sequences of zero or more characters of any kind (in most tools, any character other than a newline).

A sample document, ABitOfEverything.txt, is shown here:

This is a random 58#Gooede garbled piece of 8983ju**nk but it is still selected

Sensitivity and specificity are terms derived from quantitative disciplines such as statistics and epidemiology. Broadly, sensitivity is a measure of the number of true hits you find divided by the total number of true hits you ought to find if you match all occurrences of the relevant character sequences, and specificity is the number of hits you find that are true hits divided by the total number of hits you find. The higher the sensitivity, the closer you are, in the context of regular expressions, to finding all true matches, and the higher the specificity, the closer you are to finding only true matches.

As you can see, there is a pretty diverse range of content, not all of which is useful. However, if you apply the regular expression pattern .* you achieve 100 percent sensitivity, because the only occurrence of the character sequence ABC is matched. However, you also select every other piece of text in the sample document, as you can see in Figure 9-1 in OpenOffice.org Writer.

Figure 9-1

I introduced this slightly silly example to make an important point. It is possible to create very sensitive regular expression patterns that achieve nothing useful. Of course, you are unlikely to use .* as a standalone pattern, but it is important to carefully consider the usefulness of the regular expression patterns you create when, typically, the issues will be significantly more subtle.

Useful regular expressions keep the 100 percent sensitivity (or something very close to 100 percent) of the .* pattern but combine it with a high level of specificity.
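A quick way to see the point is to compare .* with the literal pattern ABC in Python. The first line below is the sample line quoted earlier; the second line, which contains ABC, is an assumption added for illustration, because the rest of ABitOfEverything.txt is not reproduced here.

```python
import re

lines = [
    "This is a random 58#Gooede garbled piece of 8983ju**nk but it is still selected",
    "The sequence ABC appears here",  # hypothetical line containing the target
]

greedy = re.compile(r".*")
literal = re.compile(r"ABC")

# .* matches each whole line, wanted or not.
selected_by_greedy = [greedy.fullmatch(line) is not None for line in lines]

# The literal pattern finds only the line that actually contains ABC.
selected_by_literal = [literal.search(line) is not None for line in lines]

print(selected_by_greedy)
print(selected_by_literal)
```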


Email Addresses Example

Suppose that you have a large number of documents or an email file that you need to search for valid email addresses. The file EmailOrNotEmail.txt illustrates the kind of data that might be contained in the material you need to search. The content of EmailOrNotEmail.txt is shown here:

Figure 9-2

As the figure shows, all the valid email addresses (which are on lines 4, 5, 9, and 10) are selected. This gives you 100 percent sensitivity, at least on this test data set. In other words, you have selected every character sequence that represents a valid email address. But you have, on all the other lines, matched character sequences that are pretty obviously not email addresses. You need to find a more specific pattern to improve the specificity of matching.

Look a little more carefully at how an email address is structured. Broadly, an email address follows this structure:

of the email address. The following pattern matches, at a minimum, a single alphabetic character due to the \w+ component of the pattern:

So you could use a lookbehind to allow a match for a period character only when it has been preceded by at least one alphabetic character. This pattern would allow matching of a period character only when it is preceded by an alphabetic character:

\w*(?<=\w)\.?\w+

Try It Out Email Address

1. Open PowerGrep, and enter the pattern \w*(?<=\w)\.?\w+@.* in the Search text area.

2. Enter the folder name C:\BRegExp\Ch09 in the Folder text box. Amend, as appropriate, if you downloaded the sample files to a different directory.

3. Enter the filename EmailOrNotEmail.txt in the File Mask text box, and click the Search button.

4. Inspect the results in the Results area. Compare the matches shown in Figure 9-2 with the matches now shown in Figure 9-3, particularly noting the character sequences that no longer match.

Figure 9-3

This is an improvement. The pattern is more specific. You no longer match the undesired character sequences on lines 1, 2, 7, and 8. However, the character sequence on Line 3, John@somewhere.invalid, is not a valid email address.

You can remove that undesired match by making the hostname part of the email address more specific. How specific you want to be is a matter of judgment. For this example, assume that all hostnames have a sequence of alphabetic characters, followed by a period character, followed by three (com, net, org, or biz) or four (info) alphabetic characters. For the purposes of this example we won’t consider hostnames like example.co.uk. The following pattern would be an appropriate pattern to match hostnames that correspond to the structure just described:

\w+\.\w{3,4}

The \w+ will match even single-character domain names (which are allowed with com, net, and org domains). The \. metacharacter matches a single period character, and the \w{3,4} component matches either three or four alphabetic characters.

Combining that pattern with your earlier one gives you the following:

One way to approach this is to use a lookahead to specify that following the first match for an @ character, another @ character does not occur. If you continue to assume that only alphabetic characters are allowed in an email address, you can specify that you look ahead from the first @ character matched to the first match for a character that is not an alphabetic character or a period character.


You can do that using the following pattern:

\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}

7. Edit the pattern in the Search text area to be \w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}, and click the Search button.

8. Inspect the results. Figure 9-4 shows the appearance.

9. Modify the pattern in the Search area to be ^\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}$, and click the Search button.

10. Inspect the results. Figure 9-5 shows the appearance.

Figure 9-5

Happily, you have now succeeded in avoiding matching the undesired matches on lines 3 and 6. At least on this simple test data, you have achieved 100 percent sensitivity and 100 percent specificity.
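The final pattern from Step 9 can be exercised in Python as a rough check. The candidate strings are invented for illustration, since the contents of EmailOrNotEmail.txt are not reproduced here, and each candidate is treated as one line of the file.

```python
import re

# The anchored pattern from Step 9, unchanged.
email = re.compile(r"^\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}$")

# Hypothetical candidates, one per line of an imagined test file.
candidates = [
    "john@example.com",
    "jane.doe@example.info",
    "john@somewhere.invalid",  # top-level part longer than four characters
    "john@@example.com",       # second @ character
    "not an email",
]

results = {c: email.search(c) is not None for c in candidates}
for candidate, matched in results.items():
    print(f"{candidate!r}: {matched}")
```

The pattern is deliberately simple, so it should be read as the chapter's teaching example rather than as a general-purpose email validator.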

The terms sensitivity and specificity come from quantitative sciences, such as statistics and epidemiology. In those contexts, both the sensitivity and specificity are expressed numerically, often as percentages. So for the preceding example, you have a sensitivity of 100 percent because all true email addresses are detected using your first attempt at a regular expression pattern, and you initially have a specificity of 40 percent because 6 of the 10 matches are false matches (in the sense that they are not valid email addresses). By the end of the Try It Out example, the specificity has risen to 100 percent on the test data.
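The arithmetic behind those percentages can be written out as a small sketch that uses the chapter's own definitions. The function and parameter names are assumptions introduced here.

```python
def sensitivity(true_hits_found: int, true_hits_total: int) -> float:
    """True hits found divided by all true hits that exist in the data."""
    return true_hits_found / true_hits_total

def specificity(true_hits_found: int, hits_found: int) -> float:
    """True hits found divided by all hits found, as defined in this chapter."""
    return true_hits_found / hits_found

# First attempt: all 4 valid addresses are found, but 10 lines match in total.
first_sensitivity = sensitivity(4, 4)
first_specificity = specificity(4, 10)

# Final pattern: the same 4 valid addresses and nothing else.
final_specificity = specificity(4, 4)

print(f"{first_sensitivity:.0%} {first_specificity:.0%} {final_specificity:.0%}")
```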

Replacing Hyphens Example

This example looks at another problem that can occur if you are not careful in thinking through the meaning of a regular expression.

Assume that you have a collection of text documents that have to be converted into HTML/XHTML. This example focuses on the possible need for replacing a line of hyphens with the HTML/XHTML <hr> element to create a horizontal ruled line.

A simplified sample document, HyphenTest.txt, is used in this example:

A first attempt at expressing the problem definition might be as follows:

Replace any hyphens that occur with the character sequence <hr>

However, that is too imprecise For example, the third line would be replaced with the following:

A more precise statement of the problem definition would be as follows:

Replace any group of consecutive hyphens with the character sequence <hr>

Assume that you will omit the end tag of the hr element, because many Web browsers have problems if you use the empty element tag, <hr/>.

If you use the following regular expression pattern to express the idea of one or more hyphens, you can run into problems for two reasons:

-*

First, not all regular expression engines interpret that pattern correctly. The pattern -* means “Match zero or more hyphens,” which means that the occurrence of zero hyphens is a match. Therefore, the text Fred ought to match, which may not be what you expected. Why does Fred match? Because there are zero hyphens.

OpenOffice.org Writer implements the -* pattern as you might intuitively expect, because it matches only when at least one hyphen occurs, as shown in Figure 9-6. Strictly, it ought to match on each line, because each line has zero hyphens at the beginning.
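Python's re module follows the strict interpretation, which makes the difference between zero or more and one or more easy to observe. The row of hyphens is an assumption standing in for a line of HyphenTest.txt, which is not reproduced here.

```python
import re

# -* succeeds on text with no hyphens by matching the empty string.
zero_width = re.match(r"-*", "Fred")

# -+ requires at least one hyphen, so it finds nothing in Fred.
one_or_more = re.search(r"-+", "Fred")

# A hypothetical line of hyphens collapses to a single <hr>.
ruled = re.sub(r"-+", "<hr>", "----------")
untouched = re.sub(r"-+", "<hr>", "Fred")

print(repr(zero_width.group()), one_or_more, ruled, untouched)
```

Using the + quantifier expresses "one or more hyphens" directly and avoids the zero-length matches.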


Figure 9-7

The Sensitivity/Specificity Trade-Off

Sensitivity and specificity are always part of a trade-off. The amount of effort required to get 100 percent sensitivity and 100 percent specificity may not be practical in some situations. Some undefined “good” specificity may be enough. It’s a trade-off in that, in the end, only you can judge how much effort is appropriate for the task that you are using regular expressions to achieve.

How important are sensitivity and specificity? The answer is, “It depends.” There are many times when you will need high sensitivity, 100 percent sensitivity ideally, and at the same time you also need high specificity. At other times, one or the other may be less important. This section looks at some of the factors that influence how much importance it is relevant to place on sensitivity and specificity.

It depends to a significant extent on who the customer is. If you are using regular expressions to achieve something for your own use, you may not worry too much if you miss one or two matches. On the other hand, if you are conducting a replacement of every occurrence of a company name after a takeover, for example, it would be serious if sensitivity fell below 100 percent.

How Metacharacters Affect Sensitivity and Specificity

In general, the more metacharacters you use, the more specific a pattern becomes. The pattern cat matches that sequence of characters whether they refer to a feline mammal or form character sequences

Adding further metacharacters, such as the \b word boundary, makes the use of the character sequence cat in a pattern much more specific. The pattern \bcat\b will match only the word cat (singular).

When using specific patterns like that, you need to watch carefully for the possibility of reducing sensitivity. The pattern \bcat\b will match cat but won’t match cats, for example. If you are interested in finding all references in the document to feline mammals, the \bcat\b pattern may not be the best option. You may want to allow for the occurrence of the plural form, cats, and the possessive form, cat’s, too. The pattern \bcat’?s?’?\b would match cat, cats, cats’ (plural possessive), and cat’s (singular possessive) but would also match cat’, which is unlikely to be a desired match. If your data is unlikely to contain the character sequence cat’, the pattern \bcat’?s?’?\b may be sufficient. But if, for some reason, you want to match only cat, cats, cat’s, and cats’, some other, more specific pattern will be needed. One simple option is as follows:

(cat|cats|cat’s|cats’)

An alternative follows:

ca(t|ts|t’s|ts’)

Similar issues apply whatever the word or sequence of characters of interest.
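The difference between the loose and the explicit patterns can be checked by testing whole words in Python. This sketch uses a straight apostrophe instead of the typographic one and drops the \b anchors, testing each candidate word in full with fullmatch; both choices are simplifications made for the example.

```python
import re

loose = re.compile(r"cat'?s?'?")
explicit = re.compile(r"(cat|cats|cat's|cats')")
compact = re.compile(r"ca(t|ts|t's|ts')")

words = ["cat", "cats", "cat's", "cats'", "cat'"]

# fullmatch succeeds only if the entire word fits the pattern.
loose_hits = [w for w in words if loose.fullmatch(w)]
explicit_hits = [w for w in words if explicit.fullmatch(w)]
compact_hits = [w for w in words if compact.fullmatch(w)]

print(loose_hits)
print(explicit_hits)
print(compact_hits)
```

When the explicit alternation is used with search rather than fullmatch, engines that try alternatives from left to right will stop at cat inside cats, so listing the longer alternatives first or adding boundaries is worth considering.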

Sensitivity, Specificity, and Positional Characters

The positional characters explored in Chapter 6 can be expected in many cases to affect both sensitivity and specificity.

In the following example, an initial version of the problem definition can be expressed as follows:

Match all occurrences of the sequence of characters t, h, and e case insensitively.

The pattern the will match twice in the following text:

Paris in the the spring

It will match once in the following text:

The spring has sprung

However, suppose you modify the problem definition to the following:

Match the position at the beginning of a string; then match the sequence of characters t, h, and e case insensitively.

The pattern ^the now has no match in the first sample text but still has a single match in the second. The effect of adding one or more positional metacharacters depends on the data the pattern is being matched against.

Sensitivity, Specificity, and Modes

When you specify that a regular expression is to be executed in a case-insensitive or case-sensitive mode, you affect the matches that will be returned. Continuing with the preceding example, the pattern ^the applied case sensitively has no match in either of the two test pieces of text. In the second sample text, the ^ metacharacter matches the position at the beginning of the string, but the lowercase t of the pattern does not match the uppercase T of the test text.
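The combined effect of the ^ anchor and the matching mode on these two sample texts can be tabulated with a short Python sketch. The variable names are illustrative only.

```python
import re

first = "Paris in the the spring"
second = "The spring has sprung"

# Unanchored, case-insensitive: counts every occurrence of t, h, e.
count_first = len(re.findall(r"the", first, re.IGNORECASE))
count_second = len(re.findall(r"the", second, re.IGNORECASE))

# Anchored, case-insensitive: only a match at the start of the string counts.
anchored_first = re.search(r"^the", first, re.IGNORECASE)
anchored_second = re.search(r"^the", second, re.IGNORECASE)

# Anchored, case-sensitive (Python's default mode).
strict_first = re.search(r"^the", first)
strict_second = re.search(r"^the", second)

print(count_first, count_second)
print(anchored_first, anchored_second)
print(strict_first, strict_second)
```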

Similarly, the use of the period metacharacter (which matches a large range of characters) can be switched to match or not match a newline character.

Sensitivity, Specificity, and Lookahead and Lookbehind

When you add lookahead or lookbehind to an existing regular expression, you may have no effect on sensitivity, or you may adversely impact it. Equally, you may improve specificity or, less likely, it may stay the same.

If a lookbehind is carefully crafted, it won’t reduce sensitivity. However, if you make an error in the pattern inside the lookbehind, you will fail to match when you intended to match, reducing sensitivity. Suppose that you wanted to find information about Anne Smith. The following pattern would match when the spelling of Anne is correct, and it is followed by exactly one space character:

Similarly, lookahead can reduce sensitivity. For example, suppose that you want to match all occurrences of the character sequence John. The following pattern would match a word boundary, then the desired character sequence John, and then check if the following character is a space character:

\bJohn(?= )

However, if the test text is as follows, the lookahead is too specific and causes what is likely to be a desired match to fail:

I went with John, and Mary on a trip

Modifying the lookahead to (?=\b) or (?=\W) would prevent the problem caused by the occurrence of an unanticipated comma.
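The three lookaheads can be compared directly in Python against the sample sentence.

```python
import re

text = "I went with John, and Mary on a trip"

# Requires a space after John, so the comma defeats it.
space_only = re.search(r"\bJohn(?= )", text)

# Requires only a word boundary after John.
boundary = re.search(r"\bJohn(?=\b)", text)

# Requires any non-word character after John.
non_word = re.search(r"\bJohn(?=\W)", text)

print(space_only, boundary.group(), non_word.group())
```

One difference between the two fixes is that (?=\W) needs an actual character after John, so it would fail if John were the last word in the text, whereas (?=\b) would still succeed there.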

How Much Should the Regular Expressions Do?

expressions as a developer, you will typically be using regular expressions inside code written in Java, JavaScript, VB.NET, and so on, or you may be applying regular expressions to data retrieved from a relational database. So how much should you expect the regular expressions to do, and how much can you safely assume that your other code or the error checking in a database already does?

For example, suppose you have a collection of HTML documents that include IP addresses, and your task is to amend the style that the IP addresses are displayed in. Suppose that initially, IP addresses are nested inside the start and end tags for HTML b elements, as in the following:

What pattern should you use to find such IP addresses? Should you just assume that the data you receive will be correctly formed (including having no values of 256 or more), or should you include a more complex pattern so that the regular expression will match only correctly formed IP addresses?

If you assume that the IP addresses are already correctly formed or are checked by some other part of your code, you could use a fairly simple pattern such as the following:

Knowing the Data, Sensitivity, and Specificity

One of the key issues that affect how well you achieve sensitivity and specificity is how well you understand the data to which you are applying regular expressions. Of course, your understanding of the regular expression syntax and techniques supported by your chosen language or tool is important, too.

But if you don’t really understand the data you are working with, even a regular expression with correct syntax can turn up unexpected results, by lowering either sensitivity or specificity.

Abbreviations

Abbreviations can pose significant potential for lowering the sensitivity of a regular expression. For example, titles such as Dr (with no period character) and Dr. (with a period character) are frequently used as abbreviations for Doctor. In some circumstances, you may be confident that only one form is used in the data source. If all three forms occur in the data, a pattern like the following will be necessary to avoid missing some desired matches:

(Doctor|Dr\.|Dr)
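As a minimal sketch, the alternation can be tried in Python (the choice of Python and the sample strings are assumptions made for illustration). It also shows why the period is best escaped: an unescaped period matches any character, so the matched text can swallow the space after Dr.

```python
import re

# The alternation with the period escaped, so it matches only a literal ".".
title = re.compile(r"(Doctor|Dr\.|Dr)")
# The same alternation with an unescaped period, for comparison.
loose = re.compile(r"(Doctor|Dr.|Dr)")

# Hypothetical sample strings, one per form of the title.
samples = ["Doctor Jones", "Dr. Jones", "Dr Jones"]

matched = [title.search(s).group(1) for s in samples]
loose_matched = [loose.search(s).group(1) for s in samples]

print(matched)        # ['Doctor', 'Dr.', 'Dr']
print(loose_matched)  # ['Doctor', 'Dr.', 'Dr ']
```

The difference matters most in a replace operation, where the extra space captured by the unescaped version would be replaced along with the title.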

Similar issues arise when handling data that includes information about qualifications. For example, if a Doctor of Philosophy degree is of interest, it will often be written as PhD (no space character or period character), Ph.D. (two period characters), or Ph. D. (one space character, two period characters).

To match the options just mentioned, a pattern such as the following would be satisfactory:

Ph\.?\ ?D\.?

It includes the \. metacharacter twice with a ? quantifier, which matches each of the optional period character(s) that can occur in some of the options. Depending on where the degree was obtained, the form D.Phil. (two period characters) with option DPhil (no period characters) can also occur. To allow for these additional forms, a pattern such as the following would be needed:

(Ph\.?\ ?D\.?|D\.?Phil\.?)
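The following Python sketch checks a degree pattern against each written form mentioned in this section. It assumes the variant of the pattern that allows an optional period after Ph and an optional space before D; the list of forms and the counterexample MPhil are chosen for illustration.

```python
import re

# Optional period after "Ph", optional space, optional period after "D",
# or the "D.Phil." / "DPhil" variant. The optional space is an assumption
# made so that "Ph. D." is also covered.
degree = re.compile(r"(Ph\.?\ ?D\.?|D\.?Phil\.?)")

forms = ["PhD", "Ph.D.", "Ph. D.", "D.Phil.", "DPhil"]

# fullmatch requires the whole string to fit the pattern.
results = {form: bool(degree.fullmatch(form)) for form in forms}
print(results)

# A different qualification should not match.
print(degree.fullmatch("MPhil"))  # None
```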

Characters from Other Languages

The focus of this book is the use of regular expressions with English, including U.S. English and British English. However, with the increasing globalization of trade, the inclusion of words and characters from other languages commonly occurs in documents that are, for the most part, written in English.

In Canada, many official documents are in French. Therefore, many characters with accents will be routinely encountered.

In documents written in English, there can be differences in how words are written. For example, the test text

“Nostalgia is not what it used to be.” That is my favorite cliche.

might equally have been written as follows:

“Nostalgia is not what it used to be.” That is my favorite cliché.

The second version includes the character é (e with an acute accent) just before the period, which concludes the sentence.

To match both forms, you would need to use a pattern such as the following:

clich(e|é)

Foreign characters introduce other issues when they occur in HTML. The sample document, EAcute.html, uses the notation &eacute; instead of the literal character:

<h2>"Nostalgia is not what it used to be." That is my favorite clich&eacute;.</h2>

Yet as you can see in Figure 9-8, the correct character is displayed on the Web page.

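A short Python sketch illustrates the issue. It assumes that the notation used in the HTML source is the character reference &eacute; and that the markup is available as a string; the variable names are hypothetical. The pattern matches both plain-text forms but not the raw markup, because the source contains the reference rather than the character itself.

```python
import html
import re

cliche = re.compile(r"clich(e|é)")

plain = "That is my favorite cliche."
accented = "That is my favorite cliché."
# Assumed form of the HTML source, using a character reference.
markup = "<h2>That is my favorite clich&eacute;.</h2>"

print(bool(cliche.search(plain)))     # True
print(bool(cliche.search(accented)))  # True
print(bool(cliche.search(markup)))    # False: the source holds the reference, not é

# Decoding character references first lets the same pattern match.
decoded = html.unescape(markup)
print(bool(cliche.search(decoded)))   # True
```

Decoding before matching is one option; another would be to add the reference as a third alternative in the pattern. Which is preferable depends on whether the markup must be left untouched.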
Sa?ura(v|bh)

Similar considerations apply in other foreign names. The Russian name for Peter, sometimes transliterated as Pyotr, may also be found spelled as Petr or Pëtr, or even translated as Peter, and may need to be matched in all instances. To match all these possible forms of the name, you might use a pattern like this:

(P(yo|e|ë)tr|Peter)


Some European surnames have variant spellings too. For example, the surname Van Nistelrooy (with an intermediate space character) can also be spelled Van Nistelrooij or VanNistelrooy (with no intermediate space character). So a pattern such as the following would be needed to match these three spelling variants:

Van *Nistelroo(ij|y)

Of course, because some such surnames may sometimes be spelled with a lowercase v in van, the following pattern might be more sensitive in some situations:

[vV]an *Nistelroo(ij|y)
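To see the gain in sensitivity, the two surname patterns can be compared in a short Python sketch. The list of spellings is taken from the discussion above, with the lowercase form added as an assumed fourth variant.

```python
import re

capital_only = re.compile(r"Van *Nistelroo(ij|y)")
either_case = re.compile(r"[vV]an *Nistelroo(ij|y)")

variants = ["Van Nistelrooy", "Van Nistelrooij", "VanNistelrooy", "van Nistelrooy"]

# Collect the spellings that each pattern fails to match in full.
missed_by_capital = [v for v in variants if not capital_only.fullmatch(v)]
missed_by_either = [v for v in variants if not either_case.fullmatch(v)]

print(missed_by_capital)  # ['van Nistelrooy']
print(missed_by_either)   # []
```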

Sensitivity and How to Achieve It

To achieve maximum sensitivity, you must be aware of all the variant character sequences that can be used to express the character sequence that you want to match.

Each time you add some component to a pattern that makes it more specific, you need to carefully consider whether, given the data you are working with, it might also cause some desired matches to fail.

Specificity and How to Maximize It

Conceptually, the way to maximize specificity is to make the regular expression as specific as possible. There are many techniques to cut out unwanted matches, several of which have been discussed earlier in this chapter.

When attempting to maximize specificity, it is important to give careful consideration to situations that you don’t want to match and to construct a pattern that excludes those unwanted character sequences from matching. Achieving high specificity involves having an understanding of regular expression syntax and the effects of the techniques available to you, and understanding how those techniques affect the data you are working with.

Revisiting the Star Training Company Example

In Chapter 1, you looked at an example that posed a challenge to a new recruit to the fictional Star Training Company. Having learned a range of techniques in Chapters 2 through 7, you are now in a much better position to avoid many of the pitfalls that occurred when a simple find and replace was attempted in Chapter 1.

For convenience, the sample text, StarOriginal.txt, is reproduced here:


Star Training Company

Starting from May 1st Star Training Company is offering a startling special offer to our regular customers - a 20% discount when 4 or more staff attend a single Star Training Company course.

In addition, each quarter our star customer will receive a voucher for a free holiday away from the pressures of the office. Staring at a computer screen all day might be replaced by starfish and swimming in the Seychelles.

Once this offer has started and you hear about other Star Training customers enjoying their free holiday you might feel left out. Don’t be left on the outside staring in. Start right now building your points to allow you to start out on your very own Star Training holiday.

Reach for the star. Training is valuable in its own right but the possibility of a free holiday adds a startling new dimension to the benefits of Star Training training.

Don’t stare at that computer screen any longer. Start now with Star. Training is crucial to your company’s wellbeing. Think Star.

The problem definition can be expressed as follows:

Match all occurrences of the character sequence S, t, a, and r when that character sequence refers to the Star Training Company. Replace each occurrence of the preceding character sequence with the character sequence M, o, o, and n.

The objective is to replace all references to the fictional Star Training Company with corresponding references to the equally fictional Moon Training Company.

When faced with a task like this in real life, it can be helpful to view a few sample documents in a text editor or word processor with search facilities. That allows you to enter a pattern to look for occurrences of character sequences that might be relevant. In this case, you can use the simple literal pattern star (all lowercase) and use regular expressions matching in a case-insensitive way.

Try It Out Replacing Star with Moon

1. Open the file StarOriginal.txt in OpenOffice.org Writer.

2. Open the Find & Replace dialog box using Ctrl+F.

3. Check the Regular Expressions check box, but leave the Match Case check box unchecked, because you want to find all occurrences of the specified pattern in a case-insensitive way.

4. Type the pattern star in the Search For text box, and click the Find All button.

5. Inspect the matches shown in Figure 9-9, paying careful attention to any occurrences of the character sequence star that refer to the Star Training Company.
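If you prefer to explore the data in code rather than in a word processor, the same kind of survey can be sketched in Python. This is an illustration only, not part of the OpenOffice.org Writer steps; the excerpt is a shortened, assumed portion of the sample text, and the pattern is widened with \w* so that each hit shows the whole word containing star.

```python
import re

# A shortened excerpt of the sample text (assumed for illustration).
excerpt = (
    "Starting from May 1st Star Training Company is offering a startling "
    "special offer. Staring at a computer screen all day might be replaced "
    "by starfish and swimming in the Seychelles. Think Star."
)

# \w* on either side widens each hit to the whole word that contains "star".
words = re.findall(r"\w*star\w*", excerpt, flags=re.IGNORECASE)
print(words)
# ['Starting', 'Star', 'startling', 'Staring', 'starfish', 'Star']
```

Listing whole words this way makes it easier to sort the hits into those that refer to the company and those that do not.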

Let’s take time to list the character sequences that you want to match. You want to match star in the following:

Star Training

Star

You want to avoid matching star in the following character sequences:

Starting
startling
star customer
Staring
starfish
started
Start right
start out
star
startling
stare
Start now

I don’t routinely take time to lay out desired matches in a list and undesired matches in a second list. But particularly when you need to get things as close to 100 percent sensitivity and 100 percent specificity as possible, it makes a lot of sense to make lists like this.

Splitting character sequences into desired matches and undesired matches can be really helpful in working out how sensitive and specific any pattern will prove to be.

If you decide that a lookahead is the way to proceed (as it probably is), you could try to match all desired matches using the following pattern:

Star(?= Training)

However, if you look at the list of desired matches, you can see immediately that the preceding pattern will fail in a sentence such as Think Star. That’s one of the occurrences of Star followed by a period character.

The following pattern, which offers alternation of two lookaheads, fits all the desired matches that you have seen in the sample text:

Star((?= Training)|(?=\.))

Thus, as judged by the sample text, you have 100 percent sensitivity. Figure 9-10 shows the preceding pattern being tested against the character sequence Star.
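The effect of the two lookaheads on a replace operation can be sketched in Python. The text below is an assumed, shortened mix of sentences from the sample document. Because a lookahead consumes no characters, only the four letters of Star are replaced, and the space or period that follows is left in place.

```python
import re

star = re.compile(r"Star((?= Training)|(?=\.))")

# Shortened, assumed excerpt that mixes desired and undesired sequences.
text = (
    "Starting from May 1st Star Training Company is offering a startling "
    "special offer. Reach for the star. Start now with Star. Think Star."
)

# Lookaheads consume nothing, so only "Star" itself is replaced.
replaced = star.sub("Moon", text)
print(replaced)
```

With case-sensitive matching, Starting, startling, star. and Start are left alone, while the three references to the company are changed.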

It is always wise to consider that the test data you have looked at doesn’t hold all the likely or possible character sequences that you need to think about. One of the exercises in this chapter asks you to modify the preceding pattern to allow for other possible occurrences that might be relevant to the uses of Star that are of interest.

The patterns that you want to match are, in general, different from those that you want not to match. So it is generally straightforward to be sure that the pattern does not match any of the undesired character sequences, with one exception: You want to match the five-character sequence of characters Star. (with an initial uppercase S) but not match the five-character sequence of characters star. (with an initial lowercase s).

If you use the preceding pattern in matching that is case sensitive, there is no problem. The undesired character sequence star. does not match. However, if the matching is case insensitive, the undesired character sequence star. will match, lowering the specificity of the chosen pattern.
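A two-line comparison in Python makes the loss of specificity visible. The sentence is an assumed excerpt containing both the undesired star. and the desired Star., and the only difference between the two searches is the case-insensitivity flag.

```python
import re

pattern = r"Star((?= Training)|(?=\.))"
sentence = "Reach for the star. Think Star."

# Case-sensitive matching finds only the reference to the company.
sensitive = [m.group(0) for m in re.finditer(pattern, sentence)]
# Case-insensitive matching also picks up the ordinary word "star".
insensitive = [m.group(0) for m in re.finditer(pattern, sentence, re.IGNORECASE)]

print(sensitive)    # ['Star']
print(insensitive)  # ['star', 'Star']
```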

Figure 9-10

Exercises

Test your understanding of the material in this chapter using the following exercises:

1. Modify the pattern ^\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}$, which was developed earlier in the chapter for matching email addresses, so that it matches only hostnames in the com, net, and org domains.

2. Modify the pattern in the Star Training Company example to match the character sequence star so that it also matches in data like the following:

What do you think of Star?

The best training company is Star!


a project over a period of time, you come face to face with a third truism: Regular expressions are hard to maintain. The purpose of this chapter is to help you take steps to minimize the effects of these three truisms.

A basic consideration that is important not to forget is that regular expressions never occur in isolation. They are always used to work on data, whether simple or extensive, and are used in the context of a tool or a programming language. In addition, the developer has a specific purpose, sometimes a complex or subtle business purpose, for the regular expressions that he writes.

The problems that arise when using regular expressions can be due simply to being unable to write patterns that express the matching characteristics that you want. Ideally, as you work through this book, that problem will become less and less common.

In this chapter, you will learn the following:

❑ How to document regular expressions

❑ How to explore the data you are working with

❑ How to create test cases for regular expressions

❑ How to debug regular expressions


Documenting Regular Expressions

Any programming project of significant size can benefit from good documentation. It makes the purpose of many aspects of the project clear and can assist in further development of the code at a future date. Given the compact, cryptic nature of regular expression syntax, it makes good sense seriously to consider documenting your approach to the creation of a particular regular expression and what you expect the parts of the regular expression to do.

In many circumstances, your use of regular expressions may be on a very small scale, where it is tempting to avoid any documentation. Sometimes, no documentation is the only sensible approach. For example, in some situations, such as using regular expressions in Microsoft Word or OpenOffice.org Writer, documenting a regular expression is overkill. You want to find or replace a character sequence there and then in a single document. Formal documentation is unnecessary.

However, in more significant tasks or projects, creating documentation can be a useful discipline, serving to make explicit aspects of the task that you might otherwise be tempted to allow to remain ambiguous.

Document the Problem Definition

The problem definition is a key component in recording your thought process while designing a regular expression pattern. As mentioned in earlier chapters, you may well not get the problem definition sufficiently precise the first time round. If the problem is a complex one, it may be worth recording a problem definition that isn’t what you want so that if you come back to the code in a few months’ time, you will be reminded of the work you needed to do while designing the regular expression pattern.

A first attempt at a problem definition might be very nonspecific or expressed in a way that doesn’t immediately allow you to define a pattern that matches what you intend.

A first attempt at a problem definition to solve the Star Training Company problem in Chapter 1 might be as follows:

Replace Star with Moon

A brute-force search and replace can cause a substantial number of inappropriate changes. If you made such inappropriate changes across a large number of documents in the absence of recent backups, it could take a considerable amount of time to rectify the problems that poor use of a literal regular expression caused.
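The kind of damage a brute-force replacement causes can be seen in a short Python sketch. The sentence is taken from the Star Training Company sample text; the use of a case-insensitive literal pattern is an assumption about how such a naive replacement might be carried out.

```python
import re

text = (
    "Starting from May 1st Star Training Company is offering "
    "a startling special offer."
)

# A literal pattern applied case insensitively, with no regard to context.
naive = re.sub(r"star", "Moon", text, flags=re.IGNORECASE)
print(naive)
# Moonting from May 1st Moon Training Company is offering a Moontling special offer.
```

Only one of the three replacements was wanted. Across many documents, finding and undoing the other two by hand is the time-consuming part.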

Refining a problem definition depends on an understanding of the data You might have text like the following:

Star Training Company

I highly recommend Star

Why not accept this special offer from Star?

recent course with Star - which was great!

You can see the different ways in which desired matches can be expressed. You must understand the data to be able to construct a pattern that will match (and then replace) all of these.

On the other hand, there may be similar text that you want to leave alone:

The trainer was good - a real star!

The training was excellent - star training

Star performer among the trainers
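Lists like these translate naturally into test cases. The sketch below, in Python, runs the two-lookahead pattern Star((?= Training)|(?=\.)) against a subset of the lines above and reports desired lines that are missed and undesired lines that are matched. The choice of pattern and the variable names are assumptions made for illustration.

```python
import re

star = re.compile(r"Star((?= Training)|(?=\.))")

# A subset of the desired and undesired lines listed above.
desired = [
    "Star Training Company",
    "Why not accept this special offer from Star?",
    "recent course with Star - which was great!",
]
undesired = [
    "The trainer was good - a real star!",
    "The training was excellent - star training",
    "Star performer among the trainers",
]

# Desired lines the pattern fails to match lower sensitivity.
missed = [line for line in desired if not star.search(line)]
# Undesired lines the pattern does match lower specificity.
false_hits = [line for line in undesired if star.search(line)]

print(missed)
print(false_hits)
```

Here the pattern leaves all the undesired lines alone but misses the two desired lines in which Star is followed by a question mark or by a space and a hyphen, which shows that the problem definition still needs refining for this data.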

Again, if you don’t take time to understand undesired possible matches, you may end up making inappropriate changes to the documents you are working with.

Add Comments to Your Code

Adding comments to your code is a basic task. Try to make comments as meaningful as possible, and try to make them express what the pattern you create is expected to do.

Comments such as the following are pretty useless, particularly when you come back to the code to find out why it isn’t doing that:

// This will replace Star with Moon

Make the comments meaningful, such as in the following example:

// This matches Star case sensitively, avoiding words like start and star
// It matches when Star is followed by a space character and the character sequence Training
// or followed by a period (full stop)
// or followed by a question mark

Comments like these give a much clearer idea of what was intended and should correspond pretty closely to components of the regular expression pattern.

If you make a false start of some kind in attempting to solve a problem, it can also be useful to include a comment about what doesn’t work and why. While it can be embarrassing to admit a mistake in your thinking, being upfront about the problem is better than wasting time a few weeks later by going down the same blind alley.

Making Use of Extended Mode

When I write code in JavaScript, Java, Visual Basic .NET, and various other programming languages I space the components of the code out and indent nested components so that the structure of the code is easily discerned. I would never consider jamming sizeable chunks of code onto a single line if it was avoidable, because that is much harder to read. Making code readable and adding comments where they are most relevant make the coding and maintenance experience a much smoother one.

One of the key advantages of comments on ordinary code is that you can place the comments right next to the component of the code to which the comments relate. It’s far less useful to have comments that are a screen or two away from the code to which they refer. A similar problem can occur in many regular expression implementations, where you simply cannot put the comments adjacent to the code that they refer to.

Extended mode is available in languages such as Perl, Java, and PHP. It allows you to include comments on the same line as the pattern component that they describe. Keeping a piece of code right next to its description helps reduce the chance that the code is misunderstood.

Extended mode in Perl is indicated by the x modifier following the second forward slash of the m// operator.
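Python is not one of the languages named above, but it offers a comparable facility through the re.VERBOSE flag, so the idea can be sketched there. The pattern below is an assumed example that mirrors the meaningful comments suggested earlier for the Star Training Company problem; it is an illustration, not a pattern taken from the sample files.

```python
import re

# In verbose mode, unescaped whitespace in the pattern is ignored and
# "#" starts a comment, so each component can carry its own description.
star = re.compile(
    r"""
    Star            # the literal name, matched case sensitively
    (?=             # lookahead: what follows must be one of
        [ ]Training #   a space character and the sequence Training
      | \.          #   or a period (full stop)
      | \?          #   or a question mark
    )
    """,
    re.VERBOSE,
)

for sample in ["Star Training", "Think Star.", "offer from Star?", "Starting", "a real star!"]:
    print(sample, bool(star.search(sample)))
```

Note that the space before Training is written as [ ] because a bare space would be ignored in verbose mode.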

To match input from two known users, you could use a simple program such as JimOrFred.pl:

#!/usr/bin/perl -w
use strict;

print "This program will say 'Hello' to Jim or Fred.\n";
my $myPattern = "^(Jim|Fred)\$";
# The pattern matches only 'Jim' or 'Fred'. Nothing else is allowed.
print "Enter your first name here: ";
