6. Click the Find First icon and then click the Find Next icon twice, observing each time what character sequence is or is not matched.
Figure 8-7 shows the appearance after the Find Next icon has been clicked twice. With the modification of the regular expression, all three occurrences of the character sequence Andrew now match.
Figure 8-7
7. Now that you know that each occurrence of the relevant string matches, you can modify the regular expression to create two groups between which you can insert the desired apostrophe to make Andrew’s possessive. Modify the regular expression in the Match tab to (Andrew)(s)(?=\b).
8. Using the Find First and Find Next icons, confirm that all three occurrences of the slightly modified desired character sequence Andrews match.
9. Click the Replace tab. In the lower pane on the Replace tab, type $1’$2.
10. On the Test tab, click the Replace All icon, and inspect the results in the lower pane on the Test tab. (You may need to adjust the window size to see all the results.)
Figure 8-8 shows the appearance after this step.
In the second line, the character sequence Andrew is followed by s and a period character. The second lookahead constraint is satisfied.
On the third line, the character sequence Andrew is followed by an s and then a question mark. Because the question mark is in neither lookahead, the lookahead constraint is not satisfied.
When the regular expression pattern is changed to Andrew(?=s\b), when Andrew is matched, the lookahead constraint is an s followed by a word boundary. There is a word boundary between each Andrews and the following character on all three lines. In Line 1, there is a word boundary before the space character. In Line 2, there is a word boundary before the period character. In Line 3, there is a word boundary before the question mark. So each occurrence of Andrew matches.
When the regular expression is modified to (Andrew)(s)(?=\b), you capture the character sequence Andrew in $1 and capture the s in $2. The lookahead does not capture any characters. So to insert an apostrophe, you want $1 (Andrew) to be followed by an apostrophe to be followed by $2 (a lowercase s).
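The same replacement can be tried outside RegexBuddy. The following is a minimal Python sketch, assuming three hypothetical sample lines (not the book's test file) in which Andrews is followed by a space, a period, and a question mark. Python writes the group references as \1 and \2 rather than $1 and $2, and a plain ASCII apostrophe is used here.

```python
import re

# Hypothetical sample lines: Andrews followed by a space, a period, a question mark.
sample = "This is Andrews book.\nThe book is Andrews.\nIs this book Andrews?"

# Two capturing groups, then a lookahead that consumes nothing.
pattern = re.compile(r"(Andrew)(s)(?=\b)")

# \1 and \2 play the role of $1 and $2 in the replacement text.
possessive = pattern.sub(r"\1'\2", sample)
print(possessive)
```

Because the lookahead captures nothing, only the two groups appear in the replacement, and the character after the s is left untouched.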
(?<=Dr\. )Jekyll
The component (?<=Dr\. ) indicates the sequence of characters that is tested for as a lookbehind, and the component Jekyll matches literally.
Positive Lookbehind
A positive lookbehind is a constraint on matching. Matching occurs only if the pattern to be matched is preceded by the pattern contained in the lookbehind assertion.
Try It Out Positive Lookbehind
1. Open the Komodo Regular Expression Toolkit, and delete any residual regular expression and sample text.
2. In the Enter a String to Match Against area, enter the test text, Mr. Hyde and Dr. Jekyll are characters in a famous novel.
3. In the Enter a Regular Expression area, enter the pattern (?<=Dr\. )Jekyll.
4. Inspect the highlighted text in the String to Match Against area and the description of the results in the gray area below, Match succeeded: 0 groups. Figure 8-9 shows the appearance. Notice that the sequence of characters Jekyll is highlighted.
5. Edit the regular expression pattern to read (?<=Mr\. )Jekyll.
6. Inspect the description of the results in the gray area, No matches found.
7. Edit the regular expression pattern to read ((?<=Mr\. )|(?<=Mister ))Hyde. Ensure that there is a space character after the r of Mister. If that is omitted, there will be no match.
Figure 8-9
8. Inspect the description of the results in the gray area, Match succeeded: 1 group. Also notice that the character sequence Hyde is highlighted.
9. Edit the Mr. in the test text to read Mister.
10. Inspect the gray area again. Again, the description is Match succeeded: 1 group. Figure 8-10 shows the appearance.
Figure 8-10
How It Works
The following description of how the regular expression engine operates is a conceptual one and may not reflect the approach taken by any individual regular expression engine. The text matched is the sequence of characters Jekyll.
Matching starts at the beginning of the test text. The character following the regular expression’s position is checked to see whether it is an uppercase J. If so, that is matched, and an attempt is made to match the other characters making up the sequence of characters Jekyll. If any attempt to match fails, the whole pattern fails, and the regular expression engine moves forward through the text attempting to match the character sequence Jekyll.
If a match is found for the character sequence Jekyll, the regular expression engine is at the position immediately before the J of Jekyll. It checks that the immediately preceding character is a space character. If so, it then tests if the character before that is a period character. If so, it tests if the character before that is a lowercase r. Finally, it tests if the character before that is an uppercase D. Because matching of Jekyll was successful, and the constraint that the character sequence Jekyll be preceded by the character sequence Dr. (including a space character) was satisfied, the whole regular expression succeeds.
When you edit the pattern to read (?<=Mr\. )Jekyll, the character sequence Jekyll is successfully matched as before. However, when the regular expression engine checks the characters that precede that character sequence, the constraint fails, because despite the fact (reading backward) that the space character, the period character, and the lowercase r are all present, the character before those is an uppercase D, not the uppercase M that the lookbehind requires. Because the lookbehind constraint is not satisfied, there is no match.
It is possible to express alternatives in lookbehind. The problem definition might read as follows:
Match the character sequence Hyde if it is preceded by EITHER the character sequence Mr. (including a final space character) OR by the character sequence Mister (including a final space character).
After changing the pattern to read ((?<=Mr\. )|(?<=Mister ))Hyde, the regular expression engine attempts to match the character sequence Hyde. When it reaches the position immediately before the H of Hyde it will successfully match that character sequence. It then must also satisfy the constraint on the sequence of characters that precedes Hyde.
The pattern ((?<=Mr\. )|(?<=Mister ))Hyde uses parentheses to group two alternative patterns that must precede Hyde. The first option, specified by the pattern (?<=Mr\. ), requires that the sequence of four characters M, r, a period, and a space character must precede Hyde. At Step 8, that four-character sequence matches.
After the edit has been made to the test text, replacing Mr. with Mister, the other alternative comes into play. The pattern (?<=Mister ) requires that a seven-character sequence (Mister plus a space character) precedes Hyde.
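The lookbehind behavior described above can be checked with a short script. This is a sketch in Python rather than in the Komodo toolkit, and it assumes that the test text contains a period after Mr and Dr, as the explanation of the matching process describes. The helper name matched_text is hypothetical.

```python
import re

text = "Mr. Hyde and Dr. Jekyll are characters in a famous novel."

def matched_text(pattern, subject):
    """Return the text matched by pattern in subject, or None if there is no match."""
    found = re.search(pattern, subject)
    return found.group(0) if found else None

# Jekyll is preceded by "Dr. ", so the first lookbehind is satisfied.
dr_result = matched_text(r"(?<=Dr\. )Jekyll", text)
# Jekyll is not preceded by "Mr. ", so the constraint fails.
mr_result = matched_text(r"(?<=Mr\. )Jekyll", text)

# Alternative lookbehinds: either "Mr. " or "Mister " may precede Hyde.
alternatives = r"((?<=Mr\. )|(?<=Mister ))Hyde"
hyde_mr = matched_text(alternatives, text)
hyde_mister = matched_text(alternatives, text.replace("Mr.", "Mister"))

print(dr_result, mr_result, hyde_mr, hyde_mister)
```

Python requires each lookbehind to have a fixed width, which is why the two alternatives are written as two separate lookbehinds inside one group, as in the pattern in the text.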
The positioning of the lookbehind assertion is important, as you will see in the next example.
Try It Out Positioning of Positive Lookbehind
1. Open RegexBuddy, click the Match tab, and enter the regular expression (?<=like )SQL Server.
2. Click the Test tab, click the Open File icon, and open the Databases.txt file.
3. Click the Find First icon, and inspect the highlighted text in the pane in the Test tab, as shown in Figure 8-11.
Figure 8-11
4. Edit the regular expression in the Match tab so that it reads SQL Server(?<=like ).
5. Click the Find First icon in the Test tab. Confirm that there is now no highlighted text.
6. Edit the regular expression in the Match tab so that it reads SQL Server(?<=like SQL Server).
7. Click the Find First icon in the Test tab. Confirm that there is again a match in the test text, as shown in Figure 8-12.
How It Works
When the pattern is (?<=like )SQL Server, the lookbehind looks behind, starting from the position immediately before the S of SQL. Because the character sequence like SQL Server exists in the test text, there is a match. When the pattern is SQL Server(?<=like ), the lookbehind starts from the position after the r of Server. Because that position is preceded by the character sequence Server, not by like followed by a space character, the lookbehind constraint is not satisfied, and there is no match.
Figure 8-12
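The effect of moving the lookbehind can be reproduced in a few lines of Python. This is only a sketch: the sample sentence below is hypothetical and stands in for the Databases.txt file, whose content is not shown here.

```python
import re

# Hypothetical stand-in for a line of Databases.txt.
line = "Databases like SQL Server and MySQL are widely used."

# Lookbehind before the literal text: tested at the position before the S of SQL.
before = re.search(r"(?<=like )SQL Server", line)

# Lookbehind after the literal text: tested at the position after the r of Server,
# where the preceding characters are "erver", not "like ".
after = re.search(r"SQL Server(?<=like )", line)

# Lookbehind after the literal text, but describing everything that precedes that position.
after_full = re.search(r"SQL Server(?<=like SQL Server)", line)

print(bool(before), bool(after), bool(after_full))
```

A lookbehind is always evaluated from wherever the engine currently stands, so the same assertion can succeed or fail depending only on where it is placed in the pattern.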
Negative Lookbehind
Negative lookbehind is a constraint on matching. Matching occurs only if the pattern to be matched is not preceded by the pattern contained in the lookbehind assertion.
Try It Out Negative Lookbehind
Find occurrences of the character sequence SQL Server that are not preceded by the character sequence like followed by a space character.
1. Open RegexBuddy, click the Match tab, and enter the regular expression (?<!like )SQL Server.
2. Click the Test tab, click the Open File icon, and open the Databases.txt file.
3. Click the Find First icon, and inspect the highlighted text in the pane in the Test tab, as shown in Figure 8-13.
4. Look for other matches by clicking the Find Next icon several times. Note which occurrences of SQL Server match or don’t match.
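As a rough illustration of the steps above, the following Python sketch applies the same negative lookbehind to a hypothetical two-sentence sample (again standing in for Databases.txt) and records where each match starts.

```python
import re

# Hypothetical sample: one occurrence at the start, one preceded by "like ".
sample = "SQL Server is a database. Products like SQL Server store data."

pattern = re.compile(r"(?<!like )SQL Server")

# Start offsets of the occurrences that are NOT preceded by "like ".
starts = [m.start() for m in pattern.finditer(sample)]
print(starts)
```

The occurrence at the very start of the text matches because nothing precedes it, so the negative lookbehind cannot be violated there.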
How to Match Positions
By combining lookahead and lookbehind, it is possible to match positions between characters. For example, suppose that you wanted to match a position immediately before the Andrew of the following sample text:
This is Andrews book.
You could state the problem definition as follows:
Match a position that is preceded by the character sequence is (followed by a space character) and is followed by the character sequence Andrew.
You could match that position using the following pattern:
(?<=is )(?=Andrew)
Try It Out Matching a Position
1. Open RegexBuddy On the Match tab, type the regular expression pattern (?<=is )(?=Andrew).
If you used RegexBuddy for the replace example earlier in this chapter, delete the replacement text on the Replace tab.
2. On the Test tab, enter the sample text This is Andrews book.
3. Click the Find First icon, and inspect the information in the lower pane of the Test tab, as shown in Figure 8-14. On-screen, you can see the cursor blinking at the position immediately before the initial A of Andrews.
Figure 8-14
How It Works
The regular expression engine starts at the beginning of the document and tests each position to see whether both the lookbehind and lookahead constraints are satisfied. In the test text, only the position immediately before the initial A of Andrews satisfies both constraints. It is, therefore, the only position that matches.
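A zero-width match of this kind can be inspected directly in Python. The sketch below uses the sample text from the exercise; the vertical bar used to mark the matched position is just an illustrative choice.

```python
import re

sample = "This is Andrews book."
position = re.compile(r"(?<=is )(?=Andrew)")

# Collect every position where both constraints hold.
offsets = [(m.start(), m.end()) for m in position.finditer(sample)]

# Make the matched position visible by inserting a marker there.
marked = position.sub("|", sample)

print(offsets, marked)
```

The match has the same start and end offset because it consumes no characters. Note that the position after the is inside This also satisfies the lookbehind, but it fails the lookahead, so it is not reported.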
Adding Commas to Large Numbers
One of the useful ways to apply a combination of lookbehind and lookahead is adding commas to large numbers.
Assume that the sales for the fictional Star Training Company are $1,234,567. The data would likely be stored as an integer without any commas. However, for readability, commas are usual in many situations where financial or other numerical data is presented.
The process of adding commas to a large numeric value is essentially to match the position between the appropriate numeric digits and replace that position by a comma.
In some European languages, the thousands separator, which is a comma in English, is a period character. Such periods can be added to a numeric value by slightly modifying the technique presented below.
First, let’s look at a numeric value of 1234 and how you can add a comma in the appropriate place. You want to insert the comma at the position between the 1 and the 2. The reason to insert a comma in that position is that there are three numeric digits between the desired position and the end of the string.
Try It Out Adding a Comma Separator to a Four-Digit Number
1. Open RegexBuddy. On the Replace tab, enter the pattern (?<=\d)(?=\d\d\d) in the upper pane and a single comma character in the lower pane.
2. On the Test pane, click the Find First icon. Confirm that there is a match, as described in the lower pane on the Test tab.
3. Click the Replace All icon, and check the replacement text shown in the lower pane on the Test tab (see Figure 8-15). The replacement text is 1,234, which is what you want. The regular expression pattern works for four-digit numbers.
Figure 8-15
4. Edit the test text in the upper pane of the Test tab to read 1234567.
5. Click the Replace All icon, and inspect the replacement text in the lower pane of the Test tab. The replacement text is 1,2,3,4,567, which is not what you want. All the positions that have at least three numeric digits to the right have had a comma inserted, as shown in Figure 8-16.
6. Edit the pattern to (?<=\d)(?=(\d\d\d)+).
7. Click the Replace All icon, and inspect the replacement text in the lower pane of the Test tab. The undesired commas are still there.
Figure 8-16
8. Edit the pattern to (?<=\d)(?=(\d\d\d)+$).
9. Click the Replace All icon, and inspect the replacement text in the lower pane of the Test tab (see Figure 8-17). This is 1,234,567, which is what you want.
10. Depending on your data source, the pattern (?<=\d)(?=(\d\d\d)+$) may not work. Imagine that a single character (for example, a period character) follows the last digit of the number to which you wish to add commas. Edit the test text to read Monthly sales figures are 1234567.
11. Edit the regular expression on the Replace tab to read (?<=\d)(?=(\d\d\d)+\W).
Figure 8-17
How It Works
The pattern (?<=\d)(?=\d\d\d) looks for a position that follows a single numeric digit and precedes three numeric digits. In the sample text 1234, there is only one position that satisfies both the lookbehind and lookahead constraints: the position after the numeric digit 1.
When the test text is changed to 1234567, the pattern (?<=\d)(?=\d\d\d) matches several times. For example, the position following the numeric digit 2 is preceded by a numeric digit and is followed by three numeric digits. That position therefore satisfies both the lookbehind and lookahead constraints.
You need to group the numeric digits into groups of three to attempt to get rid of the undesired comma replacements. The pattern (?<=\d)(?=(\d\d\d)+) groups the numeric digits in the lookahead into threes but fails, as you saw in Figure 8-16, to prevent the unwanted commas. At the position following the numeric digit 2, there is still a sequence of three digits following that position, so the position matches. A comma is therefore inserted (although that is not appropriate to formatting norms for numbers).
When the pattern is edited to (?<=\d)(?=(\d\d\d)+$), you get the results you want. The position following the numeric digit 2 now fails to satisfy the lookahead constraint. It is followed by five numeric digits, which cannot be divided into complete groups of three that run to the end of the string, so the pattern (\d\d\d)+$ does not match.
However, the position after the numeric digit 1 still matches. It is followed by six numeric digits, which matches the pattern (\d\d\d)+$. Similarly, the position after the numeric digit 4 is matched, because it is followed by three numeric digits, which matches the pattern (\d\d\d)+$. In both those positions that match, a comma is inserted.
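The progression of patterns in this exercise can be replayed with Python's re.sub. This is a sketch rather than a reproduction of the RegexBuddy session; the constant names and the insert_commas helper are hypothetical.

```python
import re

NAIVE = r"(?<=\d)(?=\d\d\d)"          # any position with three digits to the right
GROUPED = r"(?<=\d)(?=(\d\d\d)+)"     # groups of three, but not anchored
ANCHORED = r"(?<=\d)(?=(\d\d\d)+$)"   # groups of three running to the end of the string
NONWORD = r"(?<=\d)(?=(\d\d\d)+\W)"   # groups of three followed by a non-word character

def insert_commas(pattern, text):
    """Replace every position matched by pattern with a comma."""
    return re.sub(pattern, ",", text)

four_digits = insert_commas(NAIVE, "1234")
naive_seven = insert_commas(NAIVE, "1234567")
grouped_seven = insert_commas(GROUPED, "1234567")
anchored_seven = insert_commas(ANCHORED, "1234567")

sentence = "Monthly sales figures are 1234567."
anchored_sentence = insert_commas(ANCHORED, sentence)  # the period defeats the $ anchor
nonword_sentence = insert_commas(NONWORD, sentence)

print(four_digits, naive_seven, grouped_seven, anchored_seven)
print(anchored_sentence)
print(nonword_sentence)
```

The anchored pattern leaves the sentence unchanged because the digits are followed by a period rather than the end of the string, which is the situation the \W variant is meant to handle.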
Conversely, matching and manipulating undesired data may well corrupt parts of your data. Whether that data corruption leads to minor typos or more serious problems depends on your data, what its intended use is, and the extent and severity of the undesired changes you unintentionally make to it. Again, the undesired effects can impact adversely on customer satisfaction. So sensitivity and specificity are issues to take seriously.
In this chapter, you will learn the following:
❑ What sensitivity and specificity are
❑ How to work out how far you should go in investing time and effort in maximizing sensitivity and/or specificity
❑ How to use regular expression techniques to give an optimal balance of sensitivity and specificity
❑ How the detail of the data source can affect sensitivity and specificity
❑ How to gain a better balance of sensitivity and specificity in the Star Training Company example
What Are Sensitivity and Specificity?
Sensitivity is the capacity to match the pattern that you want to match. Specificity is the capacity to limit the character sequences selected by a pattern to those character sequences that you want to detect.
The definitions given may feel a little abstract, so the following examples are provided to develop a clearer understanding of the ideas of sensitivity and specificity.
Extreme Sensitivity, Awful Specificity
Suppose that you want to match the character sequence ABC. It is very easy to achieve 100 percent sensitivity using the following pattern:
.*
It selects sequences of zero or more characters of any kind (in most tools, any character except a newline).
A sample document, ABitOfEverything.txt, is shown here:
This is a random 58#Gooede garbled piece of 8983ju**nk but it is still selected
Sensitivity and specificity are terms derived from quantitative disciplines such as statistics and epidemiology. Broadly, sensitivity is a measure of the number of true hits you find divided by the total number of true hits you ought to find if you match all occurrences of the relevant character sequences, and specificity is the number of hits you find that are true hits divided by the total number of hits you find. The higher the sensitivity, the closer you are, in the context of regular expressions, to finding all true matches, and the higher the specificity, the closer you are to finding only true matches.
As you can see, there is a pretty diverse range of content, not all of which is useful. However, if you apply the regular expression pattern .* you achieve 100 percent sensitivity, because the only occurrence of the character sequence ABC is matched. However, you also select every other piece of text in the sample document, as you can see in Figure 9-1 in OpenOffice.org Writer.
Figure 9-1
I introduced this slightly silly example to make an important point. It is possible to create very sensitive regular expression patterns that achieve nothing useful. Of course, you are unlikely to use .* as a standalone pattern, but it is important to carefully consider the usefulness of the regular expression patterns you create when, typically, the issues will be significantly more subtle.
Useful regular expressions keep the 100 percent sensitivity (or something very close to 100 percent) of the .* pattern but combine it with a high level of specificity.
Email Addresses Example
Suppose that you have a large number of documents or an email mail file that you need to search for valid email addresses. The file EmailOrNotEmail.txt illustrates the kind of data that might be contained in the material you need to search. The content of EmailOrNotEmail.txt is shown here:
Figure 9-2
As the figure shows, all the valid email addresses (which are on lines 4, 5, 9, and 10) are selected. This gives you 100 percent sensitivity, at least on this test data set. In other words, you have selected every character sequence that represents a valid email address. But you have, on all the other lines, matched character sequences that are pretty obviously not email addresses. You need to find a more specific pattern to improve the specificity of matching.
Look a little more carefully at how an email address is structured. Broadly, an email address follows this structure:
of the email address. The following pattern matches, at a minimum, a single alphabetic character due to the \w+ component of the pattern:
So you could use a lookbehind to allow a match for a period character only when it has been preceded by at least one alphabetic character. This pattern would allow matching of a period character only when it is preceded by an alphabetic character:
\w*(?<=\w)\.?\w+
Try It Out Email Address
1. Open PowerGrep, and enter the pattern \w*(?<=\w)\.?\w+@.* in the Search text area.
2. Enter the folder name C:\BRegExp\Ch09 in the Folder text box. Amend, as appropriate, if you downloaded the sample files to a different directory.
3. Enter the filename EmailOrNotEmail.txt in the File Mask text box, and click the Search button.
4. Inspect the results in the Results area. Compare the matches shown in Figure 9-2 with the matches now shown in Figure 9-3, particularly noting the character sequences that no longer match.
Figure 9-3
This is an improvement. The pattern is more specific. You no longer match the undesired character sequences on lines 1, 2, 7, and 8. However, the character sequence on Line 3, John@somewhere.invalid, is not a valid email address.
You can remove that undesired match by making the hostname part of the email address more specific. How specific you want to be is a matter of judgment. You know that all hostnames will have a sequence of alphabetic characters, followed by a period character, followed by three (com, net, org, or biz) or four (info) alphabetic characters. For the purposes of this example, we won’t consider hostnames like example.co.uk. The following pattern would be an appropriate pattern to match hostnames that correspond to the structure just described:
\w+\.\w{3,4}
The \w+ will match even single character domain names (which are allowed with com, net, and org domains). The \. metacharacter matches a single period character, and the \w{3,4} component matches either three or four alphabetic characters.
Combining that pattern with your earlier one gives you the following:
One way to approach this is to use a lookahead to specify that following the first match for an @ character, another @ character does not occur. If you continue to assume that only alphabetic characters are allowed in an email address, you can specify that you look ahead from the first @ character matched to the first match for a character that is not an alphabetic character or a period character.
You can do that using the following pattern:
\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}
7. Edit the pattern in the Search text area to be \w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}, and click the Search button.
8. Inspect the results. Figure 9-4 shows the appearance.
9. Modify the pattern in the Search area to be ^\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}$, and click the Search button.
10. Inspect the results. Figure 9-5 shows the appearance.
Figure 9-5
Happily, you have now succeeded in avoiding matching the undesired matches on lines 3 and 6. At least on this simple test data, you have achieved 100 percent sensitivity and 100 percent specificity.
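To see the final pattern at work outside PowerGrep, here is a minimal Python sketch. The five sample lines are hypothetical and are not the content of EmailOrNotEmail.txt. The sketch assumes multiline mode so that ^ and $ apply to each line, and it ends every line with a newline because the lookahead requires a non-word character after the hostname.

```python
import re

EMAIL = re.compile(
    r"^\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}$",
    re.MULTILINE,
)

# Hypothetical test lines, not the book's sample file.
lines = [
    "john.smith@example.com",
    "John@somewhere.invalid",   # top-level part is longer than four characters
    "not an email",
    "jane@example.org",
    "a@b@example.com",          # contains a second @ character
]
# Each line is newline-terminated so that \W in the lookahead has a character to match.
document = "\n".join(lines) + "\n"

addresses = EMAIL.findall(document)
print(addresses)
```

On data like this, the anchors and the \w{3,4} limit together reject John@somewhere.invalid, which the unanchored earlier pattern would still have matched in part.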
The terms sensitivity and specificity come from quantitative sciences, such as statistics and epidemiology. In those contexts, both the sensitivity and specificity are expressed numerically, often as percentages. So for the preceding example, you have a sensitivity of 100 percent because all true email addresses are detected using your first attempt at a regular expression pattern, and you initially have a specificity of 40 percent because 6 of the 10 matches are false matches (in the sense that they are not valid email addresses). By the end of the Try It Out example, the specificity has risen to 100 percent on the test data.
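The arithmetic behind those percentages is simple enough to sketch in Python. The two functions below follow the definitions used in this chapter (true hits found divided by true hits that exist, and true hits found divided by all hits found); the function and parameter names are hypothetical.

```python
def sensitivity(true_hits_found, true_hits_total):
    """Fraction of the true matches in the data that the pattern actually found."""
    return true_hits_found / true_hits_total

def specificity(true_hits_found, hits_found):
    """Fraction of the matches the pattern found that are true matches."""
    return true_hits_found / hits_found

# First attempt: all 4 valid addresses found, but 10 matches in total (6 false).
first_sensitivity = sensitivity(4, 4)
first_specificity = specificity(4, 10)

# Final pattern: the same 4 valid addresses found and no false matches.
final_specificity = specificity(4, 4)

print(first_sensitivity, first_specificity, final_specificity)
```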
Replacing Hyphens Example
This example looks at another problem that can occur if you are not careful in thinking through the meaning of a regular expression.
Assume that you have a collection of text documents that have to be converted into HTML/XHTML. This example focuses on the possible need for replacing a line of hyphens with the HTML/XHTML <hr> element to create a horizontal ruled line.
A simplified sample document, HyphenTest.txt, is used in this example:
A first attempt at expressing the problem definition might be as follows:
Replace any hyphens that occur with the character sequence <hr>
However, that is too imprecise For example, the third line would be replaced with the following:
<hr><hr><hr><hr>
A more precise statement of the problem definition would be as follows:
Replace any group of consecutive hyphens with the character sequence <hr>
Assume that you will omit the end tag of the hr element, because many Web browsers have problems if you use the empty element tag, <hr/>.
If you use the following regular expression pattern to express the idea of one or more hyphens, you can run into problems for two reasons:
-*
First, not all regular expression engines interpret that pattern correctly. The pattern -* means “Match zero or more hyphens,” which means that the occurrence of zero hyphens is a match. Therefore, the text Fred ought to match, which may not be what you expected. Why does Fred match? Because there are zero hyphens.
OpenOffice.org Writer implements the -* pattern as you might intuitively expect, because it matches only when at least one hyphen occurs, as shown in Figure 9-6, when it ought to match on each line because each line has zero hyphens at the beginning.
Figure 9-7
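The difference between matching zero hyphens and matching at least one can be checked in Python. The sketch below uses a hypothetical three-line document rather than HyphenTest.txt, and it assumes that -+ (one or more hyphens) is the quantifier you would reach for to express the more precise problem definition; that pattern is an illustration, not one quoted from the text above.

```python
import re

# Hypothetical stand-in for HyphenTest.txt.
document = "Fred\n----\nBarney\n"

# Replacing each hyphen individually gives one <hr> per hyphen.
per_hyphen = re.sub(r"-", "<hr>", "----")

# -* is satisfied by zero hyphens, so it matches (emptily) at the start of Fred.
zero_width = re.match(r"-*", "Fred")

# Assumed alternative: -+ requires at least one hyphen, so only the ruled line changes.
converted = re.sub(r"-+", "<hr>", document)

print(per_hyphen)
print(repr(zero_width.group(0)))
print(converted)
```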
The Sensitivity/Specificity Trade-Off
Sensitivity and specificity are always part of a trade-off. The amount of effort required to get 100 percent sensitivity and 100 percent specificity may not be practical in some situations, and some undefined “good” specificity may be enough. It’s a trade-off in that, in the end, only you can judge how much effort is appropriate for the task that you are using regular expressions to achieve.
How important are sensitivity and specificity? The answer is, “It depends.” There are many times when you will need high sensitivity, 100 percent sensitivity ideally, and at the same time you also need high specificity. At other times, one or the other may be less important. This section looks at some of the factors that influence how much importance it is relevant to place on sensitivity and specificity.
It depends to a significant extent on who the customer is. If you are using regular expressions to achieve something for your own use, you may not worry too much if you miss one or two matches. On the other hand, if you are conducting a replacement of every occurrence of a company name after a takeover, for example, it would be serious if sensitivity fell below 100 percent.
How Metacharacters Affect Sensitivity and Specificity
In general, the more metacharacters you use, the more specific a pattern becomes. The pattern cat matches that sequence of characters whether they refer to a feline mammal or form character sequences inside longer words.
Adding further metacharacters, such as the \b word boundary, makes the use of the character sequence cat in a pattern much more specific. The pattern \bcat\b will match only the word cat (singular).
When using specific patterns like that, you need to watch carefully for the possibility of reducing sensitivity. The pattern \bcat\b will match cat but won’t match cats, for example. If you are interested in finding all references in the document to feline mammals, the \bcat\b pattern may not be the best option. You may want to allow for the occurrence of the plural form, cats, and the possessive form, cat’s, too. The pattern \bcat’?s?’?\b would match cat, cats, cats’ (plural possessive), and cat’s (singular possessive) but would also match cat’, which is unlikely to be a desired match. If your data is unlikely to contain the character sequence cat’, the pattern \bcat’?s?’?\b may be sufficient. But if, for some reason, you want to match only cat, cats, cat’s, and cats’, some other, more specific pattern will be needed. One simple option is as follows:
(cat|cats|cat’s|cats’)
An alternative follows:
ca(t|ts|t’s|ts’)
Similar issues apply whatever the word or sequence of characters of interest.
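One way to compare the looser and stricter patterns is to test them against a list of candidate words. The Python sketch below is illustrative only: it uses re.fullmatch on whole tokens instead of the \b word boundaries, it substitutes a plain ASCII apostrophe for the typographic one, and the token list is hypothetical.

```python
import re

# Hypothetical candidate words; ASCII apostrophes are assumed for simplicity.
tokens = ["cat", "cats", "cat's", "cats'", "cat'", "category"]

LOOSE = r"cat'?s?'?"               # optional apostrophe, s, apostrophe
STRICT = r"(cat|cats|cat's|cats')"  # only the four desired forms
FACTORED = r"ca(t|ts|t's|ts')"      # the same four forms, factored

def accepted(pattern, words):
    """Return the words that the pattern matches in their entirety."""
    return [word for word in words if re.fullmatch(pattern, word)]

loose_hits = accepted(LOOSE, tokens)
strict_hits = accepted(STRICT, tokens)
factored_hits = accepted(FACTORED, tokens)

print(loose_hits)
print(strict_hits)
print(factored_hits)
```

Whole-token matching is used here because, with a searching function, an alternation such as cat|cats stops at the first alternative that succeeds, so the order of the alternatives would affect what is reported.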
Sensitivity, Specificity, and Positional Characters
The positional characters explored in Chapter 6 can be expected in many cases to affect both sensitivity and specificity.
In the following example, an initial version of the problem definition can be expressed as follows:
Match all occurrences of the sequence of characters t, h, and e case insensitively.
The pattern the will match twice in the following text:
Paris in the the spring
It will match once in the following text:
The spring has sprung
However, suppose you modify the problem definition to the following:
Match the position at the beginning of a string; then match the sequence of characters t, h, and e case insensitively.
The pattern ^the now has no match in the first sample text but still has a single match in the second. The effect of adding one or more positional metacharacters depends on the data the pattern is being matched against.
Sensitivity, Specificity, and Modes
When you specify that a regular expression is to be executed in a case-insensitive or case-sensitive mode, you affect the matches that will be returned. Continuing with the preceding example, the pattern ^the applied case sensitively has no match in either of the two test pieces of text. In the second sample text, the ^ metacharacter matches the position at the beginning of the string, but the lowercase t of the pattern does not match the uppercase T of the test text.
Similarly, the use of the period metacharacter (which matches a large range of characters) can be switched to match or not match a newline character.
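The counts quoted for the two sample texts, and the effect of the modes, can be confirmed with Python's re module. In this sketch the re.IGNORECASE and re.DOTALL flags stand in for the case-insensitive mode and the newline-matching mode of whatever tool you use; the helper name count is hypothetical.

```python
import re

first = "Paris in the the spring"
second = "The spring has sprung"

def count(pattern, text, flags=0):
    """Number of non-overlapping matches of pattern in text."""
    return len(re.findall(pattern, text, flags))

# Case-insensitive matching of "the", without and with the ^ positional metacharacter.
plain = (count(r"the", first, re.IGNORECASE), count(r"the", second, re.IGNORECASE))
anchored = (count(r"^the", first, re.IGNORECASE), count(r"^the", second, re.IGNORECASE))

# The same anchored pattern applied case sensitively.
anchored_case_sensitive = (count(r"^the", first), count(r"^the", second))

# The period metacharacter matches a newline only when the mode is switched on.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

print(plain, anchored, anchored_case_sensitive, bool(dot_default), bool(dot_all))
```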
Sensitivity, Specificity, and Lookahead and Lookbehind
When you add lookahead or lookbehind to an existing regular expression, you may have no effect on sensitivity, or you may adversely impact it. Equally, you may improve specificity or, less likely, it may stay the same.
If a lookbehind is carefully crafted, it won’t reduce sensitivity. However, if you make an error in the pattern inside the lookbehind, you will fail to match when you intended to match, reducing sensitivity. Suppose that you wanted to find information about Anne Smith. The following pattern would match when the spelling of Anne is correct, and it is followed by exactly one space character:
Similarly, lookahead can reduce sensitivity. For example, suppose that you want to match all occurrences of the character sequence John. The following pattern would match a word boundary, then the desired character sequence John, and then check if the following character is a space character:
\bJohn(?= )
However, if the test text is as follows, the lookahead is too specific and causes what is likely to be a desired match to fail:
I went with John, and Mary on a trip
Modifying the lookahead to (?=\b) or (?=\W) would prevent the problem caused by the occurrence of an unanticipated comma.
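The three lookaheads can be compared against the sample sentence in a few lines of Python. This is a small sketch; the extra test string that ends with John is a hypothetical addition to show one difference between the two suggested fixes.

```python
import re

sentence = "I went with John, and Mary on a trip"

# The lookahead insists on a space, so the comma after John defeats it.
space_only = re.search(r"\bJohn(?= )", sentence)

# Either relaxed lookahead accepts the comma.
boundary = re.search(r"\bJohn(?=\b)", sentence)
non_word = re.search(r"\bJohn(?=\W)", sentence)

# (?=\W) needs an actual character to follow; (?=\b) also succeeds at the end of the text.
non_word_at_end = re.search(r"\bJohn(?=\W)", "I went with John")
boundary_at_end = re.search(r"\bJohn(?=\b)", "I went with John")

print(bool(space_only), bool(boundary), bool(non_word))
print(bool(non_word_at_end), bool(boundary_at_end))
```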
How Much Should the Regular Expressions Do?
expressions as a developer, you will typically be using regular expressions inside code written in Java, JavaScript, VB.NET, and so on, or you may be applying regular expressions to data retrieved from a relational database. So how much should you expect the regular expressions to do, and how much can you safely assume that your other code or the error checking in a database already does?
For example, suppose you have a collection of HTML documents that include IP addresses, and your task is to amend the style that the IP addresses are displayed in. Suppose that initially, IP addresses are nested inside the start and end tags for HTML b elements, as in the following:
<b>1.12.123.234</b>
What pattern should you use to find such IP addresses? Should you just assume that the data you receive will be correctly formed (including having no values of 256 or more), or should you include a more complex pattern so that the regular expression will match only correctly formed IP addresses?
If you assume that the IP addresses are already correctly formed or are checked by some other part of your code, you could use a fairly simple pattern such as the following:
Knowing the Data, Sensitivity, and Specificity
One of the key issues that affect how well you achieve sensitivity and specificity is how well you understand the data to which you are applying regular expressions. Of course, your understanding of the regular expression syntax and techniques supported by your chosen language or tool is important, too.
But if you don’t really understand the data you are working with, even a regular expression with correct syntax can turn up unexpected results, by lowering either sensitivity or specificity.
Abbreviations
Abbreviations can significantly lower the sensitivity of a regular expression. For example, titles such as Dr (with no period character) and Dr. (with a period character) are frequently used as abbreviations for Doctor. In some circumstances, you may be confident that only one form is used in the data source. If all three forms occur in the data, a pattern like the following will be necessary to avoid missing some desired matches:
(Doctor|Dr\.|Dr)
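A minimal Python sketch of how the three forms are matched (Python and the sample names are assumptions for illustration). It also shows why the period in the abbreviation should be escaped: an unescaped period matches any character, so the match can swallow the character after Dr.

```python
import re

# Unescaped period: "." matches any character after "Dr".
unescaped = re.compile(r"(Doctor|Dr.|Dr)")
# Escaped period: "\." matches only a literal period.
escaped = re.compile(r"(Doctor|Dr\.|Dr)")

titles = ["Doctor Smith", "Dr. Smith", "Dr Smith"]

# What each pattern captures at the start of every title.
captured_escaped = [escaped.match(t).group(1) for t in titles]
captured_unescaped = [unescaped.match(t).group(1) for t in titles]

print(captured_escaped)
print(captured_unescaped)
```

With the unescaped form, the title "Dr Smith" is captured together with the following space, which matters as soon as the match is replaced or extracted.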
Similar issues arise when handling data that includes information about qualifications. For example, if a Doctor of Philosophy degree is of interest, it will often be written as PhD (no space character or period character), Ph.D. (two period characters), or Ph. D. (one space character, two period characters).
To match the options just mentioned, a pattern such as the following would be satisfactory:
Ph\.? ?D\.?
It includes the \. metacharacter twice with a ? quantifier, which matches each of the optional period character(s) that can occur in some of the options. Depending on where the degree was obtained, the form D.Phil. (two period characters) with option DPhil (no period characters) can also occur. To allow for these additional forms, a pattern such as the following would be needed:
(Ph\.? ?D\.?|D\.?Phil\.?)
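As a quick check, the following Python sketch (Python is an assumption here) runs a pattern of this shape against the five forms mentioned above. The optional space after Ph. is included so that the spaced form is still matched.

```python
import re

# Optional period, optional space, then D and an optional period;
# the second alternative covers D.Phil. and DPhil.
DEGREE = re.compile(r"(Ph\.? ?D\.?|D\.?Phil\.?)")

forms = ["PhD", "Ph.D.", "Ph. D.", "D.Phil.", "DPhil"]

# True where the whole string is accepted by the pattern.
matched = [DEGREE.fullmatch(form) is not None for form in forms]
print(dict(zip(forms, matched)))
```

Because every period is optional, the pattern also accepts mixed forms such as PhD. that were not listed; whether that is acceptable depends on the data.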
Characters from Other Languages
The focus of this book is the use of regular expressions with English, including U.S. English and British English. However, with the increasing globalization of trade, the inclusion of words and characters from other languages commonly occurs in documents that are, for the most part, written in English.
In Canada, many official documents are in French. Therefore, many characters with accents will be routinely encountered.
In documents written in English, there can be differences in how words are written. For example, the test text
“Nostalgia is not what it used to be.” That is my favorite cliche.
might equally have been written as follows:
“Nostalgia is not what it used to be.” That is my favorite cliché.
The second version includes the acute character é just before the period, which concludes the sentence.
To match both forms, you would need to use a pattern such as the following:
clich(e|é)
Foreign characters introduce other issues when they occur in HTML. The sample document, EAcute.html, uses the notation &eacute; instead of the literal character:
<h2>"Nostalgia is not what it used to be." That is my favorite clich&eacute;.</h2>
Yet as you can see in Figure 9-8, the correct character is displayed on the Web page.
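The following Python sketch illustrates the issue, assuming the notation in question is the HTML character entity &eacute; and that the markup is available as a string (Python and the variable names are assumptions). The literal pattern cannot see the entity, so one option is to decode entities before matching.

```python
import html
import re

# \u00e9 is the precomposed character é.
CLICHE = re.compile("clich(e|\u00e9)")

plain = "That is my favorite cliche."
accented = "That is my favorite clich\u00e9."
markup = "<h2>That is my favorite clich&eacute;.</h2>"

# The raw markup contains "clich&eacute;", which the pattern does not match.
raw_match = CLICHE.search(markup)

# html.unescape converts &eacute; to the literal character before matching.
decoded_match = CLICHE.search(html.unescape(markup))

print(bool(CLICHE.search(plain)), bool(CLICHE.search(accented)))
print(raw_match, decoded_match.group(0))
```

An alternative is to add the entity as a third option in the pattern; decoding first keeps the pattern shorter but changes the text being searched.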
Sa?ura(v|bh)
Similar considerations apply to other foreign names. The Russian name for Peter, sometimes transliterated as Pyotr, may also be found spelled as Petr or Pëtr, or even translated as Peter, and may need to be matched in all instances. To match the transliterated forms of the name (the translated form Peter needs a further alternative), you might use a pattern like this:
P(yo|e|ë)tr
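A short Python check of this pattern against the four spellings (Python is an assumption; NAME_WITH_PETER is a hypothetical extension, not a pattern from the text). It shows that the translated form Peter is not covered unless the pattern is widened.

```python
import re

# \u00eb is the character ë.
NAME = re.compile("P(yo|e|\u00eb)tr")

spellings = ["Pyotr", "Petr", "P\u00ebtr", "Peter"]
results = {s: NAME.fullmatch(s) is not None for s in spellings}
print(results)

# Hypothetical extension: an optional "e" before the final "r" admits Peter.
NAME_WITH_PETER = re.compile("P(yo|e|\u00eb)te?r")
print(NAME_WITH_PETER.fullmatch("Peter") is not None)
```

The wider pattern raises sensitivity but also accepts unlisted spellings such as Pyoter, so it trades away a little specificity.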
Some European surnames have variant spellings too. For example, the surname Van Nistelrooy (with an intermediate space character) can also be spelled Van Nistelrooij or VanNistelrooy (with no intermediate space character). So a pattern such as the following would be needed to match these three spelling variants:
Van *Nistelroo(ij|y)
Of course, because some such surnames may sometimes be spelled with a lowercase v in van, the following pattern might be more sensitive in some situations:
[vV]an *Nistelroo(ij|y)
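The two surname patterns can be compared side by side in a small Python sketch (Python and the variable names are assumptions). The only variant that separates them is the one with a lowercase v.

```python
import re

UPPER_ONLY = re.compile(r"Van *Nistelroo(ij|y)")
EITHER_CASE = re.compile(r"[vV]an *Nistelroo(ij|y)")

variants = ["Van Nistelrooy", "Van Nistelrooij", "VanNistelrooy", "van Nistelrooy"]

# For each variant, record which of the two patterns accepts it.
accepted = {v: (UPPER_ONLY.fullmatch(v) is not None,
                EITHER_CASE.fullmatch(v) is not None)
            for v in variants}
print(accepted)
```

Note that the * quantifier after the space also accepts two or more spaces; a ? quantifier would restrict the match to zero or one.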
Sensitivity and How to Achieve It
To achieve maximum sensitivity, you must be aware of all the variant character sequences that can be used to express the character sequence that you want to match.
Each time you add some component to a pattern that makes it more specific, you need to carefully consider whether, given the data you are working with, it might also cause some desired matches to fail.
Specificity and How to Maximize It
Conceptually, the way to maximize specificity is to make the regular expression as specific as possible. There are many techniques to cut out unwanted matches, several of which have been discussed earlier in this chapter.
When attempting to maximize specificity, it is important to give careful consideration to situations that you don’t want to match and to construct a pattern that excludes those unwanted character sequences from matching. Achieving high specificity involves having an understanding of regular expression syntax and the effects of the techniques available to you, and understanding how those techniques affect the data you are working with.
Revisiting the Star Training Company Example
In Chapter 1, you looked at an example that posed a challenge to a new recruit to the fictional Star Training Company. Having learned a range of techniques in Chapters 2 through 7, you are now in a much better position to avoid many of the pitfalls that occurred when a simple find and replace was attempted in Chapter 1.
For convenience, the sample text, StarOriginal.txt, is reproduced here:
Star Training Company
Starting from May 1st Star Training Company is offering a startling special offer to our regular customers - a 20% discount when 4 or more staff attend a single Star Training Company course.
In addition, each quarter our star customer will receive a voucher for a free holiday away from the pressures of the office. Staring at a computer screen all day might be replaced by starfish and swimming in the Seychelles.
Once this offer has started and you hear about other Star Training customers enjoying their free holiday you might feel left out. Don’t be left on the outside staring in. Start right now building your points to allow you to start out on your very own Star Training holiday.
Reach for the star. Training is valuable in its own right but the possibility of a free holiday adds a startling new dimension to the benefits of Star Training training.
Don’t stare at that computer screen any longer. Start now with Star. Training is crucial to your company’s wellbeing. Think Star.
The problem definition can be expressed as follows:
Match all occurrences of the character sequence S, t, a, and r when that character sequence refers to the Star Training Company. Replace each occurrence of the preceding character sequence with the character sequence M, o, o, and n.
The objective is to replace all references to the fictional Star Training Company with corresponding references to the equally fictional Moon Training Company.
When faced with a task like this in real life, it can be helpful to view a few sample documents in a text editor or word processor with search facilities. That allows you to enter a pattern to look for occurrences of character sequences that might be relevant. In this case, you can use the simple literal pattern star (all lowercase) and use regular expressions matching in a case-insensitive way.
Try It Out Replacing Star with Moon
1. Open the file StarOriginal.txt in OpenOffice.org Writer.
2. Open the Find & Replace dialog box using Ctrl+F.
3. Check the Regular Expressions check box, but leave the Match Case check box unchecked, because you want to find all occurrences of the specified pattern in a case-insensitive way.
4. Type the pattern star in the Search For text box, and click the Find All button.
5. Inspect the matches shown in Figure 9-9, paying careful attention to any occurrences of the character sequence star that refer to the Star Training Company.
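The same exploration can be sketched outside a word processor. The Python snippet below is an illustrative assumption, not part of the exercise: it searches a shortened excerpt of the sample text for star, with re.IGNORECASE playing the role of leaving Match Case unchecked, and lists the whole word around each hit.

```python
import re

# A shortened excerpt of the sample text (the closing period is assumed).
excerpt = (
    "Star Training Company\n"
    "Starting from May 1st Star Training Company is offering a startling "
    "special offer to our regular customers."
)

# \w* on either side widens each hit to the whole word that contains "star".
words_with_star = re.findall(r"\w*star\w*", excerpt, flags=re.IGNORECASE)
print(words_with_star)
```

Listing the surrounding words makes it easy to sort the hits into those that refer to the company and those that do not.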
Let’s take time to list the character sequences that you want to match. You want to match star in the following:
Star Training
Star
You want to avoid matching star in the following character sequences:
Starting
startling
star customer
Staring
starfish
started
Start right
start out
star
startling
stare
Start now
I don’t routinely take time to lay out desired matches in a list and undesired matches in a second list. But particularly when you need to get things as close to 100 percent sensitivity and 100 percent specificity as possible, it makes a lot of sense to make lists like this.
Splitting character sequences into desired matches and undesired matches can be really helpful in working out how sensitive and specific any pattern will prove to be.
If you decide that a lookahead is the way to proceed (as it probably is), you could try to match all desired matches using the following pattern:
Star(?= Training)
However, if you look at the list of desired matches, you can see immediately that the preceding pattern will fail in a sentence such as Think Star. That’s one of the occurrences of Star followed by a period character.
The following pattern, which offers alternation of two lookaheads, fits all the desired matches that you have seen in the sample text:
Star((?= Training)|(?=\.))
Thus, as judged by the sample text, you have 100 percent sensitivity. Figure 9-10 shows the preceding pattern being tested against the character sequence Star.
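One way to turn the two lists into numbers is sketched below in Python (an assumption, as are the variable names). The contexts are short fragments of the sample text, with a trailing period assumed where the sentence ends on the word. Sensitivity is the fraction of desired contexts that match; specificity is the fraction of undesired contexts that do not.

```python
import re

PATTERN = re.compile(r"Star((?= Training)|(?=\.))")

# Fragments in which star refers to the company.
desired = ["Star Training Company", "Star Training customers", "Think Star."]
# Fragments in which star should be left alone.
undesired = ["Starting from May", "a startling special offer", "our star customer",
             "Staring at a computer", "starfish", "Start right now", "Don't stare"]

true_positives = sum(1 for s in desired if PATTERN.search(s))
false_positives = sum(1 for s in undesired if PATTERN.search(s))

sensitivity = true_positives / len(desired)
specificity = 1 - false_positives / len(undesired)
print(sensitivity, specificity)
```

The figures only describe the fragments listed; adding new fragments as they are found in real data keeps the estimate honest.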
It is always wise to consider that the test data you have looked at doesn’t hold all the likely or possible character sequences that you need to think about. One of the exercises in this chapter asks you to modify the preceding pattern to allow for other possible occurrences that might be relevant to the uses of Star that are of interest.
The patterns that you want to match are, in general, different from those that you want not to match. So it is generally straightforward to be sure that the pattern does not match any of the undesired character sequences, with one exception: You want to match the five-character sequence of characters Star. (with an initial uppercase S) but not match the five-character sequence of characters star. (with an initial lowercase s).
If you use the preceding pattern in matching that is case sensitive, there is no problem. The undesired character sequence star. does not match. However, if the matching is case insensitive, the undesired character sequence star. will match, lowering the specificity of the chosen pattern.
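The effect of case sensitivity can be seen in a few lines of Python (Python is an assumption; the sentence comes from the sample text with its period assumed). The same pattern is applied twice, once with the default case-sensitive matching and once with re.IGNORECASE.

```python
import re

pattern = r"Star((?= Training)|(?=\.))"
sentence = "Reach for the star. Training is valuable in its own right."

# Default matching is case sensitive, so lowercase "star." is ignored.
case_sensitive = re.search(pattern, sentence)

# With re.IGNORECASE, "star" followed by a period satisfies the second lookahead.
case_insensitive = re.search(pattern, sentence, flags=re.IGNORECASE)

print(case_sensitive, case_insensitive.group(0))
```

In a replace operation, the case-insensitive match would turn this sentence into a reference to the company, which is exactly the unwanted change.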
Figure 9-10
Exercises
Test your understanding of the material in this chapter using the following exercises:
1. Modify the pattern ^\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}$, which was developed earlier in the chapter for matching email addresses, so that it matches only hostnames in the com, net, and org domains.
2. Modify the pattern in the Star Training Company example to match the character sequence star so that it also matches in data like the following:
What do you think of Star?
The best training company is Star!
a project over a period of time, you come face to face with a third truism: Regular expressions are hard to maintain. The purpose of this chapter is to help you take steps to minimize the effects of these three truisms.
A basic consideration that is important not to forget is that regular expressions never occur in isolation. They are always used to work on data, whether simple or extensive, and are used in the context of a tool or a programming language. In addition, the developer has a specific purpose, sometimes a complex or subtle business purpose, for the regular expressions that he writes.
The problems that arise when using regular expressions can be due simply to being unable to write patterns that express the matching characteristics that you want. Ideally, as you work through this book, that problem will become less and less common.
In this chapter, you will learn the following:
❑ How to document regular expressions
❑ How to explore the data you are working with
❑ How to create test cases for regular expressions
❑ How to debug regular expressions
Documenting Regular Expressions
Any programming project of significant size can benefit from good documentation. It makes the purpose of many aspects of the project clear and can assist in further development of the code at a future date. Given the compact, cryptic nature of regular expression syntax, it makes good sense to seriously consider documenting your approach to the creation of a particular regular expression and what you expect the parts of the regular expression to do.
In many circumstances, your use of regular expressions may be on a very small scale, where it is tempting to avoid any documentation. Sometimes, no documentation is the only sensible approach. For example, in some situations, such as using regular expressions in Microsoft Word or OpenOffice.org Writer, documenting a regular expression is overkill. You want to find or replace a character sequence there and then in a single document. Formal documentation is unnecessary.
However, in more significant tasks or projects, creating documentation can be a useful discipline, serving to make explicit aspects of the task that you might otherwise be tempted to allow to remain ambiguous.
Document the Problem Definition
The problem definition is a key component in recording your thought process while designing a regular expression pattern. As mentioned in earlier chapters, you may well not get the problem definition sufficiently precise the first time round. If the problem is a complex one, it may be worth recording a problem definition that isn’t what you want so that if you come back to the code in a few months’ time, you will be reminded of the work you needed to do while designing the regular expression pattern.
A first attempt at a problem definition might be very nonspecific or expressed in a way that doesn’t immediately allow you to define a pattern that matches what you intend.
A first attempt at a problem definition to solve the Star Training Company problem in Chapter 1 might be as follows:
Replace Star with Moon
A brute-force search and replace can cause a substantial number of inappropriate changes. If you made such inappropriate changes across a large number of documents in the absence of recent backups, it could take a considerable amount of time to rectify the problems that poor use of a literal regular expression caused.
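The kind of damage a brute-force replacement causes is easy to reproduce. The Python sketch below (Python and the variable names are assumptions) applies a literal replacement, first case sensitively and then case insensitively, to one sentence from the sample text.

```python
import re

text = ("Starting from May 1st Star Training Company is offering "
        "a startling special offer.")

# Literal, case-sensitive replacement also rewrites the start of "Starting".
literal = text.replace("Star", "Moon")

# A case-insensitive replacement damages "startling" as well.
insensitive = re.sub(r"star", "Moon", text, flags=re.IGNORECASE)

print(literal)
print(insensitive)
```

Only one of the changes in each result is the intended one; the others are the inappropriate changes that a more precise problem definition is meant to prevent.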
Refining a problem definition depends on an understanding of the data. You might have text like the following:
Star Training Company
I highly recommend Star
Why not accept this special offer from Star?
recent course with Star - which was great!
You can see the different ways in which desired matches can be expressed. You must understand the data to be able to construct a pattern that will match (and then replace) all of these.
On the other hand, there may be similar-looking text that you want to leave alone:
The trainer was good - a real star!
The training was excellent - star training
Star performer among the trainers
Again, if you don’t take time to understand undesired possible matches, you may end up making inappropriate changes to the documents you are working with.
Add Comments to Your Code
Adding comments to your code is a basic task. Try to make comments as meaningful as possible, and try to make them express what the pattern you create is expected to do.
Comments such as the following are pretty useless, particularly when you come back to the code to find out why it isn’t doing that:
// This will replace Star with Moon
Make the comments meaningful, such as in the following example:
// This matches Star case sensitively, avoiding words like start and star
// It matches when Star is followed by a space character and the character sequence Training
// or followed by a period (full stop)
// or followed by a question mark
Comments like these give a much clearer idea of what was intended and should correspond pretty closely to components of the regular expression pattern.
If you make a false start of some kind in attempting to solve a problem, it can also be useful to include a comment about what doesn’t work and why. While it can be embarrassing to admit a mistake in your thinking, being upfront about the problem is better than wasting time a few weeks later by going down the same blind alley.
Making Use of Extended Mode
When I write code in JavaScript, Java, Visual Basic .NET, and various other programming languages, I space the components of the code out and indent nested components so that the structure of the code is easily discerned. I would never consider jamming sizeable chunks of code onto a single line if it was avoidable, because that is much harder to read. Making code readable and adding comments where they are most relevant make the coding and maintenance experience a much smoother one.
One of the key advantages of comments on ordinary code is that you can place the comments right next to the component of the code to which the comments relate. It’s far less useful to have comments that are a screen or two away from the code to which they refer. A similar problem can occur in many regular expression implementations, where you simply cannot put the comments adjacent to the code that they refer to.
Extended mode is available in languages such as Perl, Java, and PHP. It allows you to include comments on the same line as the pattern component that they describe. Keeping a piece of code right next to its description helps cut down on occurrences of misunderstanding code.
Extended mode in Perl is indicated by the x modifier following the second forward slash of the m// operator.
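For comparison, Python offers a similar facility through the re.VERBOSE flag, where whitespace in the pattern is ignored and # starts a comment. The sketch below is an illustration only (Python and the name STAR are assumptions); it places the comments from the earlier Star example next to the components they describe.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and treats "#" as a comment
# marker, so each component can carry its own explanation.
STAR = re.compile(r"""
    Star              # the literal sequence S, t, a, r (case sensitive)
    (?=               # followed by, without consuming it:
        [ ]Training   #   a space character and the word Training
      | \.            #   or a period (full stop)
      | \?            #   or a question mark
    )
""", re.VERBOSE)

for sample in ["Star Training Company", "Think Star.", "Starting from May"]:
    print(sample, bool(STAR.search(sample)))
```

The space is written as [ ] because a bare space would be ignored in this mode.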
To match input from two known users, you could use a simple program such as JimOrFred.pl:
#!/usr/bin/perl -w
use strict;
print "This program will say 'Hello' to Jim or Fred.\n";
my $myPattern = "^(Jim|Fred)\$";
# The pattern matches only 'Jim' or 'Fred'. Nothing else is allowed.
print "Enter your first name here: ";