HandBooks Professional Java-C-Scrip-SQL part 60 doc


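boost::regex reg("(new)|(delete)");
boost::smatch m;
std::string s=
    "Calls to new must be followed by delete "
    "Calling simply new results in a leak!";

if (boost::regex_search(s,m,reg)) {
    // Did new match?
    if (m[1].matched)
        std::cout << "The expression (new) matched!\n";
    // Did delete match?
    if (m[2].matched)
        std::cout << "The expression (delete) matched!\n";
}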

The preceding program searches the input string for new or delete, and reports which one it finds first. By passing an object of type smatch to regex_search, we gain access to the details of how the algorithm succeeded. In our expression, there are two subexpressions, and we can thus get to the subexpression for new by the index 1 of match_results. We then hold an instance of sub_match, which contains a Boolean member, matched, that tells us whether the subexpression participated in the match. So, given the preceding input, running this code would output "The expression (new) matched!".

Now, you still have some more work to do. You need to continue applying the regular expression to the remainder of the input, and to do that, you use another overload of regex_search, which accepts two iterators denoting the character sequence to search. Because std::string is a container, it provides iterators. Now, for each match, you must update the iterator denoting the beginning of the range to refer to the end of the previous match. Finally, add two variables to hold the counts for new and delete. Here's the complete program:

#include <iostream>
#include <string>
#include "boost/regex.hpp"

int main() {
    // Are there equally many occurrences of
    // "new" and "delete"?
    boost::regex reg("(new)|(delete)");
    boost::smatch m;
    std::string s=
        "Calls to new must be followed by delete "
        "Calling simply new results in a leak!";
    int new_counter=0;
    int delete_counter=0;
    std::string::const_iterator it=s.begin();
    std::string::const_iterator end=s.end();

    while (boost::regex_search(it,end,m,reg)) {
        // New or delete?
        m[1].matched ? ++new_counter : ++delete_counter;
        // Continue searching right after the current match
        it=m[0].second;
    }

    if (new_counter!=delete_counter)
        std::cout << "Leak detected!\n";
    else
        std::cout << "Seems ok \n";
}

Note that the program always sets the iterator it to m[0].second. match_results[0] returns a reference to the submatch that matched the whole regular expression, so we can be sure that the end of that match is always the correct location to start the next run of regex_search. Running this program outputs "Leak detected!", because there are two occurrences of new, and only one of delete. Of course, this is only a crude check: one variable could be deleted twice, there could be calls to new[] and delete[], and so forth.

By now, you should have a good understanding of how subexpression grouping works. It's time to move on to the final algorithm in Boost.Regex, one that is used to perform substitutions.

Replacing

The third in the family of Regex algorithms is regex_replace. As the name implies, it's used to perform text substitutions. It searches through the input data, finding all matches to the regular expression. For each match of the expression, the algorithm calls match_results::format and outputs the result to an output iterator that is passed to the function.

In the introduction to this chapter, I gave you the example of changing the British spelling of colour to the U.S. spelling of color. Changing the spelling without using regular expressions is very tedious, and extremely error prone. The problem is that there might be different capitalization, and a lot of words that are affected, for example, colourize. To properly attack this problem, we need to split the regular expression into three subexpressions:

boost::regex reg("(Colo)(u)(r)",
    boost::regex::icase|boost::regex::perl);

We have isolated the villain, the letter u, in order to surgically remove it from any matches. Also note that this regex is case-insensitive, which we achieve by passing the flag boost::regex::icase to the constructor of regex. Note that you must also pass any other flags that you want to be in effect. A common user error when setting flags is to omit the ones that regex turns on by default, but that doesn't work; you must always apply all of the flags that should be set.

When calling regex_replace, we are expected to provide a format string as an argument. This format string determines how the substitution will work. In the format string, it's possible to refer to subexpression matches, and that's precisely what we need here. You want to keep the first matched subexpression, and the third, but let the second (u) silently disappear. The expression $N, where N is the index of a subexpression, expands to the match for that subexpression. So our format string becomes "$1$3", which means that the replacement text is the result of the first and the third subexpressions. By referring to the subexpression matches, we are able to retain any capitalization in the matched text, which would not be possible if we were to use a string literal as the replacement text. Here's a complete program that solves the problem:

#include <iostream>
#include <string>
#include "boost/regex.hpp"

int main() {
    boost::regex reg("(Colo)(u)(r)",
        boost::regex::icase|boost::regex::perl);

    std::string s="Colour, colours, color, colourize";
    s=boost::regex_replace(s,reg,"$1$3");
    std::cout << s;
}

The output of running this program is "Color, colors, color, colorize". regex_replace is enormously useful for applying substitutions like this.
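To see why the format string refers to subexpressions instead of spelling out the replacement, the following sketch contrasts "$1$3" with a plain literal. It is written against std::regex from the C++ standard library rather than Boost.Regex, which is an assumption made only to keep the example free of external dependencies; regex_replace and the icase flag carry the same names in both libraries. The helper names colour_regex, fix_spelling, and naive_fix are hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex in this sketch.
// ECMAScript is the standard library's default grammar; it is passed
// explicitly together with icase so that every intended flag is spelled out.
inline const std::regex& colour_regex() {
    static const std::regex reg("(Colo)(u)(r)",
        std::regex::ECMAScript | std::regex::icase);
    return reg;
}

// Keeps subexpressions 1 and 3, so the capitalization of the match survives.
std::string fix_spelling(const std::string& text) {
    return std::regex_replace(text, colour_regex(), "$1$3");
}

// Replaces every match with a fixed literal, so capitalization is lost.
std::string naive_fix(const std::string& text) {
    return std::regex_replace(text, colour_regex(), "color");
}
```

Given "Colour, colours, color, colourize", fix_spelling produces "Color, colors, color, colorize", while naive_fix turns the leading "Colour" into a lowercase "color".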


A Common User Misunderstanding

One of the most common questions that I see about Boost.Regex concerns the semantics of regex_match. It's easy to forget that all of the input to regex_match must match the regular expression. Thus, users often think that code like the following should yield true:

boost::regex reg("\\d*");
bool b=boost::regex_match("17 is prime",reg);

Rest assured that this call never results in a successful match. All of the input must be consumed for regex_match to return true! Almost all of the users asking why this doesn't work should use regex_search rather than regex_match:

boost::regex reg("\\d*");
bool b=boost::regex_search("17 is prime",reg);

This most definitely yields true. It is worth noting that it's possible to make regex_search behave like regex_match, using special buffer operators. \A matches the start of a buffer, and \Z matches the end of a buffer, so if you put \A first in your regular expression, and \Z last, you'll make regex_search behave like regex_match; that is, it must consume all input for a successful match. The following regular expression always requires that the input be exhausted, regardless of whether you are using regex_match or regex_search:

boost::regex reg("\\A\\d*\\Z");

Please understand that this does not imply that regex_match should not be used; on the contrary, its use should be a clear indication that the semantics we just talked about (that all of the input must be consumed) are in effect.
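The difference between the two algorithms can be checked with a pair of small helpers. This is a minimal sketch that assumes std::regex from the C++ standard library in place of Boost.Regex, purely to avoid an external dependency; regex_match and regex_search have the same names and the same whole-input versus anywhere semantics there. The helper names matches_whole and matches_somewhere are hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex in this sketch.

// True only if the pattern consumes the entire input.
bool matches_whole(const std::string& input, const std::string& pattern) {
    return std::regex_match(input, std::regex(pattern));
}

// True if the pattern matches some part of the input.
bool matches_somewhere(const std::string& input, const std::string& pattern) {
    return std::regex_search(input, std::regex(pattern));
}
```

With the pattern "\\d*", matches_whole fails for "17 is prime" but succeeds for "17", while matches_somewhere succeeds for "17 is prime". Because \d* can also match an empty sequence, the search variant succeeds on any input at all. Anchoring the pattern as "^\\d*$" (a stand-in chosen for this sketch, not the \A and \Z operators described above) makes matches_somewhere reject "17 is prime" as well.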

About Repeats and Greed

Another common source of confusion is the greediness of repeats. Some of the repeats, for example + and *, are greedy. This means that they will consume as much of the input as they possibly can. It's not uncommon to see regular expressions such as the following, with the intent of capturing digits after a greedy repeat is applied:


boost::regex reg("(.*)(\\d{2})");

This regular expression succeeds, but it might not match the subexpressions that you think it should! The expression .* happily eats everything that following subexpressions don't match. Here's a sample program that exhibits this behavior:

#include <iostream>
#include "boost/regex.hpp"

int main() {
    boost::regex reg("(.*)(\\d{2})");
    boost::cmatch m;
    const char* text = "Note that I'm 31 years old, not 32.";

    if (boost::regex_search(text,m,reg)) {
        if (m[1].matched)
            std::cout << "(.*) matched: " << m[1].str() << '\n';
        if (m[2].matched)
            std::cout << "Found the age: " << m[2] << '\n';
    }
}

In this program, we are using another parameterization of match_results, through the type cmatch. It is a typedef for match_results<const char*>, and the reason we must use it rather than the type smatch we've been using before is that we're now calling regex_search with a const char* (initialized from a string literal) rather than an object of type std::string.

What do you expect the output of running this program to be? Typically, users new to regular expressions first think that both m[1].matched and m[2].matched will be true, and that the result of the second subexpression will be "31". Next, after realizing the effects of greedy repeats (that they consume as much input as possible), they tend to think that only the first subexpression can be true; that is, the .* has successfully eaten all of the input. Finally, new users come to the correct conclusion that the expression will match both subexpressions, but that the second expression will match the last possible sequence. Here, that means that the first subexpression will match "Note that I'm 31 years old, not " (including the trailing space) and the second will match "32".

So, what do you do when what you actually want is to use a repeat and the first occurrence of another subexpression? Use non-greedy repeats. By appending ? to the repeat, it becomes non-greedy. This means that the expression tries to find the shortest possible match that doesn't prevent the rest of the expression from matching. So, to make the previous regex work correctly, we need to update it like so:


boost::regex reg("(.*?)(\\d{2})");

If we change the program to use this regular expression, both m[1].matched and m[2].matched will still be true. The expression .*? consumes as little of the input as it can, which means that it stops at the first character 3, because that's what the expression needs in order to successfully match. Thus, the first subexpression matches "Note that I'm " (including the trailing space) and the second matches "31".
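The greedy and non-greedy variants can be compared side by side with one helper that returns what the two subexpressions captured. This is a sketch under the assumption that std::regex from the C++ standard library is used in place of Boost.Regex, only to keep the example dependency-free; the helper name capture_two is hypothetical.

```cpp
#include <regex>
#include <string>
#include <utility>

// Assumption: std::regex stands in for boost::regex in this sketch.
// Returns the text captured by subexpressions 1 and 2, or two empty
// strings if the pattern does not match anywhere in the text.
std::pair<std::string, std::string>
capture_two(const std::string& text, const std::string& pattern) {
    const std::regex reg(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, reg))
        return {};
    return {m[1].str(), m[2].str()};
}
```

For the input "Note that I'm 31 years old, not 32.", the pattern "(.*)(\\d{2})" yields "32" as the second capture, whereas "(.*?)(\\d{2})" yields "31", with the first capture shrinking to "Note that I'm " accordingly.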
