HandBooks Professional Java-C-Scrip-SQL part 61 pptx


A Look at regex_iterator

We have seen how to use several calls to regex_search in order to process all of an input sequence, but there's another, more elegant way of doing that: using a regex_iterator. This iterator type enumerates all of the regular expression matches in a sequence. Dereferencing a regex_iterator yields a reference to an instance of match_results. When constructing a regex_iterator, you pass to it the iterators denoting the input sequence, and the regular expression to apply. Let's look at an example where we have input data that is a comma-separated list of integers. The regular expression is simple.

boost::regex reg("(\\d+),?");

Adding the repeat ? (match zero or one times) to the end of the regular expression ensures that the last number will be successfully parsed, even if the input sequence does not end with a comma. Further, we are using another repeat, +. This repeat ensures that \d matches one or more times, so that a number with several digits is captured whole. Now, rather than doing multiple calls to regex_search, we create a regex_iterator, call the algorithm for_each, and supply it with a function object to call with the result of dereferencing the iterator. Here's a function object that accepts any form of match_results due to its parameterized function call operator. All it does is add the value of the current match to a total (in our regular expression, the first subexpression is the one we're interested in).

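#include <cstdlib>  // std::atoi

class regex_callback {
    int sum_;
public:
    regex_callback() : sum_(0) {}

    // Accepts any match_results type; adds the first subexpression to the total.
    template <typename T> void operator()(const T& what) {
        sum_+=std::atoi(what[1].str().c_str());
    }

    int sum() const {
        return sum_;
    }
};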

You now pass an instance of this function object to std::for_each, which results in an invocation of the function call operator for every dereference of the iterator it; that is, it is invoked once for every match of the regex in the input.

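#include <algorithm>  // std::for_each
#include <string>
#include "boost/regex.hpp"

// regex_callback is the function object defined above.
int main() {
    boost::regex reg("(\\d+),?");
    std::string s="1,1,2,3,5,8,13,21";
    boost::sregex_iterator it(s.begin(),s.end(),reg);
    boost::sregex_iterator end;

    regex_callback c;
    // for_each returns a copy of the function object, which holds the total.
    int sum=std::for_each(it,end,c).sum();
}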

As you can see, the past-the-end iterator passed to for_each is simply a default-constructed instance of regex_iterator. Also, the type of it and end is boost::sregex_iterator, which is a typedef for regex_iterator<std::string::const_iterator>. Using regex_iterator this way is a much cleaner way of matching multiple times than what we did previously, where we manually had to advance the starting iterator and call regex_search in a loop.
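For comparison, the same summation can also be written as an explicit loop over the iterator, without for_each and a function object. The following is only a sketch under stated assumptions: it uses std::regex from C++11, the standard-library descendant of Boost.Regex, rather than boost::regex itself, and the function name sum_matches is invented for this illustration.

```cpp
#include <cassert>  // assert, used only in the check mentioned below
#include <regex>
#include <string>

// Hypothetical helper: adds up the first subexpression of every match.
// Assumption: std::regex (C++11) stands in for boost::regex here.
int sum_matches(const std::string& input) {
    static const std::regex reg("(\\d+),?");
    int sum = 0;
    // A default-constructed iterator serves as the past-the-end marker.
    const std::sregex_iterator end;
    for (std::sregex_iterator it(input.begin(), input.end(), reg); it != end; ++it) {
        // (*it)[1] is the sub_match for the first parenthesized group.
        sum += std::stoi((*it)[1].str());
    }
    return sum;
}
```

With the input used in this section, assert(sum_matches("1,1,2,3,5,8,13,21") == 54) holds.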

Splitting Strings with regex_token_iterator

Another iterator type, or to be more precise, an iterator adaptor, is boost::regex_token_iterator. It is similar to regex_iterator, but may also be employed to enumerate each character sequence that does not match the regular expression, which is useful for splitting strings. It is also possible to select which subexpressions are of interest, so that when dereferencing the regex_token_iterator, only the subexpressions that are "subscribed to" are returned. Consider an application that receives input data where the entries are separated using a forward slash. Anything in between constitutes an item that the application needs to process. With regex_token_iterator, splitting the strings is easy. The regular expression is very simple.

boost::regex reg("/");

The regex matches the separator of items. To use it for splitting the input, simply pass the special index -1 to the constructor of regex_token_iterator. Here is the complete program:

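#include <algorithm>  // std::count
#include <cassert>
#include <string>
#include <vector>
#include "boost/regex.hpp"

int main() {
    boost::regex reg("/");
    std::string s="Split/Values/Separated/By/Slashes,";
    std::vector<std::string> vec;
    // The index -1 selects the text between matches of the separator.
    boost::sregex_token_iterator it(s.begin(),s.end(),reg,-1);
    boost::sregex_token_iterator end;
    while (it!=end)
        vec.push_back(*it++);

    assert(vec.size()==std::count(s.begin(),s.end(),'/')+1);
    assert(vec[0]=="Split");
}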

Similar to regex_iterator, regex_token_iterator is a template class parameterized on the iterator type for the sequence it wraps. Here, we're using sregex_token_iterator, which is a typedef for regex_token_iterator<std::string::const_iterator>. Each time the iterator it is dereferenced, it returns the current sub_match, and when the iterator is advanced, it tries to match the regular expression again. These two iterator types, regex_iterator and regex_token_iterator, are very useful; you'll know that you need them when you are considering calling regex_search multiple times!
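The other use of regex_token_iterator mentioned above, returning only the subexpressions that are "subscribed to", can be sketched in the same way. This example assumes C++11 and uses std::regex, the standard-library counterpart of Boost.Regex, instead of boost::regex; the key=value input format and the function name extract_values are made up for the illustration.

```cpp
#include <cassert>  // assert, used only in the check mentioned below
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the value part of each key=value pair.
// Assumption: std::regex (C++11) stands in for boost::regex here.
std::vector<std::string> extract_values(const std::string& input) {
    // Group 1 is the key, group 2 is the value.
    static const std::regex reg("(\\w+)=(\\w+)");
    std::vector<std::string> values;
    // Passing 2 subscribes to the second subexpression only;
    // passing -1 would instead yield the text between the matches.
    std::sregex_token_iterator it(input.begin(), input.end(), reg, 2);
    const std::sregex_token_iterator end;
    for (; it != end; ++it)
        values.push_back(it->str());
    return values;
}
```

For the input "a=1;b=22;c=333", the returned vector holds "1", "22", and "333", so assert(extract_values("a=1;b=22;c=333").size() == 3) holds.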

More Regular Expressions

You have already seen quite a lot of regular expression syntax, but there's still more to know. This section quickly demonstrates the uses of some of the remaining functionality that is useful in your everyday regular expressions. To begin, we will look at the whole set of repeats; we've already looked at *, +, and bounded repeats using {}. There's one more repeat, and that's ?. You may have noted that it is also used to declare non-greedy repeats, but by itself, it means that the expression must occur zero or one times. It's also worth mentioning that the bounded repeats are very flexible; here are three different ways of using them:

boost::regex reg1("\\d{5}");

boost::regex reg2("\\d{2,4}");

boost::regex reg3("\\d{2,}");

The first regex matches exactly 5 digits. The second matches 2, 3, or 4 digits. The third matches 2 or more digits, without an upper limit.
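The three bounded repeats can be tried out with a small helper. This is a sketch that assumes C++11 and std::regex (the standard-library descendant of Boost.Regex) rather than boost::regex; the name matches_fully is hypothetical.

```cpp
#include <cassert>  // assert, used only in the checks mentioned below
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of text matches pattern.
// Assumption: std::regex (C++11) stands in for boost::regex here.
bool matches_fully(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}
```

With this helper, assert(matches_fully("\\d{2,4}", "123")) holds, while matches_fully("\\d{2,4}", "12345") is false because regex_match requires the entire input to match.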

Another important regular expression feature is the negated character class, formed with the metacharacter ^. You use it to form character classes that match any character that is not part of the character class; that is, the complement of the elements you list in the character class. For example, consider this regular expression.

boost::regex reg("[^13579]");

It contains a negated character class that matches any character that is not one of the odd digits. Take a look at the following short program, and try to figure out what the output will be.

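#include <iostream>
#include <string>
#include "boost/regex.hpp"

int main() {
    boost::regex reg4("[^13579]");
    std::string s="0123456789";
    boost::sregex_iterator it(s.begin(),s.end(),reg4);
    boost::sregex_iterator end;
    // Each dereference yields a match_results, which is streamed as its matched text.
    while (it!=end)
        std::cout << *it++;
}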

Did you figure it out? The output is "02468"; that is, all of the even digits. Note that this character class does not only match even digits; had the input string been "AlfaBetaGamma," that would have matched just fine too.

The metacharacter we've just seen, ^, serves another purpose too. Outside of a character class, it is used to denote the beginning of a line. The metacharacter $ denotes the end of a line.
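Both roles of ^ can appear in the same expression. The sketch below assumes C++11 and std::regex (the standard-library descendant of Boost.Regex) in place of boost::regex, and it deliberately uses single-line input, because the handling of embedded newlines by ^ and $ depends on the library and its flags. The function name has_no_odd_digits is invented for the example.

```cpp
#include <cassert>  // assert, used only in the checks mentioned below
#include <regex>
#include <string>

// Hypothetical helper: true if text is non-empty and contains no odd digit.
// Assumption: std::regex (C++11) stands in for boost::regex here.
bool has_no_odd_digits(const std::string& text) {
    // The first ^ anchors the match at the start of the input, the ^ inside
    // the brackets negates the character class, and $ anchors the end.
    static const std::regex reg("^[^13579]+$");
    return std::regex_search(text, reg);
}
```

Here assert(has_no_odd_digits("02468")) and assert(has_no_odd_digits("AlfaBetaGamma")) both hold, while has_no_odd_digits("0123456789") is false because the anchors force the character class to cover the whole input.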

Bad Regular Expressions

A bad regular expression is one that doesn't conform to the rules that govern regexes. For example, if you happen to forget a closing parenthesis, there's no way the regular expression engine can successfully compile the regular expression. When that happens, an exception of type bad_expression is thrown. As I mentioned before, this name will change in the next version of Boost.Regex, and in the version that's going to be added to the Library Technical Report. The exception type bad_expression will be renamed to regex_error.

If all of your regular expressions are hardcoded into your application, you may be safe from having to deal with bad expressions, but if you're accepting user input in the form of regexes, you must be prepared to handle errors. Here's a program that prompts the user to enter a regular expression, followed by a string to be matched against the regex. As always, when there's user input involved, there's a chance that the input will be invalid.

#include <iostream>
#include <string>
#include "boost/regex.hpp"

int main() {
    std::cout << "Enter a regular expression:\n";
    std::string s;
    std::getline(std::cin, s);
    try {
        // Construction throws if the expression cannot be compiled.
        boost::regex reg(s);
        std::cout << "Enter a string to be matched:\n";
        std::getline(std::cin,s);
        if (boost::regex_match(s,reg))
            std::cout << "That's right!\n";
        else
            std::cout << "No, sorry, that doesn't match.\n";
    }
    catch(const boost::bad_expression& e) {
        std::cout <<
            "That's not a valid regular expression! (Error: " <<
            e.what() << ") Exiting \n";
    }
}

To protect the application and the user, a try/catch block ensures that if boost::regex throws upon construction, an informative message will be printed, and the application will shut down gracefully. Putting this program to the test, let's begin with some reasonable input.

Enter a regular expression:

\d{5}

Enter a string to be matched:

12345

That's right!

Now, here's grief coming your way, in the form of a very poor attempt at a regular expression.

Enter a regular expression:

(\w*))

That's not a valid regular expression! (Error: Unmatched ( or \() Exiting


An exception is thrown when the regex reg is constructed, because the regular expression cannot be compiled. Consequently, the catch handler is invoked, and the program prints an error message and exits. There are only three places where you need to be aware of potential exceptions being thrown. One is when constructing a regular expression, similar to the example you just saw; another is when assigning regular expressions to a regex, using the member function assign. Finally, the regex iterators and the algorithms can also throw exceptions if memory is exhausted or if the complexity of the match grows too quickly.
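The second case, a failing assign, can be guarded in the same way as construction. The sketch below assumes C++11 and std::regex, where the exception type already carries the name regex_error, instead of boost::regex and bad_expression; the helper name try_assign is hypothetical.

```cpp
#include <cassert>  // assert, used only in the checks mentioned below
#include <regex>
#include <string>

// Hypothetical helper: tries to compile pattern into reg via assign().
// Returns false and stores the error text in message if compilation fails.
// Assumption: std::regex (C++11) stands in for boost::regex here.
bool try_assign(std::regex& reg, const std::string& pattern, std::string& message) {
    try {
        reg.assign(pattern);
        return true;
    } catch (const std::regex_error& e) {
        // The wording of what() is implementation-defined.
        message = e.what();
        return false;
    }
}
```

Given a std::regex r and a std::string msg, assert(try_assign(r, "\\d{5}", msg)) holds, while try_assign(r, "(\\w*", msg) returns false because of the missing closing parenthesis.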
