A Look at regex_iterator
We have seen how to use several calls to regex_search in order to process all of an input sequence, but there's another, more elegant way of doing that: using a regex_iterator. This iterator type enumerates all of the regular expression matches in a sequence. Dereferencing a regex_iterator yields a reference to an instance of match_results. When constructing a regex_iterator, you pass to it the iterators denoting the input sequence, and the regular expression to apply. Let's look at an example where we have input data that is a comma-separated list of integers. The regular expression is simple:
boost::regex reg("(\\d+),?");
Adding the repeat ? (match zero or one times) after the comma ensures that the last number will be successfully parsed, even if the input sequence does not end with a comma. Further, we are using another repeat, +. This repeat ensures that \d matches one or more times, so each match covers a whole number rather than a single digit. Now, rather than doing multiple calls to regex_search, we create a regex_iterator, call the algorithm for_each, and supply it with a function object to call with the result of dereferencing the iterator. Here's a function object that accepts any form of match_results due to its parameterized function call operator. All it does is add the value of the current match to a total (in our regular expression, the first subexpression is the one we're interested in).
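#include <cstdlib>  // atoi

class regex_callback {
    int sum_;  // running total of all matched integers
public:
    regex_callback() : sum_(0) {}

    // T is some instantiation of match_results; what[1] is the first subexpression
    template <typename T> void operator()(const T& what) {
        sum_+=atoi(what[1].str().c_str());
    }

    int sum() const {
        return sum_;
    }
};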
You now pass an instance of this function object to std::for_each, which results in an invocation of the function call operator for every dereference of the iterator it; that is, it is invoked once for every match of the regex in the input.
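#include <algorithm>
#include <string>
#include "boost/regex.hpp"

int main() {
    boost::regex reg("(\\d+),?");
    std::string s="1,1,2,3,5,8,13,21";
    boost::sregex_iterator it(s.begin(),s.end(),reg);
    boost::sregex_iterator end;  // default-constructed: past-the-end

    regex_callback c;
    // for_each returns a copy of the function object, which holds the total
    int sum=std::for_each(it,end,c).sum();
}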
As you can see, the past-the-end iterator passed to for_each is simply a default-constructed instance of regex_iterator. Also, the type of it and end is boost::sregex_iterator, which is a typedef for regex_iterator<std::string::const_iterator>. Using regex_iterator this way is a much cleaner way of matching multiple times than what we did previously, where we manually had to advance the starting iterator and call regex_search in a loop.
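The same iteration can also be written as a plain loop, without a function object. The following is a minimal sketch, not code from the library: it assumes the standard-library std::regex (whose regex_iterator interface mirrors the Boost one) so that it builds without Boost, and the helper name sum_integers is hypothetical.

```cpp
#include <cstdlib>
#include <regex>
#include <string>

// Hypothetical helper: sums every integer in a comma-separated list.
// Assumption: std::regex is used in place of boost::regex.
int sum_integers(const std::string& input) {
    static const std::regex reg("(\\d+),?");
    int sum = 0;
    // A default-constructed iterator serves as the past-the-end marker.
    for (std::sregex_iterator it(input.begin(), input.end(), reg), end;
         it != end; ++it) {
        // (*it)[1] is the first subexpression: the digits without the comma.
        sum += std::atoi((*it)[1].str().c_str());
    }
    return sum;
}
```

Calling sum_integers("1,1,2,3,5,8,13,21") walks the same eight matches as the for_each version. With Boost, replacing the std:: prefix on the regex types with boost:: should be the only change needed.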
Splitting Strings with regex_token_iterator
Another iterator type, or to be more precise, an iterator adaptor, is boost::regex_token_iterator. It is similar to regex_iterator, but may also be employed to enumerate each character sequence that does not match the regular expression, which is useful for splitting strings. It is also possible to select which subexpressions are of interest, so that when dereferencing the regex_token_iterator, only the subexpressions that are "subscribed to" are returned. Consider an application that receives input data where the entries are separated using a forward slash. Anything in between constitutes an item that the application needs to process. With regex_token_iterator, splitting the strings is easy. The regular expression is very simple:
boost::regex reg("/");
The regex matches the separator of items. To use it for splitting the input, simply pass the special index -1 to the constructor of regex_token_iterator. Here is the complete program:
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>
#include "boost/regex.hpp"

int main() {
    boost::regex reg("/");
    std::string s="Split/Values/Separated/By/Slashes,";
    std::vector<std::string> vec;
    // Index -1 selects the text between matches rather than the matches
    boost::sregex_token_iterator it(s.begin(),s.end(),reg,-1);
    boost::sregex_token_iterator end;
    while (it!=end)
        vec.push_back(*it++);

    assert(vec.size()==std::count(s.begin(),s.end(),'/')+1);
    assert(vec[0]=="Split");
}
Similar to regex_iterator, regex_token_iterator is a template class parameterized on the iterator type for the sequence it wraps. Here, we're using sregex_token_iterator, which is a typedef for regex_token_iterator<std::string::const_iterator>. Each time the iterator it is dereferenced, it returns the current sub_match, and when the iterator is advanced, it tries to match the regular expression again. These two iterator types, regex_iterator and regex_token_iterator, are very useful; you'll know that you need them when you are considering calling regex_search multiple times!
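The splitting program uses the index -1; selecting a subexpression instead works the same way with a non-negative index. Here is a minimal sketch of that, again assuming std::regex in place of boost::regex so that it builds without Boost. The helper name extract_keys and the "key=value" input format are invented for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns only the text captured by subexpression 1
// (the key) of each "key=value" match.
// Assumption: std::regex is used in place of boost::regex.
std::vector<std::string> extract_keys(const std::string& input) {
    static const std::regex reg("(\\w+)=(\\w+)");
    std::vector<std::string> keys;
    // The last argument, 1, subscribes to the first subexpression only.
    std::sregex_token_iterator it(input.begin(), input.end(), reg, 1), end;
    for (; it != end; ++it)
        keys.push_back(it->str());
    return keys;
}
```

For the input "a=1;b=2;c=3", the iterator yields one sub_match per match of the regex, each holding just the key. Passing 2 instead of 1 would yield the values.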
More Regular Expressions
You have already seen quite a lot of regular expression syntax, but there's still more to know. This section quickly demonstrates the uses of some of the remaining functionality that is useful in your everyday regular expressions. To begin, we will look at the whole set of repeats; we've already looked at *, +, and bounded repeats using {}. There's one more repeat, and that's ?. You may have noted that it is also used to declare non-greedy repeats, but by itself, it means that the expression must occur zero or one times. It's also worth mentioning that the bounded repeats are very flexible; here are three different ways of using them:
boost::regex reg1("\\d{5}");
boost::regex reg2("\\d{2,4}");
boost::regex reg3("\\d{2,}");
The first regex matches exactly 5 digits. The second matches 2, 3, or 4 digits. The third matches 2 or more digits, without an upper limit.
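The three bounded repeats can be checked against sample input. This is a minimal sketch under the assumption that std::regex stands in for boost::regex; the helper name matches_fully is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of text matches pattern.
// Assumption: std::regex is used in place of boost::regex.
bool matches_fully(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

With this helper, matches_fully("12345", "\\d{5}") succeeds while "1234" fails against the same pattern; "\\d{2,4}" accepts "123" but rejects "12345"; and "\\d{2,}" accepts "1234567" but rejects "1". Because regex_match requires the entire input to match, the upper and lower bounds are both visible in the results.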
Another important regular expression feature is the negated character class, formed by placing the metacharacter ^ first inside the brackets. You use it to form character classes that match any character that is not part of the character class; that is, the complement of the elements you list in the character class. For example, consider this regular expression:
boost::regex reg("[^13579]");
It contains a negated character class that matches any character that is not one of the odd digits. Take a look at the following short program, and try to figure out what the output will be.
#include <iostream>
#include <string>
#include "boost/regex.hpp"

int main() {
    boost::regex reg4("[^13579]");
    std::string s="0123456789";
    boost::sregex_iterator it(s.begin(),s.end(),reg4);
    boost::sregex_iterator end;
    while (it!=end)
        std::cout << *it++;
}
Did you figure it out? The output is "02468"; that is, all of the even digits. Note that this character class does not only match even digits: had the input string been "AlfaBetaGamma," that would have matched just fine too.
The metacharacter we've just seen, ^, serves another purpose too. Outside a character class, it is used to denote the beginning of a line. The metacharacter $ denotes the end of a line.
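The two roles of ^ can be sketched side by side. The code below is an illustration, not code from the library: it assumes std::regex in place of boost::regex, uses single-line input only, and the helper names non_odd_digit_chars and is_all_digits are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: concatenates every character matched by the
// negated character class [^13579].
// Assumption: std::regex is used in place of boost::regex.
std::string non_odd_digit_chars(const std::string& input) {
    static const std::regex reg("[^13579]");
    std::string result;
    for (std::sregex_iterator it(input.begin(), input.end(), reg), end;
         it != end; ++it)
        result += it->str();
    return result;
}

// Hypothetical helper: ^ and $ anchor the match to the beginning and end
// of the (single-line) input, so a partial match is not enough.
bool is_all_digits(const std::string& text) {
    static const std::regex reg("^\\d+$");
    return std::regex_search(text, reg);
}
```

Here non_odd_digit_chars keeps the even digits of "0123456789" and leaves "AlfaBetaGamma" unchanged, since none of its characters is an odd digit. In contrast, is_all_digits accepts "12345" but rejects "12a45", even though regex_search would otherwise be satisfied by the leading "12".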
Bad Regular Expressions
A bad regular expression is one that doesn't conform to the rules that govern regexes. For example, if you happen to forget a closing parenthesis, there's no way the regular expression engine can successfully compile the regular expression. When that happens, an exception of type bad_expression is thrown. As I mentioned before, this name will change in the next version of Boost.Regex, and in the version that's going to be added to the Library Technical Report. The exception type bad_expression will be renamed to regex_error.
If all of your regular expressions are hardcoded into your application, you may be safe from having to deal with bad expressions, but if you're accepting user input in the form of regexes, you must be prepared to handle errors. Here's a program that prompts the user to enter a regular expression, followed by a string to be matched against the regex. As always, when there's user input involved, there's a chance that the input will be invalid.
#include <iostream>
#include <string>
#include "boost/regex.hpp"

int main() {
    std::cout << "Enter a regular expression:\n";
    std::string s;
    std::getline(std::cin, s);
    try {
        boost::regex reg(s);  // throws if s is not a valid regex
        std::cout << "Enter a string to be matched:\n";
        std::getline(std::cin,s);
        if (boost::regex_match(s,reg))
            std::cout << "That's right!\n";
        else
            std::cout << "No, sorry, that doesn't match.\n";
    }
    catch(const boost::bad_expression& e) {
        std::cout <<
            "That's not a valid regular expression! (Error: " <<
            e.what() << ") Exiting \n";
    }
}
To protect the application and the user, a try/catch block ensures that if boost::regex throws upon construction, an informative message will be printed, and the application will shut down gracefully. Putting this program to the test, let's begin with some reasonable input.
Enter a regular expression:
\d{5}
Enter a string to be matched:
12345
That's right!
Now, here's grief coming your way, in the form of a very poor attempt at a regular expression.
Enter a regular expression:
(\w*))
That's not a valid regular expression! (Error: Unmatched ( or \() Exiting
An exception is thrown when the regex reg is constructed, because the regular expression cannot be compiled. Consequently, the catch handler is invoked, and the program prints an error message and exits. There are only three places where you need to be aware of potential exceptions being thrown. One is when constructing a regular expression, similar to the example you just saw; another is when assigning regular expressions to a regex, using the member function assign. Finally, the regex iterators and the algorithms can also throw exceptions if memory is exhausted or if the complexity of the match grows too quickly.
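The second of these places, assign, can be guarded in the same way as construction. The following is a minimal sketch, assuming std::regex in place of boost::regex; the standard library already uses the regex_error name mentioned earlier. The helper name try_assign is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: tries to assign pattern to target. On failure it
// stores the exception's message in error and returns false.
// Assumption: std::regex and std::regex_error stand in for the Boost types.
bool try_assign(std::regex& target, const std::string& pattern,
                std::string& error) {
    try {
        target.assign(pattern);  // compiles the pattern; may throw
        return true;
    } catch (const std::regex_error& e) {
        error = e.what();
        return false;
    }
}
```

A caller might keep prompting the user until try_assign returns true, printing error after each failed attempt. A pattern with a forgotten closing parenthesis, such as "(\\d+", is the kind of input that takes the failure path.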