This pattern is looking increasingly complex (but it actually is not), so let's look at it together. \w+ matches one or more alphanumeric characters but not . (the valid characters with which to start an email address). After the initial valid characters, it is indeed possible to have a . and additional characters, although these may in fact not be present. [\w.]* matches zero or more instances of . or alphanumeric characters, which is exactly what was needed.
Note
Think of * as being the make it optional metacharacter. Unlike +,
which requires at least one match, * matches any number of
matches if present, but does not require that any be present.
* is a metacharacter. To match an * you'll need to escape it as \*.
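The difference between + and * can be checked in a few lines of code. The following is a minimal sketch that assumes Python's re module (the lesson itself is not tied to any one engine); the sample strings are made up for illustration. It applies \w+[\w.]*, the fragment discussed above, and then uses \* to match a literal asterisk.

```python
import re

# \w+ needs at least one word character; [\w.]* may then match nothing at all.
local_part = re.compile(r"\w+[\w.]*")

# Hypothetical sample strings, not taken from the lesson.
plain = local_part.fullmatch("ben")          # [\w.]* matches zero characters
dotted = local_part.fullmatch("ben.forta")   # [\w.]* matches ".forta"
leading_dot = local_part.fullmatch(".ben")   # fails: \w+ cannot match "."

# An escaped \* matches a literal asterisk instead of acting as a metacharacter.
stars = re.findall(r"\*", "2 * 3 * 4")

print(plain is not None, dotted is not None, leading_dot is None)
print(stars)
```

fullmatch is used here so that each string must be consumed in its entirety, which makes the "zero or more" behavior of * visible.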
Matching Zero or One Character
One other very useful metacharacter is ?. Like *, ? matches optional text (and so zero instances will match). But unlike *, ? matches only zero or one instance of a character (or set), never more than one. As such, ? is very useful for matching specific, single optional characters in a block of text.
Consider the following example:
The URL is http://www.forta.com/, to connect
securely use https://www.forta.com/ instead
http://[\w./]+
The URL is http://www.forta.com/, to connect
securely use https://www.forta.com/ instead
The pattern used to match a URL is http:// (which is literal text and therefore matches only itself) followed by [\w./]+, which matches one or more instances of a set that allows alphanumeric characters, ., and forward slash. This pattern can match only the first URL (the one that starts with http://) but not the second (the one that starts with https://). And s* (zero or more instances of s) would not be correct because that would then also allow httpsssss:// (which is definitely not valid).
The solution? Use s? as seen in the following example:
The URL is http://www.forta.com/, to connect
securely use https://www.forta.com/ instead
https?://[\w./]+
The URL is http://www.forta.com/, to connect
securely use https://www.forta.com/ instead
The pattern here begins with https?://. The ? makes the preceding character (the s) optional: the pattern matches if the s is not present, or if a single instance of it is present.
In other words, https?:// matches both http:// and https:// (but
nothing else).
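The three variants discussed above (no s, s*, and s?) can be compared side by side. This is a minimal sketch assuming Python's re module; the sample text is the one from the lesson, and the variable names are only illustrative.

```python
import re

text = ("The URL is http://www.forta.com/, to connect\n"
        "securely use https://www.forta.com/ instead")

# Literal http:// finds only the first URL.
http_only = re.findall(r"http://[\w./]+", text)

# s? makes the s optional, so both URLs are found.
either = re.findall(r"https?://[\w./]+", text)

# s* is too permissive: it accepts any number of s characters.
invalid_url = "httpsssss://www.forta.com/"
too_loose = re.fullmatch(r"https*://[\w./]+", invalid_url)
strict = re.fullmatch(r"https?://[\w./]+", invalid_url)

print(http_only)
print(either)
print(too_loose is not None, strict is None)
```

Note that the trailing comma after the first URL is not part of the match, because , is not in the set [\w./].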
Incidentally, using ? is the solution to a problem alluded to in the previous lesson. You looked at an example where \r\n was being used to match an end of line, and I mentioned that on Unix or Linux boxes, you would need to use \n (without \r) and that an ideal solution would be to match an optional \r followed by \n. That example follows again, this time using a slightly modified regular expression:
"101","Ben","Forta"
"102","Jim","James"
"103","Roberta","Robertson"
"104","Bob","Bobson"
[\r]?\n[\r]?\n
"101","Ben","Forta"
"102","Jim","James"
"103","Roberta","Robertson"
"104","Bob","Bobson"
[\r]?\n matches an optional single instance of \r followed by a required \n.
Tip
You'll notice that the regular expression here used [\r]? instead
of \r?. [\r] defines a set containing a single metacharacter, a
set of one, so [\r]? is actually functionally identical to \r?. []
is usually used to define a set of characters, but some developers
like to use it even around single characters to prevent ambiguity
(to make it stand out so that you know exactly what the following
metacharacter applies to). If you are using both [] and ?, make
sure to place the ? outside of the set. Therefore, http[s]?://
is correct, but http[s?]:// is not.
Tip
? is a metacharacter. To match a ? you'll need to escape it as \?.
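The points made in the tips above can be verified with a short experiment. The sketch below assumes Python's re module, and the two-record sample strings are shortened, hypothetical versions of the lesson's data. It shows [\r]?\n matching both styles of line ending, the difference between http[s]?:// and http[s?]://, and \? matching a literal question mark.

```python
import re

line_end = re.compile(r"[\r]?\n")

# Hypothetical two-record samples with Windows and Unix line endings.
windows_text = '"101","Ben","Forta"\r\n"102","Jim","James"\r\n'
unix_text = '"101","Ben","Forta"\n"102","Jim","James"\n'

windows_ends = line_end.findall(windows_text)  # each match includes the \r
unix_ends = line_end.findall(unix_text)        # each match is just \n

# ? outside the set: the s is optional.
outside = re.compile(r"http[s]?://")
# ? inside the set: the set now requires exactly one s or one literal ?.
inside = re.compile(r"http[s?]://")

outside_plain = outside.match("http://")   # matches
inside_plain = inside.match("http://")     # no match: a set member is required
inside_qmark = inside.match("http?://")    # matches, which is rarely intended

# An escaped \? matches a literal question mark.
question_marks = re.findall(r"\?", "Why? How?")

print(windows_ends, unix_ends)
print(outside_plain is not None, inside_plain is None, inside_qmark is not None)
print(question_marks)
```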
Using Intervals
+, *, and ? are used to solve many problems with regular expressions, but
sometimes they are not enough. Consider the following:
+ and * match an unlimited number of characters. They provide no way to set a maximum number of characters to match.
The only minimums supported by +, *, and ? are zero or one. They provide
no way to set an explicit minimum number of matches.
There is also no way to specify an exact number of matches desired.
To solve these problems, and to provide a greater degree of control over repeating matches, regular expressions allow for the use of intervals. Intervals are specified between the { and } characters.
Note
{ and } are metacharacters and, as such, should be escaped using
\ when needed as literal text. It is worth noting that many regular
expression implementations seem to be able to correctly process {
and } even if they are not escaped (being able to determine when
they are literal and when they are metacharacters). However, it is
best not to rely on this behavior and to escape the characters when
needing them as literals.
Exact Interval Matching
To specify an exact number of matches, you place that number between { and }. Therefore, {3} means match three instances of the previous character or set. If there are only two instances, the pattern would not match.
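As a quick illustration of an exact interval, the following sketch assumes Python's re module and uses a made-up pattern, [0-9]{3}, with hypothetical input strings. It shows that exactly three instances are required, and that an unanchored search will still find three instances inside a longer run.

```python
import re

# {3} requires exactly three instances of the preceding set.
three_digits = re.compile(r"[0-9]{3}")

exact = three_digits.fullmatch("123")      # matches: three digits
too_few = three_digits.fullmatch("12")     # no match: only two digits
too_many = three_digits.fullmatch("1234")  # no match: the whole string is four digits

# search does not anchor the pattern, so it finds the first three digits
# of a longer run.
inside_longer = three_digits.search("1234")

print(exact is not None, too_few is None, too_many is None)
print(inside_longer.group())
```

The last case is worth keeping in mind: the interval limits how much the pattern itself consumes, not what may appear next to the match.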
To demonstrate this, let's revisit the RGB example (used in Lessons 3 and 4). You will recall that RGB values are specified as three sets of hexadecimal numbers (each of 2 characters). The first pattern used to match an RGB value was the following:
#[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
In Lesson 4, you used a POSIX class and changed the pattern to
#[[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]]
The problem with both patterns is that you had to repeat the exact character set (or class) six times. Here is the same example, this time using interval matching:
<BODY BGCOLOR="#336633" TEXT="#FFFFFF"
MARGINWIDTH="0" MARGINHEIGHT="0"
TOPMARGIN="0" LEFTMARGIN="0">
#[[:xdigit:]]{6}
<BODY BGCOLOR="#336633" TEXT="#FFFFFF"