Figure 12.11 Splitting with multiple delimiters. Output from Example 12.10.
preg_split() PREG_SPLIT_NO_EMPTY
Figure 12.13 Splitting up a string with the preg_split() function. Output from Example 12.12.
Other related PHP functions are spliti(), split(), implode(), and explode(). See Chapter 8, “Arrays,” for more on these. Note that split() and spliti() were deprecated in PHP 5.3 and removed in PHP 7.0.
The preg_grep() Function
Similar to the UNIX grep command, the preg_grep() function returns an array of the values that match a pattern. The values are searched for in an array rather than in a single string. You can also invert the search and get an array of all elements that do not contain the pattern (like UNIX grep -v) by using the PREG_GREP_INVERT flag.
Format
array preg_grep ( string pattern, array input [, int flags] )
Example:
$new_array = preg_grep("/ma/", array("normal", "mama", "man","plan"));
// $new_array contains: normal, mama, man
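The sketch below runs the same search both ways, first normally and then with PREG_GREP_INVERT. The variable names are illustrative only. Note that preg_grep() preserves the keys of the input array, so the inverted result keeps key 3.

```php
<?php
// Illustrative input: the same four words as the example above.
$words = array("normal", "mama", "man", "plan");

// Elements that contain "ma" (original keys are preserved).
$matches = preg_grep("/ma/", $words);

// Elements that do NOT contain "ma", like grep -v.
$others = preg_grep("/ma/", $words, PREG_GREP_INVERT);

print implode(", ", $matches) . "\n";  // normal, mama, man
print implode(", ", $others) . "\n";   // plan
```

If sequential keys are needed, wrap the result in array_values().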
3 $newarray=preg_grep( $regex, $search_array );
4 print "Found " . count($newarray) . " matches\n";
preg_grep() /Pat/
12.2.3 Getting Control—The RegEx Metacharacters
Regular expression metacharacters are characters that do not represent themselves. They are endowed with special powers to allow you to control the search pattern in some way (e.g., finding a pattern only at the beginning of the line, or at the end of the line, or if it starts with an upper- or lowercase letter). Metacharacters lose their special meaning if preceded by a backslash. For example, the dot metacharacter represents any single character, but when preceded by a backslash it is just a dot or period.
If you see a backslash preceding a metacharacter, the backslash turns off the meaning of the metacharacter, but if you see a backslash preceding an alphanumeric character in a regular expression, then the backslash is used to create a metasymbol. A metasymbol provides a simpler form to represent some of the regular expression metacharacters. For example, [0-9] represents numbers in the range between 0 and 9, and \d represents the same thing. [0-9] uses the bracketed character class, whereas \d is a metasymbol (see Table 12.6).
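As a small illustration of both roles of the backslash, the following sketch first escapes a metacharacter and then uses a metasymbol. The test strings are made up for the example.

```php
<?php
// An unescaped dot matches any single character except a newline.
$any = preg_match('/a.c/', "abc");            // 1: "b" satisfies the dot

// An escaped dot matches only a literal period.
$literal_miss = preg_match('/a\.c/', "abc");  // 0
$literal_hit  = preg_match('/a\.c/', "a.c");  // 1

// The metasymbol \d and the class [0-9] both find the digit here.
$with_symbol = preg_match('/\d/', "room 7");     // 1
$with_class  = preg_match('/[0-9]/', "room 7");  // 1
```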
Table 12.6 Metacharacters
Metacharacter    What It Matches
[a-z0-9]         Matches any single character in a set
[^a-z0-9]        Matches any single character not in a set
\D               Matches a nondigit, same as [^0-9]
\w               Matches an alphanumeric (word) character
\B               Matches a nonword boundary
^                Matches to beginning of line
$                Matches to end of line
\A               Matches the beginning of the string only
\1               Matches first set of parentheses
\2               Matches second set of parentheses
The expression reads: Search at the beginning of the line for a letter a, followed by any three single characters, followed by a letter c. It will match, for example, abbbc, a123c, a   c (with three spaces between a and c), aAx3c, and so on, but only if those patterns are found at the beginning of the line.
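The description above corresponds to a pattern such as /^a...c/ (an anchor, the letter a, three dots, the letter c). The following sketch tries that pattern against the strings mentioned and two counterexamples. The pattern is reconstructed from the description, and the counterexamples are made up.

```php
<?php
// Pattern inferred from the prose: a, any three characters, c, at line start.
$pattern = '/^a...c/';

// The last two samples are hypothetical counterexamples.
$samples = array("abbbc", "a123c", "a   c", "aAx3c", "xabbbc", "abc");

$results = array();
foreach ($samples as $s) {
    $results[$s] = (bool) preg_match($pattern, $s);
    print "'$s': " . ($results[$s] ? "match" : "no match") . "\n";
}
```

The string xabbbc fails because the a is not at the beginning, and abc fails because only one character sits between a and c.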
In the following examples, we perform pattern matches, searches, and replacements based on the data from a text file called data.txt. In the PHP program, the file will be opened and, within a while loop, each line will be read. The functions discussed in the previous section will be used to find patterns within each line of the file. The regular expressions will contain metacharacters, described in Table 12.6.
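The general shape of such a program can be sketched as follows. Because the real data.txt is not reproduced here, the sketch writes a few sample lines to a temporary file first. The file contents and the pattern /^S/ are assumptions made for illustration.

```php
<?php
// Hypothetical stand-in for data.txt, written to a temporary file
// so that the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), "data");
file_put_contents(
    $path,
    "Mama Bear 702\nSteve Blenheim 100\nBetty Boop 200\nJon DeLoach 500\n"
);

$found = array();
$fh = fopen($path, "r");
// fgets() returns false at end of file, which ends the loop.
while (($line = fgets($fh)) !== false) {
    // Keep only the lines whose first character is S.
    if (preg_match('/^S/', $line)) {
        $found[] = rtrim($line);
        print $line;
    }
}
fclose($fh);
unlink($path);
```

Testing the return value of fgets() directly avoids processing an empty read after the last line, which can happen with a loop that tests only feof().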
Anchoring Metacharacters
Often it is necessary to find a pattern only if it is found at the beginning or end of a line, word, or string. The “anchoring” metacharacters (see Table 12.7) are based on a position just to the left or to the right of the character that is being matched. Anchors are technically called zero-width assertions because they correspond to positions, not actual characters in a string; for example, /^abc/ means find abc at the beginning of the line, where the ^ represents a position, not an actual character.
Table 12.7 Anchors (Assertions)
The ^ metacharacter is called the beginning-of-line anchor. It is the first character in the regular expression and matches a pattern found at the beginning of a line or string.
- (Output)
Steve Blenheim 100
fgets() preg_match()
Word Boundaries
A word boundary is represented in a regular expression by the metasymbol \b. You can search for a word that begins with a pattern, ends with a pattern, or both begins and ends with a pattern. For example, /\blove/ matches a word beginning with the pattern love, and would match lover, loveable, or lovely, but would not find glove. /love\b/ matches a word ending with the pattern love, and would match glove, clove, or love, but not clover. /\blove\b/ matches a word beginning and ending with the pattern love, and would match only the word love.
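The three forms can be compared side by side with preg_grep(). The word list below is taken from the examples in the paragraph above.

```php
<?php
$words = array("love", "lover", "lovely", "glove", "clover");

// Words that begin with "love".
$begins = array_values(preg_grep('/\blove/', $words));

// Words that end with "love".
$ends = array_values(preg_grep('/love\b/', $words));

// The word "love" by itself.
$whole = array_values(preg_grep('/\blove\b/', $words));

print implode(", ", $begins) . "\n";  // love, lover, lovely
print implode(", ", $ends) . "\n";    // love, glove
print implode(", ", $whole) . "\n";   // love
```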
- (The Output)
Matching Single Characters and Digits
There are metacharacters to match single characters or digits, and single noncharacters or nondigits, whether in or not in a set.
The Dot Metacharacter
The dot metacharacter matches any single character except the newline character. For example, the regular expression /a.b/ matches if the string contains a letter a, followed by any one single character (except \n), followed by a letter b, whereas the expression /.../ matches any string containing at least three characters. To match a literal period, the dot metacharacter must be preceded by a backslash; for example, /love\./ matches love followed by a period, but not lover.
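A few quick checks make the behavior of the dot concrete. The test strings are made up for this sketch.

```php
<?php
$r1 = preg_match('/a.b/', "a-b");         // 1: "-" satisfies the dot
$r2 = preg_match('/a.b/', "a\nb");        // 0: the dot skips newlines by default
$r3 = preg_match('/.../', "hi");          // 0: fewer than three characters
$r4 = preg_match('/.../', "hey");         // 1
$r5 = preg_match('/love\./', "I love.");  // 1: a literal period follows love
$r6 = preg_match('/love\./', "lover");    // 0
```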
Jon DeLoach 500
Explanation
data.txt
whileexecute
fgets() /^ /
$text Daniel
The Character Class
A character class represents one character from a set of characters. For example, [abc] matches either an a, b, or c; [a-z] matches one character from a set of characters in the range from a to z; and [0-9] matches one character in the range of digits between 0 and 9. If the character class contains a leading caret ^, then the class represents any one character not in the set; for example, [^a-zA-Z] matches a single character not in the range from a to z or A to Z, and [^0-9] matches a single character that is not a digit between 0 and 9 (see Table 12.8).
Table 12.8 Character Classes
If you are searching for a particular character within a regular expression, you can use the dot metacharacter to represent a single character, or a character class that matches on one character from a set of characters. In addition to the dot and character class, PHP supports some backslashed symbols (called metasymbols) to represent single characters.
Matching One Character from a Set
A regular expression character class represents one character out of a set of characters, as shown in Example 12.19.
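As a sketch of the idea, the pattern /^[BKI]/ selects lines that start with B, K, or I. The sample lines below are assumed for illustration and do not reproduce the full data file.

```php
<?php
// Hypothetical sample lines in the style of data.txt.
$lines = array(
    "Betty Boop 200",
    "Karen Evich 600",
    "Steve Blenheim 100",
    "Jon DeLoach 500",
);

// One character from the set B, K, I at the beginning of the line.
$selected = array_values(preg_grep('/^[BKI]/', $lines));

print implode("\n", $selected) . "\n";
```

Only the Betty and Karen lines are selected; none of the sample lines start with I.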
fgets() /^[BKI]/
B K I
Matching One Character in a Range
A character class can also be represented as a range of characters by placing a dash between two characters, the first being the start of the range and the second the end of the range; for example, [0-9] represents one character in the range between 0 and 9 and [A-Za-z0-9] represents one alphanumeric character. If you want to represent a range between 10 and 13, the regular expression would be /1[0-3]/, not /[10-13]/, because only one character can be matched in a character class.
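The difference between /1[0-3]/ and /[10-13]/ is easy to check. In this sketch the patterns are anchored with ^ and $ (an addition for the test) so that each must match the whole string.

```php
<?php
// Collect the numbers from 9 to 14 that match 1 followed by 0, 1, 2, or 3.
$in_range = array();
foreach (range(9, 14) as $n) {
    if (preg_match('/^1[0-3]$/', (string) $n)) {
        $in_range[] = $n;
    }
}
print implode(", ", $in_range) . "\n";  // 10, 11, 12, 13

// [10-13] is a class holding 1, the range 0-1, and 3: one character only.
$two_chars = preg_match('/^[10-13]$/', "12");  // 0
$one_char  = preg_match('/^[10-13]$/', "3");   // 1
```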
fgets() /[E-M]/
- (Output)
Matching One Character Not in a Set
When a character set contains a caret right after the opening square bracket, the search is inverted; that is, the regular expression represents one character not in the set or in the range. For example, [^a-z] represents one character that is not in the range between a and z.
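A negated class can be sketched with preg_replace(), which here deletes every character that is not a lowercase letter. The input strings are illustrative.

```php
<?php
// Remove everything that is not in the range a to z.
$lower_only = preg_replace('/[^a-z]/', '', "Mama Bear 702");
print $lower_only . "\n";  // amaear

// [^0-9] needs at least one nondigit character to match.
$all_digits = preg_match('/[^0-9]/', "12345");  // 0
$has_letter = preg_match('/[^0-9]/', "123a5");  // 1
```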
Mama Bear 702
Steve Blenheim 100
Metasymbols Representing Digits and Spaces
The character class [0-9] represents one digit in the range between 0 and 9, as does the metasymbol \d. To create a regular expression that matches on three digits, you could write /[0-9][0-9][0-9]/ or simply /\d\d\d/. To represent a space, you can either insert a blank space, or use the metasymbol \s, which matches any single whitespace character (a space, tab, or newline).
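The two ways of writing three digits can be compared directly, and \s can be used as a delimiter. The sample line is one of the data lines shown in this section.

```php
<?php
$line = "Jon DeLoach 500";

// Both patterns capture the same three digits.
preg_match('/[0-9][0-9][0-9]/', $line, $by_class);
preg_match('/\d\d\d/', $line, $by_symbol);
print $by_symbol[0] . "\n";  // 500

// Split the line on single whitespace characters.
$fields = preg_split('/\s/', $line);
print count($fields) . "\n";  // 3
```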
Jon DeLoach 500
Karen Evich 600
BB Kingson 803
- (The PHP Program)
Metasymbols Representing Alphanumeric Word Characters
The metasymbol to represent one alphanumeric word character is \w, which is much easier to write than [a-zA-Z0-9_]. To represent one nonword character, simply capitalize the metasymbol: \W is the same as [^a-zA-Z0-9_].
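A minimal sketch of \w and \W follows. The replacement character and the sample line are chosen only for illustration and are not taken from the book's example.

```php
<?php
$text = "Mama Bear 702";

// Replace every nonword character (here, the two spaces) with an underscore.
$joined = preg_replace('/\W/', '_', $text);
print $joined . "\n";  // Mama_Bear_702

// Collect each run of word characters.
$count = preg_match_all('/\w+/', $text, $found);
print $count . "\n";  // 3
```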
<?php
1 $fh=fopen("data.txt", "r");
MamaXXear 702
Metacharacters to Repeat Pattern Matches
In the previous examples, the metacharacter matched on a single character. What if you want to match on more than one character? For example, let’s say you are looking for all lines containing names where the first letter must be in uppercase, which can be represented as [A-Z], but the following letters are lowercase and the number of letters varies in each name. [a-z] matches on a single lowercase letter. How can you match on one or more lowercase letters? Zero or more lowercase letters? To do this you can use what are called quantifiers. To match on one or more lowercase letters, the regular expression can be written /[a-z]+/, where the + sign means “one or more of the previous characters,” in this case, one or more lowercase letters. PHP provides a number of quantifiers, as shown in Table 12.10.
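The name-matching idea described above can be sketched as follows. The anchors in the last two patterns are an addition so that the whole string must match.

```php
<?php
// An uppercase letter followed by one or more lowercase letters.
$count = preg_match_all('/[A-Z][a-z]+/', "Steve Blenheim 100", $names);
print implode(", ", $names[0]) . "\n";  // Steve, Blenheim

// With +, a lone capital letter is not enough; with *, it is.
$plus = preg_match('/^[A-Z][a-z]+$/', "B");  // 0
$star = preg_match('/^[A-Z][a-z]*$/', "B");  // 1
```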
Table 12.10 The Greedy Metacharacters
The Greed Factor
Normally quantifiers are greedy; that is, they match on the largest possible set of characters starting at the left side of the string and searching to the right, looking for the last possible character that would satisfy the condition. For example, given the string:
9 It is called greedy because the matching continues until the last number is found, in this example the number 7. The pattern ab and all of the numbers in the range between 0 and 9 are replaced with a single X.
Greediness can be turned off so that instead of matching on the maximum number of characters, the match is made on the minimal number of characters found. This is done by appending a question mark after the greedy metacharacter. See Example 12.26.
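The effect can be sketched with a short replacement. The input string here is hypothetical and is not the string from the book's example.

```php
<?php
// Hypothetical input: "ab" followed by a run of digits.
$text = "ab12345 end";

// Greedy: * takes every digit it can, so "ab12345" becomes X.
$greedy = preg_replace('/ab[0-9]*/', 'X', $text);
print $greedy . "\n";  // X end

// Lazy: *? takes as few digits as possible (none), so only "ab" becomes X.
$lazy = preg_replace('/ab[0-9]*?/', 'X', $text);
print $lazy . "\n";    // X12345 end
```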
Steve Blenheim 100
Jon DeLoach 500
The * Metacharacter and Greed
The * metacharacter is often misunderstood as being a wildcard that matches everything, but it applies only to the character (or group) that precedes it. In the regular expression /ab*c/, the asterisk is attached to the b, meaning that zero or more occurrences of the letter b will be matched. The strings abc, abbbbbbbc, and ac would all be matched.
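A quick check of /ab*c/ against the strings above, plus one made-up counterexample:

```php
<?php
$pattern = '/ab*c/';

$r_abc  = preg_match($pattern, "abc");        // 1: one b
$r_many = preg_match($pattern, "abbbbbbbc");  // 1: many b's
$r_ac   = preg_match($pattern, "ac");         // 1: zero b's
$r_abxc = preg_match($pattern, "abxc");       // 0: x is not a b, and * is no wildcard
```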
}
}
?>
- (Output)
The + Metacharacter and Greed
The + metacharacter attaches itself to the preceding character and matches on one or more of that character.
Mama Bear 702
Steve Blenheim 100
Betty Boop 200
Matching for Repeating Characters
To match a character that is repeated a number of times, follow the character with a set of curly braces containing a number that represents how many times the pattern should be repeated (see Table 12.11). A single number within the curly braces (e.g., {5}) represents an exact number of occurrences; two numbers separated by a comma (e.g., {3,10}) represent an inclusive range; and a number followed by a comma (e.g., {4,}) represents at least that number of occurrences, with no upper limit.
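The three forms can be sketched with anchored digit patterns. The anchors and test strings are additions for this example, so that the count applies to the whole string.

```php
<?php
// Exactly three digits.
$exact_yes = preg_match('/^\d{3}$/', "555");    // 1
$exact_no  = preg_match('/^\d{3}$/', "5555");   // 0

// Between three and five digits, inclusive.
$range_yes = preg_match('/^\d{3,5}$/', "5555");    // 1
$range_no  = preg_match('/^\d{3,5}$/', "555555");  // 0

// Four or more digits.
$min_no  = preg_match('/^\d{4,}$/', "555");      // 0
$min_yes = preg_match('/^\d{4,}$/', "5555555");  // 1
```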
Table 12.11 Repeating Characters
- (Output)
Metacharacters That Turn Off Greediness
By placing a question mark after a greedy quantifier, the greed is turned off: the quantifier matches as few characters as possible, so the match ends at the first point where the rest of the pattern can succeed, rather than the last one.
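The following sketch contrasts the greedy /B.* / with the lazy /B.*? /. The sample line is assumed. The fourth argument to preg_replace() limits the number of replacements and is used here to isolate the first match.

```php
<?php
$text = "Betty Boop 200";

// Greedy: .* runs to the LAST space, so "Betty Boop " is replaced.
$greedy = preg_replace('/B.* /', 'John ', $text);
print $greedy . "\n";      // John 200

// Lazy, first match only: .*? stops at the FIRST space.
$lazy_first = preg_replace('/B.*? /', 'John ', $text, 1);
print $lazy_first . "\n";  // John Boop 200

// Lazy, no limit: "Boop " also starts with B, so it is replaced as well.
$lazy_all = preg_replace('/B.*? /', 'John ', $text);
print $lazy_all . "\n";    // John John 200
```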
6 $newtext=preg_replace("/B.*? /","John ",$text);
echo "$newtext";
}
?>
- (Output)
rewind()($fh
while
B
John”
Metacharacters for Alternation
Alternation allows the regular expression to contain alternative patterns to be matched; for example, the regular expression /John|Karen|Steve/ will match a line containing John or Karen or Steve. If Karen, John, or Steve are all on different lines, all lines are matched. Each of the alternative expressions is separated by a vertical bar (pipe symbol) and the expressions can consist of any number of characters, unlike the character class that only matches one character; that is, /a|b|c/ is the same as [abc], whereas /ab|de/ cannot be represented as [abde]. The pattern /ab|de/ is either ab or de, whereas the class [abde] represents only one character in the set: a, b, d, or e.
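Both points can be sketched briefly. The sample lines are assumed for illustration.

```php
<?php
// Hypothetical sample lines in the style of data.txt.
$lines = array(
    "Steve Blenheim 100",
    "Betty Boop 200",
    "Karen Evich 600",
    "Jon DeLoach 500",
);

// Any line containing one of the three names ("Jon" is not "John").
$hits = array_values(preg_grep('/John|Karen|Steve/', $lines));
print implode("\n", $hits) . "\n";

// /ab|de/ needs the pair ab or the pair de; [abde] needs one listed character.
$pairs  = preg_match('/ab|de/', "ad");   // 0
$single = preg_match('/[abde]/', "ad");  // 1
```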
}
}
?>
- (Output)
Steve Blenheim 100
Explanation
data.txt
whileexecute
- (The PHP Script)
<?php
1 $fh=fopen("data.txt", "r");
2 while( ! feof($fh)){
- (Output)
Searching, Capturing, and Replacing
If the search pattern contains parenthesized (captured) strings, those subpatterns can be referenced in the replacement side by either backslashed numbers such as \1, \2, up to \9, or the preferred way since PHP 4.0.4, with $1, $2, up to $99. The number refers to the position where the parenthesized pattern is placed in the search pattern (left to right); for example, the first captured string is referenced in the replacement string as $1, the second as $2, and so on. $0 or \0 refers to the text matched by the entire pattern.
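A minimal sketch of capturing and reusing subpatterns follows; the patterns and the sample line are chosen for illustration. The replacement strings are single-quoted so that PHP passes $1 and $2 to preg_replace() untouched.

```php
<?php
$line = "Steve Blenheim 100";

// Capture first and last name, then swap them in the replacement.
$swapped = preg_replace('/(\w+)\s(\w+)/', '$2, $1', $line);
print $swapped . "\n";    // Blenheim, Steve 100

// $0 stands for the text matched by the whole pattern.
$bracketed = preg_replace('/\d+/', '[$0]', $line);
print $bracketed . "\n";  // Steve Blenheim [100]
```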
<?php
1 $fh=fopen("data.txt", "r");
2 while( ! feof($fh)){
<?php
preg_split() list()
split() preg_replace()
Example 12.40
(The File moredata.txt Contents)
Mama monkey Mama bird Papa
as criteria for the search
}
?>
- (Output)
"[BM][a-z]+"
Mama Goose Norma Goose
Commenting Regular Expressions and the x Modifier
You can add whitespace and comments to a regular expression if you want to clarify how the regular expression is broken down and what each symbol means. This is very helpful in unraveling a long regular expression you might have inherited from another program and are not sure what it does. To do this, append the x modifier after the closing delimiter.
^ # At the beginning of the line
( # start a new subpattern $1
[A-Z] # Find an uppercase letter
[A-Za-z] # find an upper or lowercase letter
* # match it zero or more times
) # close first subpattern
\s # find a whitespace character
( # start another subpattern $2
[A-Z] # match an uppercase letter
[a-zA-Z] # match an upper or lowercase letter
+ # match for one or more of them
) # close the subpattern
\s # match a whitespace character
( # start subpattern $3
\d # match a digit
{3} # match it three times
) # close the subpattern
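Put together as a runnable sketch, the commented pattern above might look like this. The delimiters, the sample line, and the replacement string are assumptions. Note that a comment inside the pattern must not contain the delimiter character, because it would end the pattern early.

```php
<?php
// The x modifier makes PCRE ignore whitespace and #-comments in the pattern.
$regex = '/
    ^                  # at the beginning of the line
    ([A-Z][A-Za-z]*)   # $1: an uppercase letter, then zero or more letters
    \s                 # a whitespace character
    ([A-Z][a-zA-Z]+)   # $2: an uppercase letter, then one or more letters
    \s                 # a whitespace character
    (\d{3})            # $3: exactly three digits
/x';

$line = "Steve Blenheim 100";  // hypothetical sample line

preg_match($regex, $line, $parts);
print $parts[1] . "\n";  // Steve

$rearranged = preg_replace($regex, '$2, $1: $3', $line);
print $rearranged . "\n";  // Blenheim, Steve: 100
```

Because literal whitespace is ignored under the x modifier, real spaces in the subject must be written as \s (or escaped) in the pattern.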
12.2.4 Searching for Patterns in Text Files
You might be using text files, rather than a database, to store information. You can perform pattern matching with regular expressions to find specific data from a file using the PHP built-in functions such as preg_match(), preg_replace(), and so on. In the following example, a form is provided so that the user can select all names and phone numbers within a particular area code found in a text file.
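The core of such a search might be sketched as follows. The helper name lines_in_area_code(), the line format "Name (408) 555-0200", and the sample data are all assumptions made for illustration; the actual file format is not shown here.

```php
<?php
// Hypothetical helper: return the lines whose phone number carries the
// given area code. Assumes area codes appear in parentheses, as in "(408)".
function lines_in_area_code(array $lines, $area_code)
{
    // Accept exactly three digits so that user input never reaches
    // the pattern unchecked.
    if (!preg_match('/^\d{3}\z/', $area_code)) {
        return array();
    }
    return array_values(preg_grep('/\(' . $area_code . '\)/', $lines));
}

// Hypothetical sample data standing in for the text file.
$lines = array(
    "Steve Blenheim (415) 555-0100",
    "Betty Boop (408) 555-0200",
    "Karen Evich (408) 555-0600",
);

$hits = lines_in_area_code($lines, "408");
print implode("\n", $hits) . "\n";
```

Validating the submitted value before building the pattern keeps regular expression metacharacters typed into the form from changing the search.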
<font face="verdana" size="+1">
<form action="patterns.php" method="POST">
<p>
Please enter the area code
1 <input type="text" name="area_code" size=5>
echo "<H2>Names and Phones in $area_code area code</h2>";
5 foreach ($lines as $the_line) {