STRING SEARCHING
This leads to the very simple pattern-matching algorithm implemented below. The program assumes the same index function as above, but d=32 is used for efficiency (the multiplications might be implemented as shifts).
function rksearch: integer;
const q=33554393; d=32;
var h1, h2, dM, i: integer;
begin
dM:=1; for i:=1 to M-1 do dM:=(d*dM) mod q;
h1:=0; for i:=1 to M do h1:=(h1*d+index(p[i])) mod q;
h2:=0; for i:=1 to M do h2:=(h2*d+index(a[i])) mod q;
i:=1;
while (h1<>h2) and (i<=N-M) do
  begin
  h2:=(h2+d*q-index(a[i])*dM) mod q;
  h2:=(h2*d+index(a[i+M])) mod q;
  i:=i+1;
  end;
rksearch:=i;
end;
The program first computes a hash value h1 for the pattern, then a hash value h2 for the first M characters of the text. (Also it computes the value of d^(M-1) mod q in the variable dM.) Then it proceeds through the text string, using the technique above to compute the hash function for the M characters starting at position i for each i, comparing each new hash value to h1. The prime q is chosen to be as large as possible, but small enough that (d+1)*q doesn’t cause overflow: this requires fewer mod operations than if we used the largest representable prime. (An extra d*q is added during the h2 calculation to make sure that everything stays positive so that the mod operation works as it should.)
This algorithm obviously takes time proportional to N + M. Note that it really only finds a position in the text which has the same hash value as the pattern, so, to be sure, we really should do a direct comparison of that text with the pattern. However, the use of such a large value of q, made possible by the mod computations and by the fact that we don’t have to keep the actual hash table around, makes it extremely unlikely that a collision will occur. Theoretically, this algorithm could still take NM steps in the (unbelievably) worst case, but in practice the algorithm can be relied upon to take about N + M steps.
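The direct comparison mentioned above is easy to add. The following is a minimal sketch in Python rather than the Pascal used in the text; the function name rk_search, the use of ord as a stand-in for the index function, 0-based positions, and the -1 failure value are all assumptions made for illustration.

```python
def rk_search(pattern: str, text: str, d: int = 32, q: int = 33554393) -> int:
    """Return the 0-based position of the first match, or -1 if there is none."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1

    def index(c: str) -> int:
        # Assumed stand-in for the text's index function.
        return ord(c)

    d_m = pow(d, m - 1, q)  # d^(M-1) mod q
    h1 = h2 = 0
    for k in range(m):
        h1 = (h1 * d + index(pattern[k])) % q
        h2 = (h2 * d + index(text[k])) % q

    for i in range(n - m + 1):
        # Equal hash values only suggest a match, so compare directly to be sure.
        if h1 == h2 and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Python's % never yields a negative result for a positive modulus,
            # so the extra d*q term of the Pascal version is not needed here.
            h2 = (h2 - index(text[i]) * d_m) % q    # drop the leading character
            h2 = (h2 * d + index(text[i + m])) % q  # bring in the next character
    return -1
```

With the direct comparison in place, a hash collision costs one extra M-character compare instead of producing a wrong answer.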
Multiple Searches
The algorithms that we’ve been discussing are all oriented towards a specific string searching problem: find an occurrence of a given pattern in a given text string. If the same text string is to be the object of many pattern searches, then it will be worthwhile to do some processing on the string to make subsequent searches efficient.
If there are a large number of searches, the string searching problem can be viewed as a special case of the general searching problem that we studied in the previous section. We simply treat the text string as N overlapping “keys,” the ith key defined to be a[i..N], the entire text string starting at position i. Of course, we don’t manipulate the keys themselves, but pointers to them: when we need to compare keys i and j we do character-by-character compares starting at positions i and j of the text string. (If we use a “sentinel” character larger than all other characters at the end, then one of the keys will always be greater than the other.) Then the hashing, binary tree, and other algorithms of the previous section can be used directly. First, an entire structure is built up from the text string, and then efficient searches can be performed for particular patterns.
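One way to picture the "pointers to keys" idea is to sort the N starting positions by the suffixes they point to and then binary-search that sorted list. The sketch below is in Python and only illustrates the idea; the names build_suffix_index and find_all are hypothetical, positions are 0-based, and instead of a sentinel character it treats a suffix that runs out first as the smaller key.

```python
from functools import cmp_to_key


def build_suffix_index(text: str) -> list:
    """Sort the starting positions 0..N-1 by the suffixes they point to."""
    n = len(text)

    def compare(i: int, j: int) -> int:
        # Character-by-character compare starting at positions i and j.
        while i < n and j < n:
            if text[i] != text[j]:
                return -1 if text[i] < text[j] else 1
            i += 1
            j += 1
        # Assumption: the key that runs out first counts as smaller.
        return (i < n) - (j < n)

    return sorted(range(n), key=cmp_to_key(compare))


def find_all(text: str, index: list, pattern: str) -> list:
    """Binary-search the sorted pointers for keys that begin with pattern."""
    m = len(pattern)
    lo, hi = 0, len(index)
    while lo < hi:  # locate the first key whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[index[mid]:index[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    matches = []
    while lo < len(index) and text[index[lo]:index[lo] + m] == pattern:
        matches.append(index[lo])
        lo += 1
    return sorted(matches)
```

The index is built once; each later search then needs only a logarithmic number of key comparisons plus the work of reporting the matches.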
There are many details which need to be worked out in applying searching algorithms to string searching in this way; our intent is to point this out as a viable option for some string searching applications. Different methods will be appropriate in different situations. For example, if the searches will always be for patterns of the same length, a hash table constructed with a single scan as in the Rabin-Karp method will yield constant search times on the average. On the other hand, if the patterns are to be of varying length, then one of the tree-based methods might be appropriate. (Patricia is especially adaptable to such an application.)
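For the fixed-length case, a rough Python sketch of such a table follows. It keys each length-M window of the text by its rolling hash value, computed in one scan, and confirms candidates by direct comparison at lookup time. The names build_hash_table and lookup, the use of ord for the index function, and 0-based positions are assumptions for illustration.

```python
from collections import defaultdict

D, Q = 32, 33554393  # same constants as the search routine in the text


def window_hash(s: str) -> int:
    """Hash of a whole string, computed the same way as h1 in the text."""
    h = 0
    for c in s:
        h = (h * D + ord(c)) % Q
    return h


def build_hash_table(text: str, m: int) -> dict:
    """Map the hash of every length-m window to its starting positions."""
    table = defaultdict(list)
    if m == 0 or m > len(text):
        return table
    d_m = pow(D, m - 1, Q)
    h = window_hash(text[:m])
    table[h].append(0)
    for i in range(1, len(text) - m + 1):
        h = (h - ord(text[i - 1]) * d_m) % Q   # drop the character leaving the window
        h = (h * D + ord(text[i + m - 1])) % Q  # add the character entering it
        table[h].append(i)
    return table


def lookup(text: str, table: dict, pattern: str) -> list:
    """Positions where pattern occurs; candidates are verified directly."""
    m = len(pattern)
    return [i for i in table.get(window_hash(pattern), []) if text[i:i + m] == pattern]
```

For example, after `table = build_hash_table("ABRACADABRA", 4)`, the call `lookup("ABRACADABRA", table, "ABRA")` returns `[0, 7]`.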
Other variations in the problem can make it significantly more difficult and lead to drastically different methods, as we’ll discover in the next two chapters.
Exercises
1. Implement a brute-force pattern matching algorithm that scans the pattern from right to left.
2. Give the next table for the Knuth-Morris-Pratt algorithm for the pattern AAIWUAA.
3. Give the next table for the Knuth-Morris-Pratt algorithm for the pattern ABRACADABRA.
4. Draw a finite state machine which can search for the pattern ABRACADABRA.
5. How would you search a text file for a string of 50 consecutive blanks?
6. Give the right-to-left skip table for the right-left scan for the pattern ABRACADABRA.
7. Construct an example for which the right-to-left pattern scan with only the mismatch heuristic performs badly.
8. How would you modify the Rabin-Karp algorithm to search for a given pattern with the additional proviso that the middle character is a “wild card” (any text character at all can match it)?
9. Implement a version of the Rabin-Karp algorithm that can find a given two-dimensional pattern in a given two-dimensional text. Assume both pattern and text are rectangles of characters.
10. Write programs to generate a random 1000-bit text string, then find all occurrences of the last k bits elsewhere in the string, for k = 5, 10, 15. (Different methods might be appropriate for different values of k.)
20. Pattern Matching
It is often desirable to do string searching with somewhat less than complete information about the pattern to be found. For example, the user of a text editor may wish to specify only part of his pattern, or he may wish to specify a pattern which could match a few different words, or he might wish to specify that any number of occurrences of some specific characters should be ignored. In this chapter we’ll consider how pattern matching of this type can be done efficiently.
The algorithms in the previous chapter have a rather fundamental dependence on complete specification of the pattern, so we have to consider different methods. The basic mechanisms that we will consider make possible a very powerful string searching facility which can match complicated M-character patterns in N-character text strings in time proportional to MN.
First, we have to develop a way to describe the patterns: a “language” that can be used to specify, in a rigorous way, the kinds of partial string searching problems suggested above. This language will involve more powerful primitive operations than the simple “check if the ith character of the text string matches the jth character of the pattern” operation used in the previous chapter. In this chapter, we consider three basic operations in terms of an imaginary type of machine that has the capability for searching for patterns in a text string. Our pattern-matching algorithm will be a way to simulate the operation of this type of machine. In the next chapter, we’ll see how to translate from the pattern specification (which the user employs to describe his string searching task) to the machine specification (which the algorithm employs to actually carry out the search).
As we’ll see, this solution to the pattern matching problem is intimately related to fundamental processes in computer science. For example, the method that we will use in our program to perform the string searching task implied by a given pattern description is akin to the method used by the Pascal system to perform the computational task implied by a given Pascal program.
Describing Patterns
We’ll consider pattern descriptions made up of symbols tied together with the following three fundamental operations.
(i) Concatenation. This is the operation used in the last chapter. If two characters are adjacent in the pattern, then there is a match if and only if the same two characters are adjacent in the text. For example, AB means A followed by B.
(ii) Or. This is the operation that allows us to specify alternatives in the pattern. If we have an “or” between two characters, then there is a match if and only if either of the characters occurs in the text. We’ll denote this operation by using the symbol + and use parentheses to allow it to be combined with concatenation in arbitrarily complicated ways. For example, A+B means “either A or B”; C(AC+B)D means “either CACD or CBD”; and (A+C)((B+C)D) means “either ABD or CBD or ACD or CCD.”
(iii) Closure. This operation allows parts of the pattern to be repeated arbitrarily. If we have the closure of a symbol, then there is a match if and only if the symbol occurs any number of times (including 0). Closure will be denoted by placing a * after the character or parenthesized group to be repeated. For example, AB* matches strings consisting of an A followed by any number of B’s, while (AB)* matches strings consisting of alternating A’s and B’s.
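The examples in the list above can be checked mechanically. The sketch below is in Python and leans on Python's re module as a stand-in for the machinery developed later in the chapter; the only translation needed is that the chapter's + becomes |. The helper names to_python_regex and described_strings are hypothetical.

```python
import re
from itertools import product


def to_python_regex(pattern: str) -> str:
    """Translate the chapter's notation: '+' (or) becomes '|'; spaces are dropped."""
    return pattern.replace(" ", "").replace("+", "|")


def described_strings(pattern: str, alphabet: str, max_len: int) -> set:
    """All strings over alphabet, at most max_len long, described by the whole pattern."""
    regex = re.compile(to_python_regex(pattern))
    found = set()
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            candidate = "".join(chars)
            if regex.fullmatch(candidate):
                found.add(candidate)
    return found


print(sorted(described_strings("C(AC+B)D", "ABCD", 4)))       # ['CACD', 'CBD']
print(sorted(described_strings("(A+C)((B+C)D)", "ABCD", 3)))  # ['ABD', 'ACD', 'CBD', 'CCD']
print(sorted(described_strings("AB*", "AB", 3)))              # ['A', 'AB', 'ABB']
print(sorted(described_strings("(AB)*", "AB", 4)))            # ['', 'AB', 'ABAB']
```

The last line shows that closure includes zero repetitions: the empty string is among the strings described by (AB)*.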
A string of symbols built up using these three operations is called a regular expression. Each regular expression describes many specific text patterns. Our goal is to develop an algorithm which will determine if any of the patterns described by a given regular expression occur in a given text string.
We’ll concentrate on concatenation, or, and closure in order to show the basic principles in developing a regular-expression pattern matching algorithm. Various additions are commonly made in actual systems for convenience. For example, ¬A might mean “match any character except A.” This not operation is the same as an or involving all the characters except A but is much easier to use. Similarly, “?” might mean “match any letter.” Again, this is obviously much more compact than a large or. Other examples of additional symbols which might make specification of large patterns easier are symbols which match the beginning or end of a line, any letter or any number, etc.
These operations can be remarkably descriptive. For example, the pattern description ?*(ie + ei)?* matches all words which have ie or ei in them (and so are likely to be misspelled!); and (1 + 01)*(0 + 1) describes all strings of 0’s and 1’s which do not have two consecutive 0’s. Obviously there are many different pattern descriptions which describe the same strings: we must try to specify succinct pattern descriptions just as we try to write efficient algorithms.
The pattern matching algorithm that we’ll examine may be viewed as a generalization of the brute force left-to-right string searching method (the first method that we looked at in Chapter 19). The algorithm looks for the leftmost substring in the text string which matches the pattern description by scanning the text string from left to right, testing, at each position, whether there is a substring beginning at that position which matches the pattern description.
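The left-to-right scan just described can be sketched in a few lines of Python. This is only an outline of the outer loop: the per-position test is delegated to Python's re module, not to the machine developed in this chapter. The name leftmost_match is hypothetical, positions are 0-based, and treating ? as "any letter" via [A-Za-z] is an assumption.

```python
import re


def leftmost_match(pattern: str, text: str):
    """Return (start, matched_substring) for the leftmost match, or None."""
    translated = (
        pattern.replace(" ", "")
        .replace("+", "|")          # the chapter's "or"
        .replace("?", "[A-Za-z]")   # assumed meaning: any letter
    )
    regex = re.compile(translated)
    for start in range(len(text) + 1):
        # Pattern.match with a pos argument tests only for a match
        # beginning exactly at `start`, like one brute-force attempt.
        m = regex.match(text, start)
        if m:
            return start, m.group()
    return None


print(leftmost_match("C(AC+B)D", "ABCBDCACD"))      # (2, 'CBD')
print(leftmost_match("?*(ie + ei)?*", "friend"))    # (0, 'friend')
print(leftmost_match("(ie + ei)", "a big cat"))     # None
```

Several substrings may match at a given starting position; this sketch reports whichever one re happens to find first there.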
Pattern Matching Machines
Recall that we can view the Knuth-Morris-Pratt algorithm as a finite-state machine constructed from the search pattern which scans the text. The method we will use for regular-expression pattern matching is a generalization of this.
The finite-state machine for the Knuth-Morris-Pratt algorithm changes from state to state by looking at a character from the text string and then changing to one state if there’s a match, to another state if not. A mismatch at any point means that the pattern couldn’t occur in the text starting at that point. The algorithm itself can be thought of as a simulation of the machine. The characteristic of the machine that makes it easy to simulate is that it is deterministic: each state transition is completely determined by the next input character.
To handle regular expressions, it will be necessary to consider a more powerful abstract machine. Because of the or operation, the machine can’t determine whether or not the pattern could occur at a given point by examining just one character; in fact, because of closure, it can’t even determine how many characters might need to be examined before a mismatch is discovered. The most natural way to overcome these problems is to endow the machine with the power of nondeterminism: when faced with more than one way to try to match the pattern, the machine should “guess” the right one! This operation seems impossible to allow, but we will see that it is easy to write a program to simulate the actions of such a machine.
For example, the following diagram shows a nondeterministic finite-state machine that could be used to search for the pattern description (A*B+AC)D in a text string.
As in the deterministic machine of the previous chapter, the machine can travel from a state labeled with a character to the state “pointed to” by that state by matching (and scanning past) that character in the text string. What makes the machine nondeterministic is that there are some states (called null states) which not only are not labeled, but also can “point to” two different successor states. (Some null states, such as state 4 in the diagram, are “no-op” states with one exit, which don’t affect the operation of the machine, but which make easier the implementation of the program which constructs the machine, as we’ll see. State 9 is a null state with no exits, which stops the machine.) When in such a state, the machine can go to either successor state regardless of what’s in the input (without scanning past anything). The machine has the power to guess which transition will lead to a match for the given text string (if any will). Note that there are no “non-match” transitions as in the previous chapter: the machine fails to find a match only if there is no way even to guess a sequence of transitions that leads to a match.
The machine has a unique initial state (which is pointed to by a “free” arrow) and a unique final state (which has no arrows going out). When started out in the initial state, the machine should be able to “recognize” any string described by the pattern by reading characters and changing state according to its rules, ending up in the “final state.” Because it has the power of nondeterminism, the machine can guess the sequence of state changes that can lead to the solution. (But when we try to simulate the machine on a standard computer, we’ll have to try all the possibilities.) For example, to determine if its pattern description (A*B+AC)D can occur in the text string
CDAABCAAABDDACDAAC
the machine would immediately report failure if started on the first or second character; it would work some to report failure on the next two characters; it would immediately report failure on the fifth or sixth characters; and it would guess the sequence of state transitions
1 2 3 4 8 9
to recognize AAABD if started on the seventh character.
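"Trying all the possibilities" can be done by tracking the set of states the machine could be in after each character. The Python sketch below does this for (A*B+AC)D. The state table is an assumption: the diagram is not reproduced here, so the numbering was chosen only to agree with the description in the text (state 0 pseudo-initial, state 4 a no-op, state 9 final). The names CH, NEXT1, NEXT2, closure and match_at are likewise hypothetical, and positions are 0-based.

```python
# Assumed table for (A*B+AC)D; None marks a null (unlabeled) state.
#        0     1    2     3    4     5     6    7    8    9
CH    = [None, "A", None, "B", None, None, "A", "C", "D", None]
NEXT1 = [5,    2,   3,    4,   8,    6,    7,   8,   9,   0]
NEXT2 = [5,    2,   1,    4,   8,    2,    7,   8,   9,   0]
FINAL = 9


def closure(states):
    """Add every state reachable through null transitions, which scan nothing."""
    seen, stack = set(), list(states)
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        if CH[s] is None and s != FINAL:
            stack.extend((NEXT1[s], NEXT2[s]))
    return seen


def match_at(text, start):
    """End position of the shortest match beginning at start, or None."""
    current = closure({NEXT1[0]})  # state 0 only points at the real initial state
    pos = start
    while True:
        if FINAL in current:
            return pos
        if pos == len(text) or not current:
            return None
        # Advance every labeled state whose character matches the text.
        current = closure({NEXT1[s] for s in current if CH[s] == text[pos]})
        pos += 1


text = "CDAABCAAABDDACDAAC"
start = next(i for i in range(len(text)) if match_at(text, i) is not None)
print(start, text[start:match_at(text, start)])  # 6 AAABD
```

The printed position 6 is 0-based, so it corresponds to the seventh character of the text string.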
We can construct the machine for a given regular expression by building partial machines for parts of the expression and defining the ways in which two partial machines can be composed into a larger machine for each of the three operations: concatenation, or, and closure.
We start with the trivial machine to recognize a particular character. It’s convenient to write this as a two-state machine, with one initial state (which also recognizes the character) and one final state, as below:
[Diagram: two-state machine for a single character]
Now to build the machine for the concatenation of two expressions from the machines for the individual expressions, we simply merge the final state of the first with the initial state of the second:
Similarly, the machine for the or operation is built by adding a new null state pointing to the two initial states, and making one final state point to the other, which becomes the final state of the combined machine.
Finally, the machine for the closure operation is built by making the final state the initial state and making it point back to the old initial state and a new final state.
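The composition rules above can be sketched as a small Python class. This is a simplified variant, not the construction program of the text: for concatenation it links the first machine's final state to the second machine's initial state with a null transition instead of merging the two states, so the state count and numbering differ from the example machine. The class name Machine and its method names are hypothetical; each partial machine is a pair (initial state, final state).

```python
class Machine:
    def __init__(self):
        self.ch, self.next1, self.next2 = [], [], []

    def new_state(self, ch=None):
        self.ch.append(ch)
        self.next1.append(None)
        self.next2.append(None)
        return len(self.ch) - 1

    def link(self, s, t1, t2=None):
        self.next1[s] = t1
        self.next2[s] = t1 if t2 is None else t2

    def char(self, c):
        start, final = self.new_state(c), self.new_state()
        self.link(start, final)
        return start, final

    def concat(self, a, b):
        self.link(a[1], b[0])          # assumption: link rather than merge
        return a[0], b[1]

    def alt(self, a, b):
        start = self.new_state()
        self.link(start, a[0], b[0])   # new null state points to both initial states
        self.link(a[1], b[1])          # one final state points to the other
        return start, b[1]

    def star(self, a):
        start, final = a[1], self.new_state()
        self.link(start, a[0], final)  # old final state becomes the initial state
        return start, final

    def closure(self, states):
        seen, stack = set(), list(states)
        while stack:
            s = stack.pop()
            if s in seen:
                continue
            seen.add(s)
            if self.ch[s] is None and self.next1[s] is not None:
                stack += [self.next1[s], self.next2[s]]
        return seen

    def recognizes(self, frag, text):
        """True if the partial machine frag recognizes the whole of text."""
        current = self.closure({frag[0]})
        for c in text:
            current = self.closure({self.next1[s] for s in current if self.ch[s] == c})
        return frag[1] in current


m = Machine()
left = m.concat(m.star(m.char("A")), m.char("B"))   # A*B
right = m.concat(m.char("A"), m.char("C"))          # AC
frag = m.concat(m.alt(left, right), m.char("D"))    # (A*B+AC)D
print(m.recognizes(frag, "AAABD"), m.recognizes(frag, "AD"))  # True False
```

Because every partial machine is used exactly once, each final state receives its outgoing links at most once, which keeps the two-arrows-per-state structure intact.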
A machine can be built which corresponds to any regular expression by successively applying these rules. The numbers of the states for the example machine above are in order of creation as the machine is built by scanning the pattern from left to right, so the construction of the machine from the rules above can be easily traced. Note that we have a 2-state trivial machine for each letter in the regular expression, and each + and * causes one state to be created (concatenation causes one to be deleted) so the number of states is certainly less than twice the number of characters in the regular expression.
Representing the Machine
Our nondeterministic machines will all be constructed using only the three composition rules outlined above, and we can take advantage of their simple structure to manipulate them in a straightforward way. For example, no state has more than two arrows leaving it. In fact, there are only two types of states: those labeled by a character from the input alphabet (with one arrow leaving) and unlabeled (null) states (with two or fewer arrows leaving). This means that the machine can be represented with only a few pieces of information per node. For example, the machine above might be represented as follows:
State Character Next 1 Next 2
The rows in this table may be interpreted as instructions to the nondeterministic machine of the form “If you are in State and you see Character then scan the character and go to state Next 1 (or Next 2).” State 9 is the final state in this example, and State 0 is a pseudo-initial state whose Next 1 entry