Complexity evaluation
Maximal common subsequences and minimal common supersequences*
Campbell B. Fraser† and Robert W. Irving
Computing Science Department, University of Glasgow, Glasgow G12 8QQ, Scotland
Martin Middendorf
Institut für Angewandte Informatik und Formale Beschreibungsverfahren, D-76128 Universität Karlsruhe
†Supported by a postgraduate research studentship from the Engineering and Physical Sciences Research Council.
Proposed running head: Common subsequences and supersequences
Key words: string algorithms, subsequence, supersequence, dynamic programming, NP-hard optimisation problems, approximation algorithms
A subsequence of a string $\alpha$ is any string that can be obtained from $\alpha$ by the deletion of zero or more symbols. A supersequence of $\alpha$ is any string that can be obtained from $\alpha$ by the insertion of zero or more symbols. Given a set $S$ of $k$ strings, a common subsequence of $S$ is a string that is a subsequence of every string in $S$, and a common supersequence of $S$ is a string that is a supersequence of every string in $S$.
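The definitions above translate directly into a membership test. The following is a minimal sketch in Python; the function names are illustrative and do not come from the paper.

```python
def is_subsequence(s, t):
    """True if s can be obtained from t by deleting zero or more symbols."""
    it = iter(t)
    # "ch in it" advances the iterator, so symbols are matched left to right.
    return all(ch in it for ch in s)


def is_common_subsequence(s, strings):
    """True if s is a subsequence of every string in the collection."""
    return all(is_subsequence(s, t) for t in strings)


def is_common_supersequence(s, strings):
    """True if every string in the collection is a subsequence of s."""
    return all(is_subsequence(t, s) for t in strings)
```

Each test runs in time linear in the length of the longer string.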
The longest common subsequence (LCS) and shortest common supersequence (SCS) problems are classical problems of stringology, with important applications in computational biology, file comparison, data compression, etc.
A common subsequence $\alpha$ is maximal if no proper supersequence of $\alpha$ is also a common subsequence of $S$; in other words, if $\alpha$ is not contained as a subsequence in any longer common subsequence of $S$. A shortest maximal common subsequence (smcs) of $S$ is a maximal common subsequence of shortest possible length. Clearly, a maximal common subsequence of greatest possible length is just a longest common subsequence, a concept that has been widely explored in the literature.
A common supersequence $\alpha$ is minimal if no proper subsequence of $\alpha$ is also a common supersequence of $S$; in other words, if $\alpha$ does not contain as a subsequence any shorter common supersequence of $S$. A longest minimal common supersequence (lmcs) of $S$ is a minimal common supersequence of longest possible length. Clearly, a minimal common supersequence of shortest possible length is just a shortest common supersequence.
Example. If $\alpha_1 = abc$ and $\alpha_2 = bca$, the maximal common subsequences are $a$ and $bc$, and the unique smcs is $a$, of length 1.
The minimal common supersequences are $bcabc$, $abca$ and $bacbac$, and the unique lmcs is $bacbac$, of length 6.
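For strings as short as those in the example, the definitions can be checked by exhaustive search. The sketch below is an illustration only (exponential time, hypothetical function names). It relies on the observation that a common supersequence is minimal exactly when no single deletion leaves a common supersequence, since any shorter common supersequence contained in it could be extended to one that is shorter by exactly one symbol.

```python
from itertools import combinations


def is_subsequence(s, t):
    it = iter(t)
    return all(ch in it for ch in s)


def subsequences(t):
    """All distinct subsequences of t (exponential; tiny inputs only)."""
    return {"".join(t[k] for k in idx)
            for size in range(len(t) + 1)
            for idx in combinations(range(len(t)), size)}


def maximal_common_subsequences(strings):
    common = set.intersection(*(subsequences(t) for t in strings))
    # Keep those not properly contained in another common subsequence.
    return {g for g in common
            if not any(g != d and is_subsequence(g, d) for d in common)}


def is_minimal_common_supersequence(g, strings):
    if not all(is_subsequence(t, g) for t in strings):
        return False
    # Minimal iff deleting any one symbol breaks the supersequence property.
    return not any(all(is_subsequence(t, g[:k] + g[k + 1:]) for t in strings)
                   for k in range(len(g)))
```

Applied to $abc$ and $bca$, the first function returns the two maximal common subsequences $a$ and $bc$, and the second accepts each of $abca$, $bcabc$ and $bacbac$.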
In this paper, we study the shortest maximal common subsequence problem (SMCS) and the longest minimal common supersequence problem (LMCS) from the complexity point of view. The problems are of genuine interest in their own right, although the original motivation for the study of maximal common subsequences and minimal common supersequences was in the context of approximation algorithms for LCS and SCS. For instance, any approximation algorithm for the LCS of a set of strings will return a common subsequence of the strings, which can then be made maximal by the insertion of zero or more additional characters. The question arises as to the extent to which the length of such a maximal common subsequence might differ from the length of a longest common subsequence. A similar situation holds in the case of approximation algorithms for the SCS of a set of strings.
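The step of making a common subsequence maximal by inserting characters can be sketched as a simple greedy loop. This is an illustrative, unoptimised version and not a procedure given in the paper; the function name is hypothetical. It stops when no single insertion yields a common subsequence, which is enough for maximality because any longer common subsequence containing the current one could be reached one insertion at a time.

```python
def is_subsequence(s, t):
    it = iter(t)
    return all(ch in it for ch in s)


def extend_to_maximal(gamma, strings):
    """Greedily insert symbols into the common subsequence gamma until
    no single insertion gives a common subsequence of all the strings."""
    # Only symbols occurring in the first string can ever be inserted.
    alphabet = sorted(set(strings[0]))
    changed = True
    while changed:
        changed = False
        for pos in range(len(gamma) + 1):
            for ch in alphabet:
                candidate = gamma[:pos] + ch + gamma[pos:]
                if all(is_subsequence(candidate, s) for s in strings):
                    gamma, changed = candidate, True
                    break
            if changed:
                break
    return gamma
```

For the strings $abc$ and $bca$, starting from $b$ gives $bc$, while starting from the empty string gives $a$, so the result depends on the starting point and on the order in which insertions are tried.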
We show that, like the LCS and SCS problems, both of these new problems can be solved in polynomial time by dynamic programming for $k = 2$ (and, by extending the algorithms, for any fixed value of $k$). However, the dynamic programming algorithms are not quite so straightforward as those for the LCS and SCS problems, and have complexities $O(m^2 n)$ and $O(mn(m+n))$ respectively for strings of lengths $m$ and $n$. Note that the existence of polynomial-time algorithms for the SMCS and LMCS problems in the case of two strings is by no means obvious. Consider the problem of finding a maximum cardinality matching in a bipartite graph. This problem is well known to be solvable in polynomial time, whereas the problem of finding a minimum maximal bipartite matching is NP-hard [YG80].
We also show that, as is the case for the LCS and SCS problems, SMCS and LMCS are NP-hard when the number of strings $k$ becomes a problem parameter. However, we pinpoint one interesting difference between the SCS and LMCS problems, namely that the latter, unlike the former, is solvable in polynomial time for strings of length 2. Furthermore, we prove a strong negative result regarding the likely existence of good polynomial-time approximation algorithms for SMCS in the case of general $k$. We leave open the possibility of a good polynomial-time approximation algorithm for LMCS.
When restricted to the case of just two strings $\alpha$ and $\beta$ of lengths $m$ and $n$ respectively, the classical LCS and SCS problems are easily solvable in $O(mn)$ time by dynamic programming. Indeed, in this case, the LCS and SCS problems are dual, in that $s = m + n - l$, where $l$ and $s$ are the lengths of a longest common subsequence and shortest common supersequence respectively. Much effort has gone into finding refinements of dynamic programming and other approaches which lead to improvements in complexity in many cases. See, for example, [AG87, Hir75, Hir77, HS77, Ukk85, WMMM90].
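As a point of reference, the classical $O(mn)$ dynamic programme for two strings and the duality $s = m + n - l$ can be sketched as follows. This is the textbook recurrence, not one of the refined algorithms cited above, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of a and b in O(mn) time."""
    m, n = len(a), len(b)
    # L[i][j] = LCS length of the prefixes a[:i] and b[:j].
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]


def scs_length(a, b):
    """Length of a shortest common supersequence, via s = m + n - l."""
    return len(a) + len(b) - lcs_length(a, b)
```

For $abc$ and $bca$ this gives $l = 2$ and $s = 4$, matching the subsequence $bc$ and the supersequence $abca$.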
The following simple example serves to illustrate the fact that there is no obvious corresponding duality between the SMCS and LMCS problems in the case of two strings.
Example. Let $\alpha = abc$, $\beta = dab$. Then the only maximal common subsequence is $ab$, of length 2, while $dabc$, $abcdab$ and $abdcab$ are some of the minimal common supersequences, the latter two being lmcs's.
It is true, however, that if $\gamma$ is a maximal common subsequence of length $r$ of $\alpha$ and $\beta$, then forming an alignment of $\alpha$ and $\beta$ in which the elements of $\gamma$ are matched reveals a minimal common supersequence, $\delta$ say, of $\alpha$ and $\beta$ of length $m + n - r$. For if $\delta$ were not minimal, then some single symbol $c$ could be removed from $\delta$ to give a shorter common supersequence $\delta'$. But the symbol of $\alpha$ or $\beta$ (or both) represented by $c$ in the alignment must be matched with new symbols in the alignment corresponding to $\delta'$, leading to a contradiction of the maximality of $\gamma$. Hence, if $l'$ is the length of an smcs, and $s'$ the length of an lmcs, it follows that $s' \ge m + n - l'$. That the inequality can be strict is shown by the above example, where $m + n - l' = 4$ but $s' = 6$.
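The alignment construction can be sketched as follows. The choice of the leftmost embedding of $\gamma$ in each string, and the order in which unmatched symbols are emitted ($\alpha$ before $\beta$), are assumptions made for this illustration; the argument above only requires some alignment in which the symbols of $\gamma$ are matched. The function names are hypothetical.

```python
def embed(s, gamma):
    """0-based positions of the leftmost occurrence of gamma in s."""
    positions, start = [], 0
    for ch in gamma:
        k = s.index(ch, start)   # raises ValueError if gamma is not a subsequence
        positions.append(k)
        start = k + 1
    return positions


def supersequence_from_alignment(alpha, beta, gamma):
    """Common supersequence of length m + n - r obtained by matching gamma."""
    pa, pb = embed(alpha, gamma), embed(beta, gamma)
    out, ia, ib = [], 0, 0
    for ch, ka, kb in zip(gamma, pa, pb):
        out.append(alpha[ia:ka])   # unmatched symbols of alpha before the match
        out.append(beta[ib:kb])    # unmatched symbols of beta before the match
        out.append(ch)             # the matched symbol, emitted once
        ia, ib = ka + 1, kb + 1
    out.append(alpha[ia:])
    out.append(beta[ib:])
    return "".join(out)
```

With $\alpha = abc$, $\beta = dab$ and $\gamma = ab$ the result is $dabc$, of length $3 + 3 - 2 = 4$. Minimality of the output depends on $\gamma$ being maximal, as in the argument above; the sketch does not check this.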
Hence the question arises as to whether either or both of the SMCS and LMCS problems can be solved in polynomial time, by dynamic programming or otherwise.
In this section we describe a polynomial-time algorithm to determine the length of an smcs of two strings, and in the following section a polynomial-time algorithm to determine the length of an lmcs of two strings. It turns out that these algorithms will determine the lengths of all maximal common subsequences and all minimal common supersequences respectively. They will also allow the construction of an smcs, and indeed of all the maximal common subsequences (respectively an lmcs, and all minimal common supersequences) of the two strings.
The algorithms use a dynamic programming approach based, as usual, on a table that relates the $i$th prefix $\alpha^i = \alpha[1 \ldots i]$ of $\alpha$ and the $j$th prefix $\beta^j = \beta[1 \ldots j]$ of $\beta$, for $i = 1, \ldots, m$, $j = 1, \ldots, n$, where $m$, $n$ are the lengths of $\alpha$, $\beta$ respectively. However, as we shall see, for each $i, j$ we need to retain rather more information than merely the lengths of the maximal common subsequences, or of the minimal common supersequences, of $\alpha^i$ and $\beta^j$.
The SMCS Algorithm
Given a string $\alpha$ and a subsequence $\gamma$ of $\alpha$, we define
$\mathrm{sp}(\alpha, \gamma)$ = length of the shortest prefix of $\alpha$ that is a supersequence of $\gamma$.
Given strings $\alpha$, $\beta$ of lengths $m$ and $n$ respectively, we define the set $S_{ij}$, for each $i = 1, \ldots, m$, $j = 1, \ldots, n$, by
$S_{ij} = \{(r, (x, y)) : \alpha^i$ and $\beta^j$ have a maximal common subsequence $\gamma$ of length $r$, and $\mathrm{sp}(\alpha, \gamma) = x$, $\mathrm{sp}(\beta, \gamma) = y\}$
with
$S_{00} = \{(0, (0, 0))\}$.
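The quantity $\mathrm{sp}$ can be computed by a single left-to-right scan that embeds $\gamma$ as early as possible. A minimal sketch, with the Python name `sp` chosen to mirror the definition:

```python
def sp(alpha, gamma):
    """Length of the shortest prefix of alpha that is a supersequence of gamma.

    Assumes gamma is a subsequence of alpha; str.index raises ValueError
    otherwise.
    """
    pos = 0  # length of the prefix of alpha consumed so far
    for ch in gamma:
        # Earliest occurrence of ch at 0-based index >= pos.
        pos = alpha.index(ch, pos) + 1
    return pos
```

For instance, $\mathrm{sp}(abc, ab) = 2$ and $\mathrm{sp}(dab, ab) = 3$, so the maximal common subsequence $ab$ of $abc$ and $dab$ contributes the entry $(2, (2, 3))$.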
For string $\alpha$ of length $m$, position $i$ and symbol $a$, we define
$\mathrm{next}_\alpha(i, a) = \min\{k : \alpha[k] = a,\ k > i\}$ if such a $k$ exists.
… of $\alpha$ and $\beta$). Furthermore, suitable tracebacks through the array of $S_{ij}$ values can be used to generate not only an smcs, but all maximal common subsequences.
The basis of the dynamic programming scheme is contained in the following theorem:
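Theorem 1 (i) If $\alpha[i] = \beta[j] = a$ then
$S_{ij} = \{(r, (\mathrm{next}_\alpha(x, a), \mathrm{next}_\beta(y, a))) : (r - 1, (x, y)) \in S_{i-1,j-1}\}$.
(ii) If $\alpha[i] \ne \beta[j]$ then
$S_{ij} = \{(r, (x, j)) \in S_{i-1,j}\} \cup \{(r, (i, y)) \in S_{i,j-1}\} \cup (S_{i-1,j} \cap S_{i,j-1})$.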
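Proof (i) Suppose $(r - 1, (x, y)) \in S_{i-1,j-1}$ and that $\gamma'$ is a maximal common subsequence of $\alpha^{i-1}$ and $\beta^{j-1}$ of length $r - 1$ with $\mathrm{sp}(\alpha, \gamma') = x$ and $\mathrm{sp}(\beta, \gamma') = y$. Then it is immediate that $\gamma = \gamma' + a$ is a maximal common subsequence of $\alpha^i$ and $\beta^j$, and that $\mathrm{sp}(\alpha, \gamma) = \mathrm{next}_\alpha(x, a)$, $\mathrm{sp}(\beta, \gamma) = \mathrm{next}_\beta(y, a)$.
On the other hand, suppose that $\gamma$ is a maximal common subsequence of length $r$ of $\alpha^i$ and $\beta^j$. Then the last symbol of $\gamma$ is $a$, and $\gamma' = \gamma - a$ is certainly a common subsequence of $\alpha^{i-1}$ and $\beta^{j-1}$. If it were not maximal, then some supersequence $\delta$ of $\gamma'$ would be a common subsequence of $\alpha^{i-1}$ and $\beta^{j-1}$, and therefore $\delta + a$, a supersequence of $\gamma$, would be a common subsequence of $\alpha^i$ and $\beta^j$, contradicting the maximality of $\gamma$. So $(r - 1, (\mathrm{sp}(\alpha, \gamma'), \mathrm{sp}(\beta, \gamma'))) \in S_{i-1,j-1}$ and $(r, (\mathrm{sp}(\alpha, \gamma), \mathrm{sp}(\beta, \gamma))) \in S_{ij}$ with $\mathrm{sp}(\alpha, \gamma) = \mathrm{next}_\alpha(\mathrm{sp}(\alpha, \gamma'), a)$ and $\mathrm{sp}(\beta, \gamma) = \mathrm{next}_\beta(\mathrm{sp}(\beta, \gamma'), a)$.
(ii) Suppose $(r, (x, j)) \in S_{i-1,j}$, and that $\gamma$ is a maximal common subsequence of $\alpha^{i-1}$ and $\beta^j$ of length $r$ with $\mathrm{sp}(\alpha, \gamma) = x$, $\mathrm{sp}(\beta, \gamma) = j$. Then $\gamma$ is a common subsequence of $\alpha^i$ and $\beta^j$, and must be maximal since $\gamma + \alpha[i]$ cannot be a subsequence of $\beta^j$. A similar argument holds for $(r, (i, y)) \in S_{i,j-1}$. So $\{(r, (x, j)) \in S_{i-1,j}\} \cup \{(r, (i, y)) \in S_{i,j-1}\} \subseteq S_{ij}$.
Further, if $(r, (x, y)) \in S_{i-1,j} \cap S_{i,j-1}$, then there is a string $\gamma$ with $\mathrm{sp}(\alpha, \gamma) = x < i$, $\mathrm{sp}(\beta, \gamma) = y < j$, of length $r$, which is a maximal common subsequence of $\alpha^i$ and $\beta^{j-1}$, and of $\alpha^{i-1}$ and $\beta^j$. So $\gamma$ must also be a maximal common subsequence of $\alpha^i$ and $\beta^j$. For any supersequence of $\gamma$ that is a subsequence of $\alpha^i$ and $\beta^j$ must either be a subsequence of $\alpha^i$ and $\beta^{j-1}$, or of $\alpha^{i-1}$ and $\beta^j$.
On the other hand, suppose that $\gamma$ is a maximal common subsequence of length $r$ of $\alpha^i$ and $\beta^j$.
Case (iia) $\mathrm{sp}(\alpha, \gamma) = i$. Then $\gamma$ is a maximal common subsequence of $\alpha^i$ and $\beta^{j-1}$, and so $(r, (i, y)) \in S_{i,j-1}$ for some $y$.
Case (iib) $\mathrm{sp}(\beta, \gamma) = j$. Then $\gamma$ is a maximal common subsequence of $\alpha^{i-1}$ and $\beta^j$, and so $(r, (x, j)) \in S_{i-1,j}$ for some $x$.
Case (iic) $\mathrm{sp}(\alpha, \gamma) < i$, $\mathrm{sp}(\beta, \gamma) < j$. Then $\gamma$ is both a maximal common subsequence of $\alpha^{i-1}$ and $\beta^j$, and of $\alpha^i$ and $\beta^{j-1}$. So $(r, (\mathrm{sp}(\alpha, \gamma), \mathrm{sp}(\beta, \gamma))) \in S_{i-1,j} \cap S_{i,j-1}$.
This completes the proof of the theorem. $\Box$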
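The recurrences of Theorem 1 can be sketched directly in Python. Two points are assumptions of this illustration rather than statements from the text: every cell in row 0 or column 0 is taken to be $\{(0, (0, 0))\}$ (with an empty prefix the only common subsequence is the empty string), and `str.index` stands in for precomputed $\mathrm{next}$ tables. Entries $(r, (x, y))$ are stored as flat triples `(r, x, y)`, and no pruning of entries is attempted.

```python
def smcs_table(alpha, beta):
    """Build the sets S[i][j] of Theorem 1 as sets of (r, x, y) triples."""
    m, n = len(alpha), len(beta)
    # Assumed boundary: row 0 and column 0 hold only the empty subsequence.
    S = [[{(0, 0, 0)} if i == 0 or j == 0 else set() for j in range(n + 1)]
         for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if alpha[i - 1] == beta[j - 1]:           # case (i)
                a = alpha[i - 1]
                # s.index(a, x) + 1 is the smallest 1-based k > x with s[k] = a;
                # it always exists here because position i (resp. j) holds a.
                S[i][j] = {(r + 1, alpha.index(a, x) + 1, beta.index(a, y) + 1)
                           for (r, x, y) in S[i - 1][j - 1]}
            else:                                      # case (ii)
                up, left = S[i - 1][j], S[i][j - 1]
                S[i][j] = ({e for e in up if e[2] == j}
                           | {e for e in left if e[1] == i}
                           | (up & left))
    return S


def maximal_lengths(alpha, beta):
    """Sorted lengths of all maximal common subsequences of alpha and beta."""
    final = smcs_table(alpha, beta)[len(alpha)][len(beta)]
    return sorted({r for (r, _, _) in final})
```

For $\alpha = abc$, $\beta = bca$ the final cell is $\{(1, (1, 3)), (2, (3, 2))\}$, corresponding to $a$ and $bc$, so the smcs has length 1.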
Recovering a Shortest Maximal Common Subsequence
The recovery of a particular smcs involves a standard type of traceback through the dynamic programming table from cell $(m, n)$, during which the sequence is constructed in reverse order. To facilitate this traceback, each entry in position $(i, j)$ in the table (for all $i, j$) should have associated with it, during the application of the dynamic programming scheme, one or more pointers indicating which particular element(s) in cells $(i - 1, j)$, $(i, j - 1)$ or $(i - 1, j - 1)$ led to the inclusion of that element in cell $(i, j)$. For example, if $\alpha[i] = \beta[j] = a$, and $(r - 1, (x, y)) \in S_{i-1,j-1}$, then $(r, (\mathrm{next}_\alpha(x, a), \mathrm{next}_\beta(y, a)))$ is placed in cell $(i, j)$ with a pointer to the element $(r - 1, (x, y))$ in cell $(i - 1, j - 1)$.
With these pointers, any path from an element $(r, (x, y))$ in cell $(m, n)$ to the element in cell $(0, 0)$ represents a maximal common subsequence of $\alpha$ and $\beta$ of length $r$, namely the reversed sequence of matching symbols from the two strings corresponding to cells from which the path takes a diagonal step.
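A traceback of this kind might be sketched as follows. For brevity the sketch keeps a single pointer per entry, which is enough to recover one smcs but not to enumerate all maximal common subsequences, and it stops as soon as an entry with $r = 0$ is reached rather than walking all the way to cell $(0, 0)$. The boundary cells and the use of `str.index` in place of $\mathrm{next}$ tables are assumptions of the illustration, as is the function name.

```python
def smcs(alpha, beta):
    """Return one shortest maximal common subsequence of alpha and beta."""
    m, n = len(alpha), len(beta)
    root = (0, 0, 0)
    # cell[i][j] maps an entry (r, x, y) to one pointer
    # (parent_i, parent_j, parent_entry, matched_symbol_or_None).
    cell = [[{root: None} if i == 0 or j == 0 else {} for j in range(n + 1)]
            for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cur = cell[i][j]
            if alpha[i - 1] == beta[j - 1]:            # diagonal step
                a = alpha[i - 1]
                for (r, x, y) in cell[i - 1][j - 1]:
                    e = (r + 1, alpha.index(a, x) + 1, beta.index(a, y) + 1)
                    cur.setdefault(e, (i - 1, j - 1, (r, x, y), a))
            else:
                up, left = cell[i - 1][j], cell[i][j - 1]
                for e in up:                           # vertical step
                    if e[2] == j or e in left:
                        cur.setdefault(e, (i - 1, j, e, None))
                for e in left:                         # horizontal step
                    if e[1] == i:
                        cur.setdefault(e, (i, j - 1, e, None))
    e = min(cell[m][n])          # tuple order puts the smallest r first
    i, j, symbols = m, n, []
    while e[0] > 0:              # every diagonal step reduces r by one
        pi, pj, pe, sym = cell[i][j][e]
        if sym is not None:
            symbols.append(sym)
        i, j, e = pi, pj, pe
    return "".join(reversed(symbols))
```

On the earlier examples this returns $a$ for $abc$ and $bca$, and $ab$ for $abc$ and $dab$.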
Analysis of the SMCS Algorithm
The number of cells in the dynamic programming table is essentially $mn$, so that if we could show that the number of entries in each cell was bounded by, say, $\min(m, n)$, and that the total amount of computation was bounded by a constant times the total number of table entries, then we would have a cubic time worst-case bound for the complexity of the algorithm. However, this turns out not to be the case, as the following example shows.
Example. Consider two strings of length $n = p(p + 1)/2 + q$ over an alphabet $\Sigma = \{a_1, \ldots, a_n\}$, defined as follows.
However, suppose that we wish to find only the length of an smcs (and to construct such a sequence by traceback through the table). Then, if any particular cell in the table contains more than one entry $(r, (x, y))$ with the same $(x, y)$ component, we may discard all but the one with the smallest $r$ value. For if a maximal common subsequence $\gamma$ has a prefix $\gamma'$ such that $\mathrm{sp}(\alpha, \gamma') = x$ and $\mathrm{sp}(\beta, \gamma') = y$, then to make $\gamma$ as short as possible, $\gamma'$ must be chosen as short as possible.
Also, if the entries $(r, (x, y))$ in the $(i, j)$th cell are listed in increasing order of $x$, then they must clearly also be in decreasing order of $y$, and therefore, since $x \le i$, $y \le j$, the number of such entries with distinct $(x, y)$ components cannot exceed $\min(i, j)$. Further, it is easy to see that by processing the lists of cell entries in this fixed order, the amount of work done in computing the contents of cell $(i, j)$ is, in case (i), bounded by a constant times the number of entries in cell $(i - 1, j - 1)$, and in case (ii), bounded by a constant times the sum of the numbers of entries in cells $(i - 1, j)$ and $(i, j - 1)$. (In case (i), this assumes precomputation of the tables of $\mathrm{next}$ values, which can easily be achieved in $O(n|\Sigma|)$ time for a string of length $n$, where $\Sigma$ is the alphabet.)
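The precomputation of the $\mathrm{next}$ values can be sketched as a single right-to-left pass. Positions are 1-based as in the text; representing "no such $k$" by `None` is an assumption of this sketch, as is the function name.

```python
def build_next_table(s, alphabet):
    """nxt[i][a] = smallest 1-based k > i with s[k] == a, or None if none.

    Runs in O(n * |alphabet|) time for a string s of length n.
    """
    n = len(s)
    nxt = [None] * (n + 1)
    nxt[n] = {a: None for a in alphabet}   # nothing lies beyond position n
    for i in range(n - 1, -1, -1):
        row = dict(nxt[i + 1])             # copy costs O(|alphabet|)
        row[s[i]] = i + 1                  # s[i] is the symbol at 1-based position i + 1
        nxt[i] = row
    return nxt
```

After this pass each lookup of $\mathrm{next}_\alpha(x, a)$ in case (i) takes constant time.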
In conclusion, the length of an smcs can be established by a suitably amended version of the above dynamic programming scheme in $O(m^2 n)$ time in the worst case, for strings of lengths $m$ and $n$ ($m \le n$). Furthermore, such a subsequence can also be constructed from the dynamic programming table without increasing that overall time bound. But it remains open whether the lengths of all maximal common subsequences can be established within that time bound. A trivial bound of $O(m^3 n)$ applies in that case, since the number of entries in each cell is certainly bounded by $m^2$.
The LMCS algorithm is not dissimilar in spirit to the SMCS algorithm, and there is a certain duality involving the terms in which the algorithm is expressed
Given strings $\alpha$ and $\gamma$, we define
$\mathrm{lp}(\alpha, \gamma)$ = length of the longest prefix of $\alpha$ that is a subsequence of $\gamma$.
Given strings $\alpha$, $\beta$ of lengths $m$ and $n$ respectively, we define the set $T_{ij}$, for each $i = 0, \ldots, m$, $j = 0, \ldots, n$, by
$T_{ij} = \{(r, (x, y)) :$ there exists a minimal common supersequence $\gamma$ of $\alpha^i$ and $\beta^j$, of length $r$, such that $\mathrm{lp}(\alpha, \gamma) = x$, $\mathrm{lp}(\beta, \gamma) = y\}$.
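The quantity $\mathrm{lp}$ can be computed by one greedy scan of $\gamma$, matching symbols of $\alpha$ as early as possible. A minimal sketch, with the Python name `lp` chosen to mirror the definition:

```python
def lp(alpha, gamma):
    """Length of the longest prefix of alpha that is a subsequence of gamma."""
    k = 0  # number of symbols of alpha matched so far
    for ch in gamma:
        if k < len(alpha) and alpha[k] == ch:
            k += 1
    return k
```

For example, $\mathrm{lp}(abc, dabc) = 3$ and $\mathrm{lp}(dab, dabc) = 3$. Note that $\mathrm{lp}$ is measured against the whole of $\alpha$, so for a supersequence $\gamma$ of the prefix $\alpha^i$ the value can exceed $i$.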
Finally, for string $\alpha$, position $i$ and symbol $a$, we define
$f_\alpha(i, a) = \ldots$ otherwise.
The algorithm for LMCS is based on a dynamic programming scheme for the sets $T_{ij}$ defined above. So evaluation of $T_{mn}$ reveals the length of an lmcs, but also finds the lengths of all minimal common supersequences of $\alpha$ and $\beta$ (indeed of all minimal common supersequences of all pairs of prefixes of $\alpha$ and $\beta$). Furthermore, suitable tracebacks through the array of $T_{ij}$ values can be used to generate not only an lmcs, but all minimal common supersequences.
The zeroth row and column of the $T_{ij}$ table can be evaluated trivially.
Theorem 2 (i) If $\alpha[i] = \beta[j] = a$ then
$T_{ij} = \{(r, (f_\alpha(x, a), f_\beta(y, a))) : (r - 1, (x, y)) \in T_{i-1,j-1}\}$.
(ii) If $\alpha[i] = a \ne b = \beta[j]$ then
$T_{ij} = \{(r, (f_\alpha(x, b), j)) : (r - 1, (x, j - 1)) \in T_{i,j-1}\} \cup \{(r, (i, f_\beta(y, a))) : (r - 1, (i - 1, y)) \in T_{i-1,j}\}$.
Proof (i) Suppose $(r - 1, (x, y)) \in T_{i-1,j-1}$ and that $\gamma'$ is a minimal common supersequence of $\alpha^{i-1}$ and $\beta^{j-1}$ of length $r - 1$ with $\mathrm{lp}(\alpha, \gamma') = x$ and $\mathrm{lp}(\beta, \gamma') = y$. Then it is immediate that $\gamma = \gamma' + a$ is a minimal common supersequence, of length $r$, of $\alpha^i$ and $\beta^j$, and that $\mathrm{lp}(\alpha, \gamma) = f_\alpha(x, a)$ and $\mathrm{lp}(\beta, \gamma) = f_\beta(y, a)$.
On the other hand, suppose that $\gamma$ is a minimal common supersequence of length $r$ of $\alpha^i$ and $\beta^j$. Then $\gamma[r] = a$, and $\gamma' = \gamma - a$ is certainly a common supersequence of $\alpha^{i-1}$ and $\beta^{j-1}$. If $\gamma'$ were not minimal, then some subsequence $\delta$ of $\gamma'$ would be a common supersequence of $\alpha^{i-1}$ and $\beta^{j-1}$, and therefore $\delta + a$, a subsequence of $\gamma$, would be a common supersequence of $\alpha^i$ and $\beta^j$, contradicting the minimality of $\gamma$. So $(r - 1, (x, y)) \in T_{i-1,j-1}$