I am extremely grateful to those who have helped me in different ways to materialize this thesis. First of all, I wish to thank my supervisor, Dr Stéphane Bressan, for providing me with his extremely valuable guidance and for teaching me what research is all about. His constant motivation and deep insight have enabled me to develop as a researcher.
I sincerely thank Dr Anirban Mondal and Mr Vinsensius Vega for the tremendous help and support that they extended to me. I would also like to acknowledge Dr Ng Wee Siong, Mr Anand Ramchand, Mr Li Shiau Cheng, Mr Ajay Hemnani, Mr Liau Chu Yee, Mr Tok Wee Hyong, Mr Li Yingguang, Mr Ong Twee Hee and all the members of the Database and Electronic Commerce Laboratories for their friendship and willingness to help me in various ways.
I ardently wish to thank my family for their tremendous support. Last, but not least, I sincerely thank the National University of Singapore for providing me with the opportunity to complete my postgraduate studies.
Summary
1.1 Summary of Contributions
1.2 Organization of the Thesis
2 Related Work and Background Information
2.1 Approximate Matching in Strings
2.1.1 Definition
2.1.2 Problem Definition
2.1.3 Algorithms
2.1.4 Timeline
2.2 Approximate Matching in Trees and Graphs
2.2.1 Definition
2.2.2 Problem Definition
2.2.3 Algorithms
2.2.4 Timeline
2.3 Approximate Matching of Strings and Regular Expressions
2.3.1 Definition
2.3.2 Problem Definition
2.3.3 Algorithms
2.3.4 Timeline
3 Approximate matching of trees and graphs under the degree-1 constraint
3.1 The degree-1 Constraint
3.1.1 Edit Distance
3.1.2 The degree-1 concept
3.2 The Algorithms
3.2.1 Preliminaries
3.2.2 Ordered Tree Algorithm
3.2.3 Unordered Tree Algorithm
3.2.4 Acyclic Graph Algorithm
3.3 Complexity Analysis
3.4 Example
3.5 Summary
4 Approximate matching of strings and regular expressions
4.1 Problem Definition
4.2 Background Information
4.2.1 Grammars and Languages
4.2.2 Regular Expressions
4.2.3 Finite State Automata
4.2.4 Chomsky Hierarchy
4.2.5 Types of Regular Expressions
4.3 A Simple String to Regular Expression Pattern Matching Machine
4.4 Existing Algorithm - Myers and Miller
4.4.1 Discussion
4.5 Our Algorithm - RPM
4.5.1 Background
4.5.2 The Idea
4.5.3 The Algorithm
4.5.4 Examples
4.6 Performance Evaluation
4.6.1 Experimental Setup
4.6.2 Results
4.7 Summary
1.1 HTML Code and its tree representation
2.1 An alignment for S = abcaacaca and T = acaacacda
2.2 Edit Distance Matrix for S1 = abba and S2 = aabaca
2.3 Ordered Tree Example
2.4 Unordered Trees Example
2.5 Delete and Insert Edit operations on Trees
3.1 Ordered Tree Algorithm under the degree-1 constraint
3.2 Bipartite Graph G_B = (V1, V2, E)
3.3 (a) Matching M_B1 (b) Matching M_B2
3.4 Unordered Tree Algorithm under the degree-1 constraint
3.5 Acyclic Graph Algorithm under the degree-1 constraint
3.6 Two Example Trees
3.7 Ordered Edit Distance Matrix
3.8 Unordered Edit Distance Matrix
4.1 DFA Example
4.2 NFA Example
4.3 Class 1 Regular Expression
4.4 Class 2 Regular Expression
4.5 Class 3 Regular Expression
4.6 Class 4 Regular Expression
4.7 Class 5 Regular Expression
4.8 Special Arcs
4.9 Transition graph for R = a*b
4.10 Myers representation: F_a
4.11 Myers representation: F_R|S
4.12 Myers representation: F_RS
4.13 Myers representation: F_R*
4.14 Myers algorithm for approximate matching of regular expression
4.15 Our algorithm for approximate matching of regular expression
4.16 RPM Example 1: R = ab*cb and S = bbbaab
4.17 RPM Example 2: R = ca*ab* and S = baaabb
4.18 RPM Example 3: R = abb*a*bc and S = ccbbbc
4.19 Performance Analysis: Varying Length of Regular Expression, |R|
4.20 Performance Analysis: Varying Length of String, |S|
4.21 Performance Analysis: Varying size of alphabet Σ, |Σ|
4.22 Performance Analysis: Varying number of Kleene closures, |*|
4.23 Performance Analysis: Special Cases
5.1 Myers Example: R = a* and S = aaa
5.2 Myers Edit Distance Matrix
Summary
Approximate pattern matching techniques in various structures such as strings, trees, graphs and regular expressions form the basis of many commercial applications available today in important fields such as bio-informatics and information extraction. This thesis presents a detailed review of some of the basic and important algorithms and ideas of the past 40 years in the area of approximate pattern matching. In particular, we address the problem of approximate pattern matching for structures that are more complicated than strings, namely trees and regular expressions. Approximate pattern matching of complex structures such as trees, graphs and regular expressions is a primitive operation essential to applications in information retrieval, information integration and mediation, and in many other domains that require evaluating or characterizing the similarity between structured and complex objects such as HTML documents, molecular compounds and XML data.
The main contributions of our work are two-fold. First, we present new algorithms for the approximate matching of trees (ordered and unordered) and acyclic graphs based on edit distance measures under the degree-1 constraint, the implication being that the relevant information is located at the leaves of a tree or at the periphery of a graph. Under the degree-1 constraint, edit operations can be performed only on vertices with degree ≤ 1. The ordered and unordered tree algorithms have a worst-case execution time of $O(|T_1| \cdot |T_2| \cdot k^2 \log k)$ and the algorithm for acyclic graphs has a worst-case execution time of $O(|T_1|^2 \cdot |T_2|^2 \cdot k^2 \log k)$. Second, we consider the problem of approximate matching of a string with a special type of regular expression where the Kleene closure (*) is only allowed to be bound to a single character. In this regard, we present a new algorithm which exploits the special properties of such a regular expression, thereby enabling us to reduce the approximate regular expression matching problem to that of an approximate string matching problem. Our algorithm for approximate matching of a string S with a class 2 regular expression R, which we designate as RPM, runs in $O(|S|^3)$ time and space in the worst case. Our performance evaluation indicates that our proposed techniques outperform an existing well-known algorithm for approximate regular expression matching in terms of execution time. This may be primarily attributed to the approximate string matching nature of our algorithm, which makes use of simple arithmetic operations.
We plan to extend the work done in this thesis in the near future by addressing more complex issues in the field of approximate pattern matching. This research effort has laid the foundation for considering the problem of approximate matching of two regular expressions, but we believe that there are still many open research issues in this field. In addition, we also aim to employ multiple sequence alignment techniques in order to derive a valid or an optimal schedule in a client-server architecture under delay constraints.
Approximate pattern matching techniques in various structures such as strings, trees, graphs and regular expressions form the basis of many important as well as diverse commercial applications, ranging from traditional applications associated with information extraction to more specialized applications involving bio-informatics. The World Wide Web (WWW) is growing at an exponential rate with new websites emerging every day. The WWW hosts and serves large amounts of documents containing data (primarily in textual form) pertaining to essentially all domains of human activity, e.g., art, education, travel, science, politics and business, thereby making it a very large-scale distributed global information resource residing on the Internet. Notably, the information on the WWW is potentially useful for both individuals and businesses. There is an increasing need for the convergence of database and information retrieval support to new application domains such as information interchange over the Internet with XML, bio-computing, distributed directory servers with LDAP, or the management of hypermedia over the World Wide Web. At the heart of this convergence is the possibility of evaluating the similarity of the objects in question according to appropriate metrics. Objects in the applications mentioned above have in common their complex structure, which is often that of a tree or a graph. Approximate matching algorithms for trees and graphs provide a variety of similarity measures for these objects.
The HyperText Markup Language (HTML) is the lingua franca for publishing data on the Web. Unfortunately, HTML has been designed for display purposes, the implication being that it is primarily meant for human consumption as opposed to machine consumption. But for the WWW to reach its full potential, the data should be defined and linked in such a way that it can be used for more effective discovery, automation, integration, and reuse across various applications. Just as people need to agree on the meanings of the words which they employ in their communication, computers need mechanisms for agreeing on the semantics of the data in order to communicate effectively. The World Wide Web Consortium (W3C), in collaboration with a large number of researchers and industrial partners, is now exploring the possibility of creating a Semantic Web, in which the meaning is made explicit (through the use of meta-data), thereby allowing machines to process and integrate Web resources intelligently.
Intuitively we can understand that documents in a markup language can generally be represented as a tree, as illustrated in Figure 1.1. Being able to evaluate the similarity between two documents is the basis for search and integration. Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing and semantic query processing. Automatic schema matching has become one of the key areas of research in the field of computer science due to the rapidly increasing number of web data sources and E-businesses to integrate. As systems become able to handle more complex databases and applications, their schemas become larger, further increasing the number of matches to be performed. Seeing how data in an HTML or XML document is primarily stored at the leaves or on the periphery of the tree or graph that describes it, we devise an algorithm that takes advantage of this property when comparing two different structures.
Figure 1.1: HTML Code and its tree representation
Although data may be represented in various ways, text still remains the primary medium for exchanging information on the WWW. This is particularly evident in the domains of literature or linguistics, where data are composed of huge corpora and dictionaries. This is also applicable to computer science, where a large amount of data is stored in linear files. Given such a predominantly textual nature of the data residing in the WWW, efficient pattern matching algorithms and compression techniques become necessary to address semantic issues associated with the data. While the problem of pattern matching pertains to locating a specific pattern inside raw data (a pattern is usually a collection of strings described in some formal language), the aim of data compression is to provide a representation of data in a reduced form in order to save both storage space and transmission time such that there is no loss of information (the compression processes are reversible).
Incidentally, both pattern matching and compression techniques apply to the manipulation of texts (word editors), the storage of textual data (text compression) and data retrieval systems (full text search). Additionally, they are basic components used in the implementations of practical software existing under most operating systems. Moreover, they emphasize programming methods that serve as paradigms in other fields of computer science (system or software design). Finally, they also play an important role in theoretical computer science by providing challenging problems. In this thesis, we specifically focus on the problem of pattern matching.
Pattern matching of textual data arises in several important commercial applications. Sequential pattern mining, i.e., the mining of frequent subsequences as patterns in a sequence database, is an important data mining task with broad applications. Interestingly, this is also the case in molecular biology because biological molecules can often be approximated as sequences of nucleotides or amino acids. Furthermore, the amount of available data in these fields tends to double every eighteen months, thereby underlining the necessity of efficient pattern matching algorithms even if the speed and storage capacity of computers increase regularly. Moreover, pattern matching is also a key part of various other applications including analysis of consumer behaviors, web access patterns, process analysis of scientific experiments and prediction of natural disasters, to mention just a few. Incidentally, ordered, labeled trees are often deployed in pattern matching. Ordered, labeled trees are trees in which each vertex has a label and the left-to-right order of its children (if any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology, programming compilation, and natural language processing. Many of the applications involve comparing trees or retrieving/extracting information from a repository of trees. Examples include classification of unknown patterns, analysis of newly sequenced RNA structures, semantic taxonomy for dictionary definitions, generation of interpreters for non-procedural programming languages, and automatic error recovery and correction for programming languages.
Multiple sequence alignment can be seen as a generalization of pairwise sequence alignment: instead of aligning two sequences, k sequences are aligned simultaneously, where k is any number greater than 2. Multiple sequence alignment is particularly useful in the field of bioinformatics because it allows biologists to extract and represent biologically important but faintly or widely dispersed sequence similarities, giving them hints about the evolutionary history of certain sequences. The problem of multiple sequence alignment has been shown to be NP-Complete in general and is therefore not likely to be solved in polynomial time. For NP-Complete problems no polynomial-time algorithm is known, and none exists unless P = NP. The dynamic programming algorithm for the multiple alignment of N sequences of length L has a time complexity of $\Theta(2^N L^N)$ and a space complexity of $\Theta(L^N)$. It turns out that not all cells of the cube (for the 3-sequence case) and, in general, of the N-dimensional matrix need to be computed, and the order of computation can also be heuristically optimized. In addition to its uses in bio-informatics, multiple sequence alignment could probably also be used to generate an optimal, or in most cases a valid, schedule given certain delay constraints in a client-server architecture. We touch on this possible application of multiple sequence alignment in the future work section in Chapter 5.
Approximate matching of regular expressions forms the basis of many search procedures in various applications. Searching for a pattern in a text file is a very common operation in many applications, ranging from conventional applications such as text editors to more sophisticated applications in molecular biology. Incidentally, schema matching is also a basic problem in many database application domains such as data integration, E-business, data warehousing and semantic query processing. Schema matching is usually performed manually through some form of a graphical interface. Such methods have the disadvantages of being cumbersome, time-consuming and error-prone. By combining existing schema matching techniques with customized approximate pattern matching algorithms for regular expressions, one can automate an otherwise largely manual operation. Rahm et al. [19] survey a number of approaches to automatic schema matching.
1.1 Summary of Contributions
This work focuses on problems of approximate pattern matching in complex structures such as trees, graphs and regular expressions. The contributions of this thesis are two-fold.
• We present new algorithms for the approximate matching of trees (ordered and unordered) and acyclic graphs based on edit distance measures under the degree-1 constraint, the implication being that the relevant information is located at the leaves of a tree or at the periphery of a graph. Under the degree-1 constraint, edit operations can be performed only on vertices with degree less than or equal to 1. The ordered and unordered tree algorithms have a worst-case execution time of $O(|T_1| \cdot |T_2| \cdot k^2 \log k)$ and the algorithm for acyclic graphs has a worst-case execution time of $O(|T_1|^2 \cdot |T_2|^2 \cdot k^2 \log k)$. Our work on approximate matching of trees and acyclic graphs under the degree-1 constraint has been submitted for publication to a well-known journal and we are currently awaiting the results of the review.
• We consider the problem of approximate matching of a string with a special type of regular expression where the Kleene closure (*) is only allowed to be bound to a single character. In this regard, we present a new algorithm which exploits the special properties of such a regular expression, thereby enabling us to reduce the approximate regular expression matching problem to that of an approximate string matching problem. Our algorithm RPM for approximate matching of a string S with a class 2 regular expression R runs in $O(|S|^3)$ time and space in the worst case. Our performance evaluation indicates that our proposed techniques outperform an existing well-known algorithm by Myers [45] for approximate regular expression matching in terms of execution time. This may be primarily attributed to the approximate string matching nature of our algorithm, which makes use of simple arithmetic operations, whereas Myers' algorithm needs to first construct a regular expression edit graph before it starts traversing the edges of this graph to provide the desired result.
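As a rough illustration of the degree-1 constraint described in the first contribution, the following Python sketch lists the vertices on which an edit operation would be permitted. It assumes an undirected edge-list representation and counts the degree of a vertex as its number of incident edges; the function name `degree_one_vertices` and the sample tree are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def degree_one_vertices(edges):
    """Return, in sorted order, the vertices with at most one incident edge.

    Assumption: the graph is undirected and given as (u, v) pairs, so an
    isolated vertex (degree 0) never appears and only degree-1 vertices
    are reported in practice.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(vertex for vertex, d in degree.items() if d <= 1)


# Hypothetical HTML-like tree: html - body, body - p, body - ul, ul - li
edges = [("html", "body"), ("body", "p"), ("body", "ul"), ("ul", "li")]
editable = degree_one_vertices(edges)  # ['html', 'li', 'p']
```

Under this undirected reading the root `html` also has degree 1. Whether the thesis treats the root in this way is not stated in this chapter, so that detail should be read as an assumption of the sketch.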
1.2 Organization of the Thesis
The remainder of the thesis is organized as follows:
• Chapter 2 provides an overview of existing works in the field of approximate matching of strings, trees, graphs and regular expressions.
• Chapter 3 presents our algorithm for approximate matching of ordered and unordered trees and acyclic graphs under the degree-1 constraint.
• Chapter 4 discusses the work we have done in the area of string to regular expression matching in a special case, and compares our algorithm with an existing string to regular expression approximate matching algorithm.
• Finally, we conclude in Chapter 5 with directions for future work.
2 Related Work and Background Information
This chapter describes existing works in the field of approximate matching of strings, trees, graphs and regular expressions. We begin our discussion by looking at the problem of approximate matching of strings in Section 2.1. We present the notion of edit operations on strings and edit distance with respect to strings, and introduce the edit distance matrix, a visualization which is central to many exact and approximate matching algorithms. We then present a brief survey of some of the key exact and approximate matching algorithms on strings. In Section 2.2, we discuss the issue of approximate matching in more complex structures like trees and graphs. This is a relatively new area of research in approximate matching when compared to strings. We then discuss some of the existing work done on different types of tree and graph structures. In Section 2.3, we give a brief overview of the work done on approximate matching of strings with regular expressions.
2.1 Approximate Matching in Strings
The string matching problem is the most studied problem in algorithmics on words and there are many algorithms for solving this problem efficiently. In practical pattern-matching applications, exact matching is not always relevant. It is often more important to find objects that match a given pattern in a reasonably approximate manner, i.e., allowing some errors.
Approximate string matching consists in finding all approximate occurrences of pattern x in text y. There exist a number of methods to compare two strings or sequences. One of the most common ways is the notion of similarity between two strings. A similarity measure is a function that associates a numeric value with a pair of sequences, with the idea that a higher value indicates greater similarity. A similarity measure can have both positive and negative values depending on the properties of the scoring function used.
The notion of distance is somewhat dual to similarity. It treats sequences as points in a metric space. A distance measure is a function that also associates a numeric value with a pair of sequences, but with the idea that the larger the distance, the smaller the similarity, and vice versa. Distance measures usually satisfy the mathematical axioms of a metric. In particular, distance values are never negative. Approximate occurrences of x are segments of y that are close to x according to a specific distance: their distance to x must not be greater than a given integer k. There are two standard distances: the Hamming distance and the edit distance.
The Hamming distance H is defined only for strings of the same length. For two strings S and T, H(S, T) is the number of places in which the two strings differ, i.e., the places where they have different characters. The Hamming distance is related to the number of mismatches between the pattern and its approximate occurrences. This problem is also called approximate string matching with k mismatches. For example, if S = aababb and T = bbbaba then H(S, T) = 3, corresponding to the mismatches in positions 1, 2 and 6 of the strings.
The edit distance between two strings is the minimal number of edit operations (insert, delete and change) needed to transform one string into the other. It is defined for strings of arbitrary length. For example, if S = aab and T = accab, the minimum number of edit operations required to transform S to T is 2, corresponding to the insertion of the two c's (equivalently, transforming T to S requires deleting them). There could be more than one sequence of operations that transforms one string into the other. If each operation is assigned the same cost, this is also known as the Levenshtein distance. We shall describe in detail the properties of the edit distance between strings later in this section.
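The Hamming distance example above can be checked with a few lines of Python. This is only a minimal sketch; the function names `hamming_distance` and `mismatch_positions` are illustrative, and positions are reported 1-indexed to match the text.

```python
def hamming_distance(s, t):
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance is defined only for equal lengths")
    return sum(1 for a, b in zip(s, t) if a != b)


def mismatch_positions(s, t):
    """Return the 1-indexed positions at which s and t differ."""
    return [i for i, (a, b) in enumerate(zip(s, t), start=1) if a != b]


distance = hamming_distance("aababb", "bbbaba")     # 3
positions = mismatch_positions("aababb", "bbbaba")  # [1, 2, 6]
```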
The longest common subsequence (LCS) problem is a particular case of the edit distance problem in strings. Given two strings S and T of length n and m respectively, if l is the length of LCS(S, T), then one can transform S to T by first deleting n − l characters of S (all but those of a longest common subsequence) and then inserting m − l symbols to get T. For example, if S = abbcc and T = acb with n = 5 and m = 3, then LCS(S, T) = ab or ac, with l = 2. We first delete the 3 (5 − 2) non-LCS characters bcc from S, and then insert the 1 (3 − 2) non-LCS character c of T in the correct position to obtain T from S.
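The relation between the LCS and the insert/delete-only transformation can be sketched as follows. This is a standard dynamic programming formulation written for illustration, not code from the thesis; `lcs_length` and `indel_distance` are hypothetical names.

```python
def lcs_length(s, t):
    """Length of a longest common subsequence of s and t."""
    n, m = len(s), len(t)
    # table[i][j] holds the LCS length of the prefixes s[:i] and t[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def indel_distance(s, t):
    """Deletions plus insertions needed when no change operation is used."""
    l = lcs_length(s, t)
    return (len(s) - l) + (len(t) - l)


l = lcs_length("abbcc", "acb")        # 2
cost = indel_distance("abbcc", "acb")  # 3 deletions + 1 insertion = 4
```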
S: a b c a a c a c _ a
T: a _ c a a c a c d a
Figure 2.1: An alignment for S = abcaacaca and T = acaacacda
Another way of representing the differences (or similarity) between two strings (or sequences), which is one of the central concepts in bio-informatics, is the notion of alignment. An alignment is a mutual arrangement of two sequences that exhibits where the two sequences are similar and where they differ. An optimal alignment is understandably one that exhibits the most correspondences and the least differences. The alignment problem is another interesting variation of the edit distance problem, where gaps or 'empty' strings are inserted in each of the two strings such that common characters are matched, whereas 'unmatched' characters are inserted or deleted. Figure 2.1 depicts a possible alignment for strings S = abcaacaca and T = acaacacda, which corresponds to a deletion of b in S and an insertion of d in T.
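One way to obtain such an alignment, sketched here under the assumption of unit costs for insert, delete and change, is to fill a table of edit distances between prefixes and then trace back from its last cell. The function name `align` and the use of `_` as the gap symbol are illustrative choices, not the thesis's notation.

```python
def align(s, t, gap="_"):
    """Return one optimal unit-cost alignment of s and t as two gapped strings."""
    n, m = len(s), len(t)
    # d[i][j] is the edit distance between the prefixes s[:i] and t[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            change = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + change)

    # Trace back from (n, m), emitting one alignment column per step.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        change = 0 if i > 0 and j > 0 and s[i - 1] == t[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + change:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            top.append(s[i - 1]); bottom.append(gap); i -= 1  # deletion from s
        else:
            top.append(gap); bottom.append(t[j - 1]); j -= 1  # insertion from t
    return "".join(reversed(top)), "".join(reversed(bottom))


row_s, row_t = align("abcaacaca", "acaacacda")
# row_s == "abcaacac_a", row_t == "a_caacacda"
```

For this pair of strings the optimal alignment is unique, so the result reproduces the deletion of b and the insertion of d. In general several optimal alignments may exist, and the tie-breaking order in the traceback decides which one is returned.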
2.1.1 Definition
Before we move on to the major algorithms on string matching, we shall first present a definition of a string proposed by Crochemore and Rytter (1994)[18].
Let Σ be an input alphabet, i.e., a finite set of symbols. Elements of Σ are called characters. A string over Σ is defined as a finite sequence of elements of Σ. The length of a string S, |S|, is defined as the number of elements (with repetitions) in the string S. Therefore the length of abbab is 5.
The i-th element of the string S is denoted by S[i], and i is its position in S. For example, the 4th character in the string pattern is S[4] = 't'. A substring of S, denoted by S[i..j], is the sequence of elements S[i]S[i+1]...S[j] in S. For example, pat is a substring of pattern. A string S_seq is a subsequence of S if S_seq can be obtained from S by removing zero or more (not necessarily adjacent) letters from it. For example, pen is a valid subsequence of pattern. Intuitively, S_seq is a subsequence of S if S_seq = S[i_1]S[i_2]...S[i_m], where i_1, i_2, ..., i_m is an increasing sequence of indices in S.
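The difference between a substring and a subsequence can be made concrete with a short Python sketch. The helper names `is_substring` and `is_subsequence` are illustrative only.

```python
def is_substring(candidate, s):
    """True if candidate occurs in s as a block of adjacent characters."""
    return candidate in s


def is_subsequence(candidate, s):
    """True if candidate can be obtained from s by removing zero or more characters."""
    remaining = iter(s)
    # Each membership test consumes the iterator up to the matched character,
    # so the characters of candidate must appear in s in increasing positions.
    return all(ch in remaining for ch in candidate)


length = len("abbab")                            # 5
pat_is_sub = is_substring("pat", "pattern")      # True
pen_is_sub = is_substring("pen", "pattern")      # False: not adjacent in pattern
pen_is_subseq = is_subsequence("pen", "pattern")  # True
```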
2.1.2 Problem Definition
Given two strings S1 and S2 of length m and n (m ≤ n) respectively, the very basic form of the exact string matching problem is a membership decision problem, i.e., verify whether S1 occurs in S2. The output is a boolean value: S1 either occurs in S2 or it does not. As mentioned earlier in Section 2.1, very often this is not very helpful. A more interesting scheme would be to see how far S1 is from S2.
The Approximate String Matching problem is defined as the problem of transforming S1 to S2 via a series of edit operations. We now discuss edit operations on strings.
Edit Operations in Strings
When two strings S1 and S2 do not exactly match, there exist errors corresponding to the differences between the two strings. Let ∅ represent an empty structure. An edit operation can be represented as a pair (u, v) ≠ (∅, ∅), sometimes written u → v. There are three kinds of edit operations:
1. change: symbols at corresponding positions are different. A change operation is represented by (u, v) where u ≠ ∅ and v ≠ ∅.
2. insert: a symbol of S2 is missing in S1 at a corresponding position. An insert operation is represented by (u, v) where u = ∅ and v ≠ ∅.
3. delete: a symbol of S1 is missing in S2 at a corresponding position. A delete operation is represented by (u, v) where u ≠ ∅ and v = ∅.
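The pair notation for the three operations can be mirrored directly in code. The sketch below is only an illustration: it uses `None` as a stand-in for the empty structure ∅, and the name `classify` is hypothetical.

```python
EMPTY = None  # stand-in for the empty structure ∅


def classify(u, v):
    """Name the edit operation represented by the pair (u, v)."""
    if u is EMPTY and v is EMPTY:
        raise ValueError("(∅, ∅) is not an edit operation")
    if u is EMPTY:
        return "insert"  # u = ∅, v ≠ ∅
    if v is EMPTY:
        return "delete"  # u ≠ ∅, v = ∅
    # Both symbols are present; when u == v such a pair is usually free of cost.
    return "change"


operations = [classify("b", "a"), classify(EMPTY, "c"), classify("b", EMPTY)]
# ['change', 'insert', 'delete']
```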
Edit Distance between Strings
One would always look for the best way to transform S1 to S2, i.e., the minimum number of differences between S1 and S2. This can be translated as the smallest number of edit operations (change, insertion and deletion) needed to transform S1 to S2. This is called the edit distance between S1 and S2 and is denoted by δ(S1, S2). Three properties are satisfied at all times. They are as follows:
• δ(S1, S2) = 0 iff S1 = S2: If both the strings are the same, the minimum distance between them is 0, as no edit operation is required to transform one into the other.
• δ(S1, S2) = δ(S2, S1): A fundamental property of the edit distance is that it is symmetric. This comes from the duality between the deletion and the insertion operations. A deletion of a character a in S1 in order to get S2 corresponds to an insertion of a in S2 to get S1.
• δ(S1, S2) ≤ δ(S1, S3) + δ(S3, S2) (triangle inequality).
Central to many of the algorithms based on the edit distance scheme is the edit distance matrix. Assume that the strings S1 and S2 are of fixed lengths m and n respectively, such that n ≥ m. Each cell in the edit distance matrix, EDIT, has a value equal to δ(S1[1..i], S2[1..j]), with 0 ≤ i ≤ m and 0 ≤ j ≤ n. The boundary values of EDIT are defined as follows:
for 0 ≤ i ≤ m, 0 ≤ j ≤ n, EDIT[0, j] = j; EDIT[i, 0] = i;
The rest of the elements of EDIT can be computed using the simple formula
$$EDIT[i, j] = \min \begin{cases} EDIT[i-1, j] + cost(delete) \\ EDIT[i, j-1] + cost(insert) \\ EDIT[i-1, j-1] + \delta(S_1[i], S_2[j]) \end{cases}$$
The matrix can also be viewed as a graph in which each vertex (i − 1, j − 1) is connected to three other vertices, (i − 1, j), (i, j − 1) and (i, j), when i ≤ m and j ≤ n. The edit distance between strings S1 and S2 equals the length of a least weighted path in this graph from the source (0, 0) to the sink (m, n). An edge from (i − 1, j − 1) to (i, j − 1) represents a deletion edge and has a discrete integer cost assigned to it, represented by cost(delete). An edge from (i − 1, j − 1) to (i − 1, j) represents an insertion edge and has a discrete integer cost assigned to it, represented by cost(insert). An edge from (i − 1, j − 1) to (i, j) represents a replacement edge and its cost is represented by cost(change) = δ(S1[i], S2[j]).
Figure 2.2: Edit Distance Matrix for S1 = abba and S2 = aabaca
Figure 2.2 shows the edit distance matrix for strings S1 = abba and S2 = aabaca. We assume each edit operation (delete, insert and change) has a cost of 1. The minimum edit distance, δ(S1, S2), is represented by EDIT[4][6] = 3. This reflects the change operation b → a and the insertion of c and a. There are several paths from source to sink in this case. One possible path is EDIT[0][0] → EDIT[1][1] → EDIT[2][2] → EDIT[3][3] → EDIT[4][4] → EDIT[4][5] → EDIT[4][6].
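The matrix for this example can be recomputed with a short Python sketch. It assumes unit costs, as in the text, and the standard recurrence over the top, left and diagonal neighbours; the function name `edit_matrix` is illustrative, and Python's 0-indexed strings shift the character indices by one.

```python
def edit_matrix(s1, s2):
    """Return the (m + 1) x (n + 1) unit-cost edit distance matrix of s1 and s2."""
    m, n = len(s1), len(s2)
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        edit[i][0] = i  # i deletions turn s1[:i] into the empty string
    for j in range(n + 1):
        edit[0][j] = j  # j insertions turn the empty string into s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            change = 0 if s1[i - 1] == s2[j - 1] else 1
            edit[i][j] = min(
                edit[i - 1][j] + 1,           # delete s1[i]
                edit[i][j - 1] + 1,           # insert s2[j]
                edit[i - 1][j - 1] + change,  # change s1[i] into s2[j], or match
            )
    return edit


EDIT = edit_matrix("abba", "aabaca")
distance = EDIT[4][6]  # 3
# Values along the path quoted in the text, from EDIT[0][0] to EDIT[4][6]:
path = [EDIT[0][0], EDIT[1][1], EDIT[2][2], EDIT[3][3], EDIT[4][4], EDIT[4][5], EDIT[4][6]]
# [0, 0, 1, 1, 1, 2, 3]
```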
The naive exact string matching algorithm locates all occurrences in time O(nm). Hashing, however, provides a simple method that avoids the quadratic number of symbol comparisons in most practical situations and that runs in linear time under reasonable probabilistic assumptions (Harrison (1971)[25] and Karp and Rabin (1987)[31]). The two most famous exact string matching algorithms were devised by Morris and Pratt (1970)[30] and Boyer and Moore (1977)[9]. The first linear-time string-matching algorithm was developed by Morris and Pratt (1970). It was improved by Knuth, Morris, and Pratt (1977)[34]. The algorithm's preprocessing phase runs in O(m) time and space, and its searching phase runs in O(n + m) time (independent of the alphabet size). The Boyer-Moore algorithm (1977) is considered the most efficient string-matching algorithm in usual applications. It consists of a preprocessing phase which runs in O(m + σ) time and space and a searching phase which runs in O(mn) time in the worst case. A simplified version of it (or the entire algorithm) is often implemented in text editors for the "search" and "substitute" commands. Several variants of Boyer and Moore's algorithm avoid the quadratic behavior when searching for all occurrences of the pattern. The most efficient solutions in terms of number of symbol comparisons have been designed by Apostolico and Giancarlo (1986)[5] and Colussi (1994)[15].
Although the idea of approximate pattern matching is ubiquitous in information processing, it first appeared clearly in the early work on approximate string matching. Wagner and Fischer (1974)[65] define an edit distance between two strings and an algorithm for its computation. The distance between two strings is given by the minimum number of operations (insertion of a letter, deletion of a letter or change of a letter) required to transform one string into the other. The authors present an algorithm which runs in O(nm) time, where each cell in the matrix holds the minimum of three costs: a delete operation applied to the cell above it, an insert operation applied to the cell to its left, and a change operation applied to the cell diagonally above and to its left. The edit distance between the two strings is found in the final cell of the two-dimensional matrix. When every operation costs one, this is the unit edit distance. Different operations can also be assigned different weights.
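The matrix computation described above can be sketched as follows. This is a minimal illustration under unit costs, not a transcription of the original paper; the function name edit_distance and the example strings are our own assumptions.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Unit-cost edit distance computed with a Wagner-Fischer style matrix."""
    n, m = len(s1), len(s2)
    # edit[i][j] holds the distance between the prefixes s1[:i] and s2[:j].
    edit = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        edit[i][0] = i  # delete all i symbols of s1[:i]
    for j in range(1, m + 1):
        edit[0][j] = j  # insert all j symbols of s2[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            change = 0 if s1[i - 1] == s2[j - 1] else 1
            edit[i][j] = min(
                edit[i - 1][j] + 1,           # delete s1[i-1] (cell above)
                edit[i][j - 1] + 1,           # insert s2[j-1] (cell to the left)
                edit[i - 1][j - 1] + change,  # change or keep (diagonal cell)
            )
    return edit[n][m]
```

For the hypothetical strings "bxyz" and "axyzca", the function returns 3, which corresponds to one change (b → a) and two insertions (c and a). Assigning different weights to the operations would only require replacing the constants 1 in the three candidate costs.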
The notion of a longest common subsequence (LCS) of two strings is widely used to compare files. The diff command of the UNIX system implements an algorithm based on this notion, in which the lines of the files are considered as symbols. Informally, the result of a comparison gives the minimum number of operations (insert a symbol or delete a symbol) needed to transform one string into the other. The comparison of molecular sequences is basically done with a closely related concept, the alignment of strings, which consists in aligning their symbols on vertical lines. This is related to an edit distance, called the Levenshtein distance, with the additional operation of substitution, and with weights associated with the operations. Hirschberg (1975)[27] presents the computation of the LCS in linear space. Aho, Hirschberg and Ullman (1976)[3] show that, unless a bound on the total number of distinct symbols is assumed, every solution to the problem can consume an amount of time that is proportional to the product of the lengths of the two strings.
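As a small illustration of the link between the LCS and the insert/delete distance, the sketch below computes the length of an LCS while keeping only one previous row of the matrix, so the space used is linear. It covers only the length computation; Hirschberg's full divide-and-conquer method, which also recovers the subsequence itself in linear space, is not reproduced here. The function names are our own.

```python
def lcs_length(a, b) -> int:
    """Length of a longest common subsequence, keeping a single previous row."""
    if len(b) > len(a):
        a, b = b, a                    # use the shorter sequence for the rows
    prev = [0] * (len(b) + 1)          # row computed for the previous symbol of a
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a, b) -> int:
    """Minimum number of insertions and deletions that turn a into b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)
```

Because the functions accept any sequences, the symbols can be the lines of two files, as in diff: indel_distance(["a", "b", "c"], ["a", "c"]) counts the single deleted line.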
In this section we present a timeline of some of the major exact and approximate string matching algorithms.
[30] 1970 → [25] 1971 → [65] 1974 → [27] 1975 → [3] 1976 → [9, 34] 1977 → [5] 1986 → [31] 1987 → [15] 1994
2.2 Approximate Matching in Trees and Graphs
Approximate pattern matching of complex structures such as trees and graphs is a primitive operation essential to applications in information retrieval, information integration and mediation, and in many such domains that require evaluating or characterizing the similarity between structured and complex objects such as HTML documents, molecular compounds and XML data.
A tree is a special kind of graph: a connected graph which contains no cycles. We can visualize a tree by drawing it with a root at the top and the vertices below leading to the leaves at the lowest level. If the vertices are placed on levels, higher-level vertices are referred to as the parents of the vertices directly below them, while the lower vertices are similarly referred to as their children. A tree with n vertices has n − 1 edges. Although it may not be part of the widest definition of a tree, a common constraint is that no vertex can have more than one parent. Moreover, for some applications, it is necessary to consider a vertex's child vertices to be an ordered list, instead of merely a set. As a data structure in computer programs, trees are used in everything from B-trees in databases and file systems, to game trees in game theory, to syntax trees in human or computer languages.
An ordered tree is a tree in which the relative order of the subtrees meeting at each vertex must be preserved, i.e. the left-to-right order of the children of every vertex matters.
Figure 2.3: Ordered Tree Example
Figure 2.3 shows an ordered tree where each vertex is labeled with a character and has its postorder number written adjacent to it.
An unordered tree is a tree in which the relative order of the subtrees meeting
at each vertex need not be preserved, i.e., the left-to-right order of the children of
every vertex does not matter.
Figure 2.4: Unordered Trees Example
Figure 2.4 shows two vertex-labeled trees which are exactly the same if they are considered to be unordered. The left-to-right ordering of the children does not matter in the unordered case.
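The difference between the ordered and the unordered view can be made concrete with a short sketch. It assumes a hypothetical representation of a labeled tree as a pair (label, tuple of subtrees), and compares unordered trees by first sorting the children recursively into a canonical form. The trees below are our own example, not the ones drawn in Figure 2.4.

```python
from typing import Tuple

# A labeled tree is written as (label, (subtree, subtree, ...)).
Tree = Tuple[str, tuple]


def canonical(tree: Tree) -> Tree:
    """Return a copy of the tree whose children are sorted recursively."""
    label, children = tree
    return (label, tuple(sorted(canonical(c) for c in children)))


def same_unordered(t1: Tree, t2: Tree) -> bool:
    """True if t1 and t2 are equal when the order of children is ignored."""
    return canonical(t1) == canonical(t2)


def same_ordered(t1: Tree, t2: Tree) -> bool:
    """True if t1 and t2 are equal with the order of children preserved."""
    return t1 == t2


left = ("a", (("b", ()), ("c", (("d", ()), ("e", ())))))
right = ("a", (("c", (("e", ()), ("d", ()))), ("b", ())))
```

Here same_unordered(left, right) holds while same_ordered(left, right) does not, since the two trees differ only in the left-to-right order of children.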
Given a tree, it is usually convenient to use a numbering scheme to refer to the vertices of the tree. For an ordered tree T, the left-to-right postorder numbering or the left-to-right preorder numbering is often used to number the vertices of T from 1 to |T|, the size of the tree T. For an unordered tree, we can fix an arbitrary order for the children of each vertex in the tree and then use left-to-right postorder numbering or left-to-right preorder numbering. Suppose that we have a numbering for each tree. Let t[i] be the i-th vertex of tree T in the given numbering. We use T[i] to denote the subtree rooted at t[i]. Let $t[i_1], t[i_2], \ldots, t[i_{n_i}]$ be the children of t[i]. The interesting property of the postorder numbering scheme is that the children of a vertex are always assigned numbers lower than that of their parent. This fits in perfectly, as in many dynamic programming algorithms it is crucial for the children to be processed before the parent.
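The postorder numbering scheme can be sketched as below. The tree representation (label, list of subtrees) and the function name postorder_numbering are assumptions made for this illustration only.

```python
def postorder_numbering(tree):
    """Number the vertices of an ordered tree from 1 to |T| in postorder.

    The tree is given as (label, [subtrees]). Returns two dictionaries:
    number -> label, and number -> list of the numbers of its children.
    """
    labels, children = {}, {}

    def visit(node) -> int:
        label, subtrees = node
        child_numbers = [visit(sub) for sub in subtrees]  # children first, left to right
        number = len(labels) + 1                          # then the parent
        labels[number] = label
        children[number] = child_numbers
        return number

    visit(tree)
    return labels, children


example = ("a", [("b", [("c", []), ("d", [])]), ("e", [])])
labels, children = postorder_numbering(example)
```

For this example the labels are numbered c = 1, d = 2, b = 3, e = 4 and a = 5, so every child receives a lower number than its parent and a loop over 1, 2, ..., |T| processes children before parents.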
Edit Operations in Trees and Graphs
There are three kinds of edit operations in trees and graphs:
1. relabel: Relabeling a vertex x means changing the label on x.
2. delete: Deleting a vertex x means making the neighbors of x (except an arbitrarily specified neighbor x′) become the neighbors of x′ and then removing x.
3. insert: Inserting a vertex is the inverse of deleting one.
Figure 2.5: Delete and Insert Edit Operations on Trees: (a) delete vertex, (b) insert vertex
The neighbors of a character s in a string are the characters s′ and s″ on either side of s, where s′ and/or s″ may be empty. A neighbor of a vertex x in a tree or a graph is any vertex x′ that is directly connected to x by a single edge. Although we are only concerned with inserts and deletes for our algorithms, we describe the relabeling operation for the sake of completeness.
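The delete operation, as defined above in terms of neighbors, can be illustrated on an undirected graph stored as a dictionary from each vertex to the set of its neighbors. This representation and the names delete_vertex and keep (playing the role of the specified neighbor x′) are assumptions of the sketch.

```python
def delete_vertex(adj, x, keep):
    """Delete x from an undirected graph given as {vertex: set of neighbors}.

    The neighbors of x other than `keep` become neighbors of `keep`,
    and x is then removed. A new adjacency map is returned; `adj` is unchanged.
    """
    if keep not in adj[x]:
        raise ValueError("keep must be a neighbor of x")
    new_adj = {v: set(ns) for v, ns in adj.items() if v != x}
    for n in adj[x]:
        new_adj[n].discard(x)          # the edge to x disappears
        if n != keep:
            new_adj[n].add(keep)       # n is re-attached to the kept neighbor
            new_adj[keep].add(n)
    return new_adj


tree = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"}}
after = delete_vertex(tree, "b", keep="a")
```

In a rooted tree, choosing the parent of x as the kept neighbor makes the children of x become children of that parent, which is the situation depicted for trees in Figure 2.5.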
Edit Distance between Trees/Graphs
Given two graphs G1 and G2, there are several methods of performing approximate pattern matching between the two structures. One way is to measure the edit distance, i.e. the minimum cost of transforming one structure into the other, quite often through a series of edit operations, i.e. deletion of a vertex in G1, insertion of a vertex in G2 and the relabeling of a vertex in G2 with the label of a vertex in G1. The edit distance between any two graphs G1 and G2 is denoted by δ(G1, G2). Each edit operation can be assigned a numeric cost (not necessarily distinct). The edit distance is in fact a distance metric, as it satisfies the basic rules of symmetry and triangle inequality.
2.2.2 Problem Definition
Approximate tree matching is a generalization of approximate string matching. Given two trees, we view one tree as the pattern tree and the other as the data tree. The idea is to match the pattern tree approximately to the data tree. Given two trees T1 and T2, the task of transforming T1 to T2 or T2 to T1 via a sequence of edit operations is termed the problem of approximate pattern matching in trees.
Several definitions and algorithms have been given for the approximate matching of graphs and trees. They correspond to different data structures (ordered trees, unordered trees, graphs, etc.), different notions of similarity or distance, and different constraints. Tai (1979)[60] was one of the first authors to work on the topic of approximate pattern matching of trees. He gave the definition of the edit distance between ordered, labeled trees and the first non-exponential algorithm to compute it. Tai used a preorder numbering scheme to number the trees. The convenient aspect of this notation is that for any i, 1 ≤ i ≤ |T|, the vertices from t[1] to t[i] form a tree rooted at t[1]. He adopted the same approach as in sequence editing and came up with an algorithm that runs in $O(|T_1| \cdot |T_2| \cdot depth(T_1)^2 \cdot depth(T_2)^2)$ time and space. Lu (1979)[40] presents another algorithm on ordered trees based on the edit operations presented by Tai. Let $t_1[i_1], t_1[i_2], \ldots, t_1[i_{n_i}]$ be the children of $t_1[i]$ and $t_2[j_1], t_2[j_2], \ldots, t_2[j_{n_j}]$ be the children of $t_2[j]$. Lu considers the following three cases: (1) $t_1[i]$ is deleted: in this case the distance is obtained by matching $T_2[j]$ to one of the subtrees of $t_1[i]$ and then deleting all the remaining subtrees; (2) $t_2[j]$ is inserted: in this case the distance is obtained by matching $T_1[i]$ to one of the subtrees of $t_2[j]$ and then inserting all the remaining subtrees; (3) $t_1[i]$ matches $t_2[j]$: in this case, consider the subtrees $t_1[i_1], t_1[i_2], \ldots, t_1[i_{n_i}]$ and $t_2[j_1], t_2[j_2], \ldots, t_2[j_{n_j}]$ as two sequences and each individual subtree as a whole entity. He then uses the sequence edit distance to determine the distance between $t_1[i_1], t_1[i_2], \ldots, t_1[i_{n_i}]$ and $t_2[j_1], t_2[j_2], \ldots, t_2[j_{n_j}]$. This algorithm considers each subtree as a whole entity. It does not allow one subtree of $T_1$ to map to more than one subtree of $T_2$.
Kilpeläinen and Mannila (1995)[32] introduced the tree inclusion problem: given a pattern tree P and a target tree T, tree inclusion asks whether it is possible to obtain P by only deleting vertices of T. Both ordered and unordered trees are considered. There may be exponentially many ordered embeddings of P in T. Assuming that P and T have the same root label, their algorithm tries to embed P into T by embedding the subtrees of P as deeply and as far to the left as possible. The time complexity of their algorithm is O(|P|·|T|), and they showed that the unordered inclusion problem is NP-complete.
Shasha and Zhang (1990, 1991)[57, 71] define an edit distance for ordered labeled trees and propose algorithms for its computation. Jiang, Wang and Zhang (1994)[29] address the problem of approximate matching in ordered labeled trees by inserting empty vertices to align the structure of ordered labeled trees. The distance is defined as the sum of the scores of the opposing labels after the structurally similar graphs are overlaid. Shasha et al. (1994)[56] propose several enumerative algorithms for the approximate matching of unordered labeled trees. The algorithms are based on probabilistic hill climbing and bipartite graph matching [10, 22]. Luccio and Pagli (1991)[41] present algorithms for the approximate matching of H-ary trees and arbitrary ordered trees. Wang, Zhang and Chirn (1994)[67] define an edit distance for labeled graphs and describe an algorithm they call inexact graph matching in terms of finding a minimal transformation cost between the graphs. Vilares, Ribadas and Grana (2001)[63] present a proposal intended to demonstrate the applicability of tabulation techniques for detecting approximately common patterns in structures that share some common parts, based on approximate pattern matching of two ordered labeled trees. Finally, Zhang, Wang and Shasha (1995)[72] introduce the notion of the degree-2 constraint to define an edit distance between undirected acyclic graphs and propose algorithms for its computation.
in a text file is a very common operation in many applications ranging from text editors and databases to applications in molecular biology.
Following [1], regular expressions and the strings they match can be summed up recursively as follows:
1. |, (, ) and * are metacharacters.
2. A non-metacharacter a is a regular expression that matches the string a.
3. If r1 and r2 are regular expressions, then (r1|r2) is a regular expression that matches any string matched by either r1 or r2.
4. If r1 and r2 are regular expressions, then (r1)(r2) is a regular expression that matches any string of the form xy, where r1 matches x and r2 matches y.
5. If r is a regular expression, then (r)* is a regular expression that matches any string of the form $x_1 x_2 \ldots x_n$, n ≥ 0, where r matches $x_i$ for 1 ≤ i ≤ n. (r)* also matches the empty string, represented by ε. * is also known as the Kleene Closure operator.
6. If r is a regular expression, then (r) is a regular expression that matches the same string as r.
The notation of regular expressions arises naturally from the mathematical result of Kleene [33] that characterizes the regular sets as the smallest class of sets of strings which contains all finite sets of strings and which is closed under the operations of union, concatenation and "Kleene Closure".
Given a string S and a regular expression R, the problem of approximate matching of a string and a regular expression is to find a string $S_R \in L(R)$, where L(R) is the language defined by the regular expression R, such that the difference (edit distance) between S and $S_R$ is the least.
Formally, $\Delta(R, S) = \min_{S_R \in L(R)} \delta(S, S_R)$.
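To make the definition concrete, the following sketch evaluates the minimum for unit costs by a shortest-path search over pairs (automaton state, position in S). It is a standard textbook-style construction given here only for illustration and is not the algorithm proposed in this thesis. The automaton is assumed to be supplied by hand as a transition table; the function name and the table format are hypothetical.

```python
import heapq


def regex_edit_distance(transitions, start, accepting, s):
    """Smallest unit edit distance between s and any string accepted by an NFA.

    transitions: {state: [(symbol, next_state), ...]}, where the symbol None
    denotes an epsilon move. States are assumed to be comparable (e.g. ints).
    """
    n = len(s)
    dist = {(start, 0): 0}
    queue = [(0, start, 0)]            # (cost so far, NFA state, symbols of s consumed)
    while queue:
        cost, q, i = heapq.heappop(queue)
        if cost > dist.get((q, i), float("inf")):
            continue                   # stale queue entry
        if i == n and q in accepting:
            return cost                # the first accepting node popped is optimal
        moves = []
        if i < n:
            moves.append((cost + 1, q, i + 1))                 # delete s[i]
        for symbol, nxt in transitions.get(q, []):
            if symbol is None:
                moves.append((cost, nxt, i))                   # epsilon move, free
            else:
                moves.append((cost + 1, nxt, i))               # insert `symbol`
                if i < n:
                    change = 0 if s[i] == symbol else 1
                    moves.append((cost + change, nxt, i + 1))  # match or change
        for new_cost, nq, ni in moves:
            if new_cost < dist.get((nq, ni), float("inf")):
                dist[(nq, ni)] = new_cost
                heapq.heappush(queue, (new_cost, nq, ni))
    return None                        # the automaton accepts no string at all


# Hand-built automaton for the regular expression a(b|c)*d.
R = {0: [("a", 1)], 1: [("b", 1), ("c", 1), ("d", 2)]}
```

With this automaton, regex_edit_distance(R, 0, {2}, "abcd") is 0 because the string belongs to L(R), while "abxd" is at distance 1 and the empty string at distance 2.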
2.3.3 Algorithms
The bulk of the research on regular expressions has been done on checking whether an input string belongs to the language of a regular expression and, in some cases, how "far" the string is from being a member of that language. Waterman (1984)[68] reviews several mathematical methods for the comparison of nucleic acid sequences. He discusses the problem of comparing several sequences, which is a slight simplification of the regular expression matching problem. Thompson (1968)[61] describes a regular expression recognition technique where each character in the text to be searched is examined in sequence against a list of possible current characters. During the examination a new list of all possible next characters is built. When the end of the current list is reached, the new list becomes the current list, the next character is obtained and the process continues. Wagner (1974)[64] presents an error correction algorithm that acts as a preprocessor of sorts: it accepts a possibly illegal source string and translates it into a guaranteed syntactically legal string by finding a string B belonging to a given regular language L which is "nearest" (in number of edit operations) to the given input string α. Knight et al. (1995)[32] delve into the problem of approximate pattern matching of regular expressions with concave gap penalties and present an $O(MP(\log M + \log^2 P))$ algorithm for its computation, where M and P are the sizes of the input string and the regular expression respectively. The concave gap penalty scheme is a symbol-independent gap-cost model where the cost of a gap is solely a function of its length. Myers (1992)[44] presents an $O(PN/\log N)$ algorithm, where P is the length of a regular expression R and N is the length of the word A, to determine if A is in the language denoted by R. The algorithm is based on a log N speedup of the standard O(PN) time simulation of R's NDFA on A using a combination of the node-listing and "Four-Russians" [6] paradigms. Eppstein et al. (1993)[21] look into the problems of sequence alignment and the prediction of RNA secondary structure and present a common solution based on a common structure which can be expressed as a system of dynamic programming recurrence equations. Myers et al. (1989)[45] present an algorithm to find a sequence matching a regular expression R whose optimal alignment with A is the highest scoring of all such sequences in O(MN) time, where M and N are the lengths of A and R respectively.
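The list-based recognition technique and the standard O(PN) simulation mentioned above can be illustrated by the sketch below, which keeps the set of states the automaton may currently be in and builds the next set for each character of the text. It is a simplified rendering in terms of automaton states rather than a reproduction of Thompson's original construction; the transition-table format and the function name are assumptions.

```python
def nfa_accepts(transitions, start, accepting, text):
    """Simulate an NFA on text by maintaining the list of possible current states.

    transitions: {state: [(symbol, next_state), ...]}, where the symbol None
    denotes an epsilon move.
    """

    def closure(states):
        # Add every state reachable through epsilon moves.
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for symbol, nxt in transitions.get(q, []):
                if symbol is None and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    current = closure({start})
    for ch in text:
        # Build the new list from the current one; it then becomes current.
        new_list = {nxt for q in current
                    for symbol, nxt in transitions.get(q, [])
                    if symbol == ch}
        current = closure(new_list)
    return bool(current & set(accepting))


# Hand-built automaton for the regular expression (a|b)*c.
R = {0: [("a", 0), ("b", 0), (None, 1)], 1: [("c", 2)]}
```

Each character of the text is examined once against at most all states of the automaton, which gives the product-of-sizes running time of the standard simulation.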
In this section, we present a timeline of some of the popular algorithms on approximate matching of regular expressions.
[61] 1968 → [64] 1974 → [68] 1984 → [45] 1989 → [44] 1992 → [21] 1993 → [32] 1995
Approximate matching of trees and graphs under the degree-1 constraint
Approximate pattern matching of complex structures such as trees and graphs is a primitive operation essential to applications in information retrieval, information integration and mediation, and in many such domains that require evaluating or characterizing the similarity between structured and complex objects such as HTML documents, molecular compounds and XML data. For example, as more and more autonomous organizations produce and exchange XML data, the XML documents interchanged would be prone to spelling errors, syntactic or structural discrepancies, as well as other syntactic or semantic differences. RDF [50] descriptions can be represented as acyclic graphs. The ability to evaluate the similarity between two documents is the basis for search and integration. Approximate pattern matching can also be used in the area of schema matching. Automating schema matching remains one of the challenging tasks in semi-structured data research. Schema matching is a key operation for many applications including data integration, schema integration and semantic query processing.
Approximate pattern matching in trees plays a very important part in Information Extraction techniques. The traditional approach for extracting data from Web sources is to write specialized programs, called wrappers, that identify data of interest and map them to some suitable format, for instance XML or relational tables. There are several existing approaches to web data extraction. One of the first initiatives for addressing the problem of wrapper generation was the development of languages specially designed to assist users in constructing wrappers. Some of the best known tools to adopt this approach are Minerva [17] and TSIMMIS [24]. Some tools, like W4F [53] and XWRAP [38], rely on the inherent structural features of HTML documents for accomplishing data extraction by converting an HTML document into a parsing tree, a representation that reflects its tag hierarchy. There exist tools, like RAPIER [11] and WHISK [59], which take advantage of Natural Language Processing (NLP) techniques such as filtering, part-of-speech tagging and lexical semantic tagging to build relationships between phrases and sentence elements so that extraction rules can be derived. Other tools rely solely on formatting features that implicitly depict the structure of the pieces of data found, which makes them more suitable for HTML documents. Examples of such tools are WIEN [36] and STALKER [43]. More information on the variety of wrappers available today can be found in the survey of web data extraction tools by Laender et al. [37].
Several tools [8, 46, 26] are available to assist users in tracking when web pages have changed. Liu, Pu and Tang (2000)[39] present WebCQ, a prototype system for large-scale web information monitoring and delivery. The WebCQ system consists of four main components: a change detection robot that discovers and detects changes, a proxy cache service that reduces communication traffic to the original information servers, a personalized presentation tool that highlights changes detected by WebCQ sentinels, and a change notification service that delivers fresh information to the right users at the right time. The change detection and summarization phases make use of a scheme which merges the two documents (before and after the change) by summarizing all the common, new and deleted material in one document, as is done in HTMLDiff and the UNIX diff command [62].
Automatic schema matching has become one of the key areas of research in the field of computer science due to the rapidly increasing number of web data sources and E-businesses to integrate. Most work on schema matching has been motivated by schema integration [47, 58]: given a set of independently developed schemas, construct a global view. Schema matching is also useful in applications being considered for the semantic web [7], such as mapping messages between autonomous agents. A somewhat different scenario is semantic query processing [66, 52], a run-time scenario where a user specifies the output of a query and the system figures out how to produce that output.
A significant amount of work has been done on the comparison of conceptual graphs representing knowledge elements. In [69, 70], the authors address the task of approximate matching of knowledge elements and present an algorithm that compares them by measuring the similarity between two texts represented as conceptual graphs. Change detection and monitoring techniques for web pages on the internet have been around for some time now and are constantly evolving based on different constraints. There exist commercial tools [8] which inform users when web pages are changed. In [39], the authors present WebCQ, a prototype system for large-scale Web information monitoring and delivery. It is designed to discover and detect changes to web pages efficiently and to provide a personalized notification of what and how web pages of interest have been changed. In [49] the authors show the feasibility of automatically extracting data from web pages by using approximate matching techniques. This can be applied to generate automatic wrappers, to notify users of or display web page differences, to monitor web page changes, etc. In [51] the authors present an approach which collects a couple of example objects from the user and uses this information to extract new objects from semi-structured data from web sources. In each of the technologies mentioned above, approximate matching of complex structures such as trees and graphs forms an integral part.
There are several methods of performing approximate pattern matching between two or more structures. One way is to measure the edit distance, i.e., the cost of transforming one structure into the other, quite often through a series of edit operations.
Depending on the requirements of the application and the type of distance measure required, various constraints can be placed on the calculation of the edit distance. For instance, HTML or XML documents share the property that the actual values carrying the information are most often at the leaves of the tree, while inner vertices represent the structural component of the document. Therefore one could require that they only be modified at the leaves. This is the concept of the degree-1 constraint presented in this chapter.
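As an informal illustration of this idea, the sketch below allows a vertex to be deleted or inserted only when it sits on the periphery of the structure. It assumes, for the purpose of the example only, that "at the leaves" means a vertex with at most one neighbor in an undirected adjacency map; the formal definition is the one given in Section 3.1, and the function names are hypothetical.

```python
def degree(adj, x):
    """Number of vertices directly connected to x by an edge."""
    return len(adj[x])


def delete_leaf(adj, x):
    """Delete x only if it lies on the periphery (degree at most one)."""
    if degree(adj, x) > 1:
        raise ValueError(f"{x!r} is an inner vertex and cannot be deleted")
    # Copy the graph without x and without the edge that attached it.
    return {v: set(ns) - {x} for v, ns in adj.items() if v != x}


def insert_leaf(adj, x, neighbor):
    """Attach a new vertex x to an existing vertex as a leaf."""
    if x in adj or neighbor not in adj:
        raise ValueError("x must be new and neighbor must already exist")
    new_adj = {v: set(ns) for v, ns in adj.items()}
    new_adj[neighbor].add(x)
    new_adj[x] = {neighbor}
    return new_adj


doc = {"html": {"body"}, "body": {"html", "h1", "p"}, "h1": {"body"}, "p": {"body"}}
without_p = delete_leaf(doc, "p")
```

In this small document tree, the vertex p can be removed and re-inserted, whereas an attempt to delete the inner vertex body is rejected.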
In this regard, we shall focus on finding the edit distance between two complex structures such as ordered and unordered trees and acyclic graphs, under the degree-1 constraint. Under this constraint, edit operations can only be performed at the leaf level of the tree or at the periphery of a graph. The work in [72] addresses the problem of comparing connected, undirected, acyclic and labeled graphs (CUAL graphs). In view of the challenge associated with the problem of finding the edit distance between two CUAL graphs, proven to be NP-complete, they propose a constrained distance metric, called the degree-2 distance, which requires that any node to be inserted or deleted have no more than two neighbors. Their algorithm runs in $O(N_1 N_2 D^2)$ time, or in $O(N_1 N_2 D \sqrt{D} \log D)$ time, where $D = \min\{d_1, d_2\}$ and $d_i$ is the maximum degree of $G_i$. The degree-1 constraint we describe in this text also serves to simplify the problem of finding the edit distance between two CUAL graphs. We argue the relevance in practical situations of such a constraint, which requires that edit operations can only be performed at the leaf level of a tree or at the periphery of a graph.
We describe the concept of edit distance under the degree-1 constraint in Section 3.1. In Section 3.2, we present three algorithms to evaluate the edit distance between complex structures under the degree-1 constraint. In Section 3.3 we analyze the time complexity of the three algorithms. A simple example is presented in Section 3.4. We then conclude this chapter with a summary in Section 3.5.
3.1 The degree-1 Constraint
There are not only different notions of similarity, distance, and approximate matching corresponding to different data structures, but there are also different such notions for the same data structure corresponding to different needs. The respective efficiencies of the algorithms computing the unit weight edit distance, the edit distances defined under the degree-2 constraint, the edit distance we propose under the degree-1 constraint, and others are not comparable, since the notions correspond to different needs for different applications. Their effectiveness can only be discussed in light of the requirements of the application.
The degree of a vertex x in a CUAL structure is defined to be the number of vertices directly connected to x by means of an edge. Since the algorithms presented in this text are primarily concerned with trees and acyclic graphs, the definition of degree does not allow for self loops and multiple edges between two vertices. Before describing the degree-1 constraint we first clarify the notion of edit distance and edit operations.
Edit distance is defined to be the minimum number of edit operations required to transform one structure to another, be it a string, a tree or a graph. There are three kinds of edit operations in trees and graphs:
1. relabel: Relabeling a vertex x means changing the label on x.
2. delete: Deleting a vertex x means making the neighbors of x (except an arbitrarily specified neighbor x′) become the neighbors of x′ and then removing x.
3. insert: Inserting a vertex is the inverse of deleting one.
The neighbors of a character s in a string are the characters s′ and s″ on either side of s, where s′ and/or s″ may be empty. A neighbor of a vertex x in a tree or a graph is any vertex x′ that is directly connected to x by a single edge. Although we are only concerned with inserts and deletes for our algorithms, we describe the relabeling operation for the sake of completeness.