Approximate matching of complex structures


I am extremely grateful to those who have helped me in different ways to materialize this thesis. First of all, I wish to thank my supervisor, Dr Stéphane Bressan, for providing me his extremely valuable guidance and for teaching me what research is all about. His constant motivation and deep insight have enabled me to develop as a researcher.

I sincerely thank Dr Anirban Mondal and Mr Vinsensius Vega for the tremendous help and support that they had extended to me. I would also like to acknowledge Dr Ng Wee Siong, Mr Anand Ramchand, Mr Li Shiau Cheng, Mr Ajay Hemnani, Mr Liau Chu Yee, Mr Tok Wee Hyong, Mr Li Yingguang, Mr Ong Twee Hee and all the members of Database and Electronic Commerce Laboratories for their friendship and willingness to help me in various ways.

I ardently wish to thank my family for their tremendous support. Last, but not the least, I sincerely thank the National University of Singapore for providing me with the opportunity to complete my postgraduate studies.


Summary
1.1 Summary of Contributions
1.2 Organization of the Thesis
2 Related Work and Background Information
2.1 Approximate Matching in Strings
2.1.1 Definition
2.1.2 Problem Definition
2.1.3 Algorithms
2.1.4 Timeline
2.2 Approximate Matching in Trees and Graphs
2.2.1 Definition
2.2.3 Algorithms
2.2.4 Timeline
2.3 Approximate Matching of Strings and Regular Expressions
2.3.1 Definition
2.3.3 Algorithms
2.3.4 Timeline
3 Approximate matching of trees and graphs under the degree-1 constraint
3.1 The degree-1 Constraint
3.1.1 Edit Distance
3.1.2 The degree-1 concept
3.2 The Algorithms
3.2.1 Preliminaries
3.2.2 Ordered Tree Algorithm
3.2.3 Unordered Tree Algorithm
3.2.4 Acyclic Graph Algorithm
3.3 Complexity Analysis
3.4 Example
3.5 Summary
4 Approximate matching of strings and regular expressions
4.1 Problem Definition
4.2 Background Information
4.2.1 Grammars and Languages
4.2.2 Regular Expressions
4.2.3 Finite State Automata
4.2.4 Chomsky Hierarchy
4.2.5 Types of Regular Expressions
4.3 A Simple String to Regular Expression Pattern Matching Machine
4.4 Existing Algorithm - Myers and Miller
4.4.1 Discussion
4.5 Our Algorithm - RPM
4.5.1 Background
4.5.2 The Idea
4.5.3 The Algorithm
4.5.4 Examples
4.6 Performance Evaluation
4.6.1 Experimental Setup
4.6.2 Results
4.7 Summary

1.1 HTML Code and its tree representation
2.1 An alignment for S = abcaacaca and T = acaacacda
2.2 Edit Distance Matrix for S1 = abba and S2 = aabaca
2.3 Ordered Tree Example
2.4 Unordered Trees Example
2.5 Delete and Insert Edit operations on Trees
3.1 Ordered Tree Algorithm under the degree-1 constraint
3.2 Bipartite Graph G_B = (V1, V2, E)
3.3 (a) Matching M_B1 (b) Matching M_B2
3.4 Unordered Tree Algorithm under the degree-1 constraint
3.5 Acyclic Graph Algorithm under the degree-1 constraint
3.6 Two Example Trees
3.7 Ordered Edit Distance Matrix
3.8 Unordered Edit Distance Matrix
4.1 DFA Example
4.2 NFA Example
4.3 Class 1 Regular Expression
4.8 Special Arcs
4.9 Transition graph for R = a*b
4.10 Myers representation: F_a
4.11 Myers representation: F_R|S
4.12 Myers representation: F_RS
4.13 Myers representation: F_R*
4.14 Myers algorithm for approximate matching of regular expression
4.15 Our algorithm for approximate matching of regular expression
4.16 RPM Example 1: R = ab*cb and S = bbbaab
4.17 RPM Example 2: R = ca*ab* and S = baaabb
4.18 RPM Example 3: R = abb*a*bc and S = ccbbbc
4.19 Performance Analysis: Varying Length of Regular Expression, |R|
4.20 Performance Analysis: Varying Length of String, |S|
4.21 Performance Analysis: Varying size of alphabet Σ, |Σ|
4.22 Performance Analysis: Varying number of Kleene closures, |*|
4.23 Performance Analysis: Special Cases
5.1 Myers Example: R = a* and S = aaa
5.2 Myers Edit Distance Matrix

Approximate pattern matching techniques in various structures such as strings, trees, graphs and regular expressions form the basis of many commercial applications available today in important fields such as bio-informatics and information extraction. This thesis presents a detailed review of some of the basic and important algorithms and ideas over the past 40 years in the area of approximate pattern matching. In particular, we address the problem of approximate pattern matching specifically with respect to the approximate matching of structures that are more complicated than strings, namely trees and regular expressions. Approximate pattern matching of complex structures such as trees, graphs and regular expressions is a primitive operation essential to applications in information retrieval, information integration and mediation, and in many such domains that require evaluating or characterizing the similarity between structured and complex objects such as HTML documents, molecular compounds and XML data.

The main contributions of our work are two-fold. First, we present new algorithms for the approximate matching of trees (ordered and unordered) and acyclic graphs based on edit distance measures under the degree-1 constraint, the implication being that the relevant information is located at the leaves of a tree or at the periphery of a graph. Under the degree-1 constraint, edit operations can be performed only on vertices with degree ≤ 1. The ordered and unordered tree algorithms have a worst-case execution time of $O(|T_1| \cdot |T_2| \cdot k^2 \log k)$ and the algorithm for acyclic graphs has a worst-case execution time of $O(|T_1|^2 \cdot |T_2|^2 \cdot k^2 \log k)$. Second, we consider the problem of approximate matching of a string with a special type of regular expression where the Kleene closure (*) is only allowed to be bound to a single character. In this regard, we present a new algorithm which exploits the special properties of such a regular expression, thereby enabling us to reduce the approximate regular expression matching problem to that of an approximate string matching problem. Our algorithm for approximate matching of a string S with a class 2 regular expression R, which we designate as RPM, runs in $O(|S|^3)$ time and space in the worst case. Our performance evaluation indicates that our proposed techniques indeed outperform an existing well-known algorithm for approximate regular expression matching in terms of execution times. This may be primarily attributed to the approximate string matching nature of our algorithm, which makes use of simple arithmetic operations.

We plan to extend the work done in this thesis in the near future by trying to address more complex issues in the field of approximate pattern matching. This research effort has laid the foundation for considering the problem of approximate matching of two regular expressions, but we believe that there are still a lot of open research issues in this field. In addition to this, we also aim to try and employ multiple sequence alignment techniques in order to derive a valid or an optimal schedule in a client-server architecture under delay constraints.

Approximate pattern matching techniques in various structures such as strings, trees, graphs and regular expressions form the basis of many important as well as diverse commercial applications ranging from traditional applications associated with information extraction to more specialized applications involving bio-informatics. The World Wide Web (WWW) is growing at an exponential rate with new websites emerging every day. The WWW hosts and serves large amounts of documents containing data (primarily in textual form) pertaining to essentially all domains of human activity, e.g., art, education, travel, science, politics and business, thereby making it a very large-scale distributed global information resource residing on the Internet. Notably, the information on the WWW is potentially useful for both individuals and businesses. There is an increasing need for the convergence of database and information retrieval support to new application domains such as information interchange over the Internet with XML, bio-computing, distributed directory servers with LDAP, or the management of hypermedia over the World Wide Web. At the heart of this convergence is the possibility of evaluating the similarity of the objects in question according to appropriate metrics. Objects in the applications mentioned above have in common their complex structure, which is often that of a tree or a graph. Tree and graph approximate matching algorithms provide a variety of similarity measures for these objects.

The HyperText Markup Language (HTML) is the lingua franca for publishing data on the Web. Unfortunately, HTML has been designed for display purposes, the implication being that it is primarily meant for human consumption as opposed to machine consumption. But for the WWW to reach its full potential, the data should be defined and linked in such a way that it can be used for more effective discovery, automation, integration, and reuse across various applications. Now, just like people need to have agreement on the meanings of the words which they employ in their communication, computers need mechanisms for agreeing on the semantics of the data in order to communicate effectively. The World Wide Web Consortium (W3C), in collaboration with a large number of researchers and industrial partners, is now exploring the possibility of creating a Semantic Web, in which the meaning is made explicit (through the use of meta-data), thereby allowing machines to process and integrate Web resources intelligently.

Intuitively we can understand that documents in a markup language can generally be represented as a tree, as illustrated in Figure 1.1. Being able to evaluate the similarity between two documents is the basis for search and integration. Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing and semantic query processing. Automatic schema matching has become one of the key areas of research in the field of computer science due to the rapidly increasing number of web data sources and E-businesses to integrate. As systems become able to handle more complex databases and applications, their schemas become larger, further increasing the number of matches to be performed. Seeing how data in an HTML or XML document is primarily stored at the leaves or on the periphery of the tree or graph that describes it, we devise an algorithm that takes advantage of this property when comparing two different structures.

Figure 1.1: HTML Code and its tree representation

Although data may be represented in various ways, text still remains the primary medium for exchanging information on the WWW. This is particularly evident in the domains of literature or linguistics, where data are composed of huge corpora and dictionaries. This is also applicable to computer science, where a large amount of data is stored in linear files. Given such predominantly textual nature of the data residing in the WWW, efficient pattern matching algorithms and compression techniques become necessary to address semantic issues associated with the data. While the problem of pattern matching pertains to locating a specific pattern inside raw data (a pattern is usually a collection of strings described in some formal language), the aim of data compression is to provide a representation of data in a reduced form in order to save both storage space and transmission time such that there is no loss of information (the compression processes are reversible).

Incidentally, both pattern matching and compression techniques apply to the manipulation of texts (word editors), the storage of textual data (text compression) and data retrieval systems (full text search). Additionally, they are basic components used in the implementations of practical software existing under most operating systems. Moreover, they emphasize programming methods that serve as paradigms in other fields of computer science (system or software design). Finally, they also play an important role in theoretical computer science by providing challenging problems. In this thesis, we specifically focus on the problem of pattern matching.

Pattern matching of textual data arises in several important commercial applications. Sequential pattern mining, i.e., the mining of frequent subsequences as patterns in a sequence database, is an important data mining task with broad applications. Interestingly, this is also the case in molecular biology because biological molecules can often be approximated as sequences of nucleotides or amino acids. Furthermore, the amount of available data in these fields tends to double every eighteen months, thereby underlining the necessity of efficient pattern matching algorithms even if the speed and storage capacity of computers increase regularly. Moreover, pattern matching is also a key part of various other applications including analysis of consumer behaviors, web access patterns, process analysis of scientific experiments and prediction of natural disasters, to mention just a few. Incidentally, ordered, labeled trees are often deployed in pattern matching. Ordered, labeled trees are trees in which each vertex has a label and the left-to-right order of its children (if any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology, programming compilation, and natural language processing. Many of the applications involve comparing trees or retrieving/extracting information from a repository of trees. Examples include classification of unknown patterns, analysis of newly sequenced RNA structures, semantic taxonomy for dictionary definitions, generation of interpreters for non-procedural programming languages, and automatic error recovery and correction for programming languages.

Multiple sequence alignment can be seen as a generalization of pairwise sequence alignment: instead of aligning two sequences, k sequences are aligned simultaneously, where k is any number greater than 2. Multiple sequence alignment is particularly useful in the field of bioinformatics because it allows biologists to extract and represent biologically important but faintly/widely dispersed sequence similarities, giving them hints about the evolutionary history of certain sequences. The problem of multiple sequence alignment has been shown to be NP-Complete in general and is therefore not likely to be solved in polynomial time. For NP-Complete problems, there is (almost) no hope that there is an algorithm that is not exponential in its complexity. The algorithm for multiple sequence alignment has a time complexity of $\Theta(2^N L^N)$ and a space complexity of $\Theta(L^N)$. It turns out that not all cells of the cube (for a 3 sequence case) and, in general, of the N-dimensional matrix need to be computed, and the order of computation can also be heuristically optimized. In addition to uses in bio-informatics, multiple sequence alignment could probably also be used to generate an optimal or, in most cases, a valid schedule given certain delay constraints in a client-server architecture. We touch on this possible application of multiple sequence alignment in the future work sections in Chapter 5.

Approximate matching of regular expressions forms the basis of many search procedures in various applications. Searching for a pattern in a text file is a very common operation in many applications ranging from conventional applications such as text editors to more sophisticated applications in molecular biology. Incidentally, schema matching is also a basic problem in many database application domains such as data integration, E-business, data warehousing and semantic query processing. Schema matching is usually performed manually through some form of a graphical interface. Such methods have the disadvantages of being cumbersome, time-consuming and error-prone. By combining existing schema matching techniques with customized approximate pattern matching algorithms for regular expressions, one can automate an otherwise largely manual operation. Rahm et al. [19] survey a number of approaches to automatic schema matching.

1.1 Summary of Contributions

This work focusses on problems of approximate pattern matching in complex structures such as trees, graphs and regular expressions. The contributions of this thesis are two-fold.

• We present new algorithms for the approximate matching of trees (ordered and unordered) and acyclic graphs based on edit distance measures under the degree-1 constraint, the implication being that the relevant information is located at the leaves of a tree or at the periphery of a graph. Under the degree-1 constraint, edit operations can be performed only on vertices with degree less than or equal to 1. The ordered and unordered tree algorithms have a worst-case execution time of $O(|T_1| \cdot |T_2| \cdot k^2 \log k)$ and the algorithm for acyclic graphs has a worst-case execution time of $O(|T_1|^2 \cdot |T_2|^2 \cdot k^2 \log k)$. Our work on approximate matching of trees and acyclic graphs under the degree-1 constraint has been submitted for publication to a well known journal and we are currently awaiting the results of the review.

• We consider the problem of approximate matching of a string with a special type of regular expression where the Kleene closure (*) is only allowed to be bound to a single character. In this regard, we present a new algorithm which exploits the special properties of such a regular expression, thereby enabling us to reduce the approximate regular expression matching problem to that of an approximate string matching problem. Our algorithm RPM for approximate matching of a string S with a class 2 regular expression R runs in $O(|S|^3)$ time and space in the worst case. Our performance evaluation indicates that our proposed techniques indeed outperform an existing well-known algorithm, Myers [45], for approximate regular expression matching in terms of execution times. This may be primarily attributed to the approximate string matching nature of our algorithm, which makes use of simple arithmetic operations, whereas Myers needs to first construct a regular expression edit graph before it starts traversing the edges of this graph to provide the desired result.

1.2 Organization of the Thesis

The remainder of the thesis is organized as follows:

• Chapter 2 provides an overview of existing works in the field of approximate matching of strings, trees, graphs and regular expressions.

• Chapter 3 presents our algorithm for approximate matching of ordered and unordered trees and acyclic graphs under the degree-1 constraint.

• Chapter 4 discusses the work we have done in the area of string to regular expression matching in a special case, and compares our algorithm with an existing string to regular expression approximate matching algorithm.

• Finally, we conclude in Chapter 5 with directions for future work.

Related Work and Background Information

This chapter describes existing works in the field of approximate matching of strings, trees, graphs and regular expressions. We begin our discussion by looking at the problem of approximate matching of strings in Section 2.1. We present the notion of edit operations on strings and edit distance with respect to strings and introduce the edit distance matrix, a visualization which is central to many exact and approximate matching algorithms. We then present a brief survey of some of the key exact and approximate matching algorithms on strings. In Section 2.2, we discuss the issue of approximate matching in more complex structures like trees and graphs. This is a relatively new area of research in approximate matching when compared to strings. We then discuss some of the existing work done on different types of tree and graph structures. In Section 2.3, we aim to give a brief overview of the work done on approximate matching of strings with regular expressions.

2.1 Approximate Matching in Strings

The string matching problem is the most studied problem in algorithmics on words and there are many algorithms for solving this problem efficiently. In practical pattern-matching applications, exact matching is not always relevant. It is often more important to find objects that match a given pattern in a reasonably approximate manner, i.e., allowing some errors.

Approximate string matching consists in finding all approximate occurrences of pattern x in text y. There exist a number of methods to compare two strings or sequences. One of the most common ways is the notion of similarity between two strings. A similarity measure is a function that associates a numeric value with a pair of sequences, with the idea that a higher value indicates greater similarity. A similarity measure can have both positive and negative values depending on the properties of the scoring function used.

The notion of distance is somewhat dual to similarity. It treats sequences as points in a metric space. A distance measure is a function that also associates a numeric value with a pair of sequences, but with the idea that the larger the distance, the smaller the similarity, and vice versa. Distance measures usually satisfy the mathematical axioms of a metric. In particular, distance values are never negative. Approximate occurrences of x are segments of y that are close to x according to a specific distance: their distance to x must not be greater than a given integer k. There are two standard distances: the Hamming distance and the edit distance.

The Hamming distance, H, is defined only for strings of the same length. For two strings S and T, H(S, T) is the number of places in which the two strings differ, i.e., the places where they have different characters. The Hamming distance is related to the number of mismatches between the pattern and its approximate occurrences. This problem is also called approximate string matching with k mismatches. For example, if S = aababb and T = bbbaba then H(S, T) = 3, corresponding to the mismatches in positions 1, 2 and 6 in the strings.

The edit distance between two strings is the minimal number of edit operations (insert, delete and change) needed to transform one string into the other. It is defined for strings of arbitrary length. For example, if S = aab and T = accab, the minimum number of edit operations required to transform S to T is 2, corresponding to the insertion of the two c's. There could be more than one sequence of operations to transform one string into the other. If each operation is assigned the same cost, this is also known as the Levenshtein distance. We shall describe in detail properties of the edit distance between strings later in this section.
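As a minimal sketch of the Hamming distance and of matching with k mismatches described above, the following Python fragment may help. The function names and the 0-based start positions returned by `k_mismatch_occurrences` are assumptions made for illustration, not notation from this thesis.

```python
def hamming_distance(s: str, t: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance is defined only for strings of the same length")
    return sum(1 for a, b in zip(s, t) if a != b)


def k_mismatch_occurrences(x: str, y: str, k: int) -> list[int]:
    """Return the 0-based start positions of the segments of text y whose
    Hamming distance to pattern x is at most k (brute force, O(|x| * |y|))."""
    m = len(x)
    return [
        start
        for start in range(len(y) - m + 1)
        if hamming_distance(x, y[start:start + m]) <= k
    ]


if __name__ == "__main__":
    # The example from the text: mismatches at positions 1, 2 and 6.
    print(hamming_distance("aababb", "bbbaba"))
    print(k_mismatch_occurrences("ab", "abcab", 0))
```

The brute-force scan is chosen only for clarity; it makes no attempt at the efficiency of the dedicated k-mismatch algorithms in the literature.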

The longest common subsequence (LCS) problem is a particular case of the edit distance problem in strings. Given two strings S and T of length n and m respectively, if l is the length of LCS(S, T), then one can transform S to T by first deleting the n − l characters of S (all but those of a longest common subsequence) and then inserting m − l symbols to get T. For example, if S = abbcc and T = acb with n = 5 and m = 3, LCS(S, T) = ab or ac with l = 2. Taking the LCS ab, we first delete the 3 (5 − 2) non-LCS characters from S, namely bcc, and then insert the 1 (3 − 2) non-LCS character from T, namely c, in the correct position to obtain T from S.
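The relation between the LCS length l and the number of deletions and insertions can be checked with a short sketch. It uses the textbook quadratic dynamic program for the LCS length; the names `lcs_length` and `indel_distance` are hypothetical and chosen only for this illustration.

```python
def lcs_length(s: str, t: str) -> int:
    """Length of a longest common subsequence of s and t (O(n * m) table)."""
    n, m = len(s), len(t)
    # table[i][j] holds the LCS length of the prefixes s[:i] and t[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def indel_distance(s: str, t: str) -> int:
    """Deletions plus insertions needed to turn s into t: (n - l) + (m - l)."""
    l = lcs_length(s, t)
    return (len(s) - l) + (len(t) - l)


if __name__ == "__main__":
    print(lcs_length("abbcc", "acb"))      # l for the example in the text
    print(indel_distance("abbcc", "acb"))  # 3 deletions + 1 insertion
```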

S: a b c a a c a c _ a
T: a _ c a a c a c d a

Figure 2.1: An alignment for S = abcaacaca and T = acaacacda

Another way of representing the differences (or similarity) between two strings (or sequences), which is one of the central concepts in bio-informatics, is the notion of alignment. An alignment is a mutual arrangement of two sequences such that it exhibits where the two sequences are similar, and where they differ. An optimal alignment is understandably one that exhibits the most correspondences and the least differences. The alignment problem is another interesting variation of the edit distance problem where gaps or ‘empty’ strings are inserted in each of the two strings such that common characters are matched, whereas ‘unmatched’ characters can be inserted or deleted. Figure 2.1 depicts a possible alignment for strings S = abcaacaca and T = acaacacda which corresponds to a deletion of b in S and an insertion of d in T.
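One way to obtain such an alignment, sketched here under the assumption of unit costs, is to fill a table of prefix edit distances and then trace back from the last cell, emitting a gap whenever a deletion or insertion step is taken. The function name `align`, the gap symbol `_` and the tie-breaking order (diagonal first) are choices made for this illustration only.

```python
def align(s: str, t: str, gap: str = "_") -> tuple[str, str]:
    """Return one optimal unit-cost alignment of s and t as two gapped strings."""
    m, n = len(s), len(t)
    # dist[i][j] is the unit-cost edit distance between s[:i] and t[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            change = int(s[i - 1] != t[j - 1])
            dist[i][j] = min(dist[i - 1][j] + 1,       # delete s[i-1]
                             dist[i][j - 1] + 1,       # insert t[j-1]
                             dist[i - 1][j - 1] + change)

    # Trace back from (m, n) to (0, 0), preferring the diagonal step.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + int(s[i - 1] != t[j - 1]):
            top.append(s[i - 1]); bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            top.append(s[i - 1]); bottom.append(gap)   # character of s facing a gap
            i -= 1
        else:
            top.append(gap); bottom.append(t[j - 1])   # character of t facing a gap
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    row_s, row_t = align("abcaacaca", "acaacacda")
    print(row_s)
    print(row_t)
```

Other tie-breaking orders can yield different but equally cheap alignments, which is why the text speaks of a possible alignment.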

Before we move on to the major algorithms on string matching, we shall first present a definition of a string proposed by Crochemore and Rytter (1994)[18].

Let Σ be an input alphabet, i.e., a finite set of symbols. Elements of Σ are called characters. A string over Σ is defined as a finite sequence of elements of Σ. The length of a string S, |S|, is defined as the number of elements (with repetitions) in the string S. Therefore the length of abbab is 5.

The i-th element of the string S is denoted by S[i] and i is its position in S. For example, the 4th character in the string pattern is S[4] = ‘t’. A substring of S, denoted by S[i..j], is the sequence of elements S[i]S[i + 1]...S[j] in S. For example, pat is a substring of pattern. A string S_seq is a subsequence of S if S_seq can be obtained from S by removing zero or more (not necessarily adjacent) letters from it. For example, pen is a valid subsequence of pattern. Intuitively, S_seq is a subsequence of S if S_seq = S[i_1]S[i_2]...S[i_m] where i_1, i_2, ..., i_m is an increasing sequence of indices in S.
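The difference between a substring and a subsequence can be made concrete with a small sketch. The helper names are hypothetical, and `substring` takes 1-based inclusive positions to mirror the S[i..j] notation, whereas Python itself indexes from 0.

```python
def substring(s: str, i: int, j: int) -> str:
    """Return S[i..j] using 1-based, inclusive positions."""
    return s[i - 1:j]


def is_subsequence(candidate: str, s: str) -> bool:
    """True if candidate can be obtained from s by removing zero or more
    (not necessarily adjacent) characters, keeping the remaining order."""
    remaining = iter(s)
    # Each membership test consumes the iterator up to the matched character,
    # so later characters of candidate must be found strictly further right.
    return all(ch in remaining for ch in candidate)


if __name__ == "__main__":
    print(substring("pattern", 1, 3))        # contiguous: pat
    print(is_subsequence("pen", "pattern"))  # indices 1, 5, 7 are increasing
    print(is_subsequence("nep", "pattern"))  # order is violated
```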

Given two strings, S1 and S2, of length m and n (m ≤ n) respectively, the very basic form of the exact string matching problem is a membership decision problem, i.e., verify if S1 occurs in S2. The output is a boolean value: S1 is either a member of S2 or it isn't. As mentioned earlier in Section 2.1, very often this is not very helpful. A more interesting scheme would be to see how far S1 is from S2.

The Approximate String Matching problem is defined as the problem of transforming S1 to S2 via a series of edit operations. Now we shall discuss edit operations on strings.

Edit Operations in Strings

When two strings S1 and S2 do not exactly match, there exist errors corresponding to the differences between the two strings. Let ∅ represent an empty structure. An edit operation can be represented as a pair (u, v) ≠ (∅, ∅), sometimes written u → v. There are three kinds of edit operations:

1. change: symbols at corresponding positions are different. A change operation is represented by (u, v) where u ≠ ∅ and v ≠ ∅.

2. insert: a symbol of S2 is missing in S1 at a corresponding position. An insert operation is represented by (u, v) where u = ∅ and v ≠ ∅.

3. delete: a symbol of S1 is missing in S2 at a corresponding position. A delete operation is represented by (u, v) where u ≠ ∅ and v = ∅.
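The pair representation of the three operations above can be sketched as follows. Using `None` to stand for the empty structure ∅ and the function name `classify_edit_operation` are assumptions made purely for illustration.

```python
from typing import Optional


def classify_edit_operation(u: Optional[str], v: Optional[str]) -> str:
    """Classify the pair (u, v), with None playing the role of the empty structure."""
    if u is None and v is None:
        # (∅, ∅) is excluded by the definition of an edit operation.
        raise ValueError("(None, None) is not an edit operation")
    if u is None:
        return "insert"   # a symbol of S2 is missing in S1
    if v is None:
        return "delete"   # a symbol of S1 is missing in S2
    return "change"       # both sides are non-empty


if __name__ == "__main__":
    print(classify_edit_operation("b", "a"))
    print(classify_edit_operation(None, "c"))
    print(classify_edit_operation("b", None))
```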


Edit Distance between Strings

One would always look for the best way to transform S1 to S2, i.e., the minimum number of differences between S1 and S2. This can be translated as the smallest number of edit operations (change, insertion and deletion) needed to transform S1 to S2. This is called the edit distance between S1 and S2 and is denoted by δ(S1, S2). Three properties are satisfied at all times. They are as follows:

• δ(S1, S2) = 0 if and only if S1 = S2: if both the strings are the same, the minimum distance between them is 0, as no edit operation is required to transform one into the other.

• δ(S1, S2) = δ(S2, S1): a fundamental property of the edit distance is that it is symmetric. This comes from the duality between the deletion and the insertion operations. A deletion of a character a in S1 in order to get S2 corresponds to an insertion of a in S2 to get S1.

• δ(S1, S2) ≤ δ(S1, S3) + δ(S3, S2) (triangle inequality).

Central to many of the algorithms based on the edit distance scheme is the edit distance matrix. Assume the strings S1 and S2 are of fixed length m and n such that n ≥ m. Each cell in the edit distance matrix, EDIT, has a value equal to δ(S1[1..i], S2[1..j]), with 0 ≤ i ≤ m and 0 ≤ j ≤ n. The boundary values of EDIT are defined as follows:

for 0 ≤ i ≤ m, 0 ≤ j ≤ n, EDIT[0, j] = j; EDIT[i, 0] = i.

The rest of the elements of EDIT can be computed using the simple formula

$$
\mathrm{EDIT}[i, j] = \min
\begin{cases}
\mathrm{EDIT}[i-1, j] + \mathit{cost}(\text{delete}) \\
\mathrm{EDIT}[i, j-1] + \mathit{cost}(\text{insert}) \\
\mathrm{EDIT}[i-1, j-1] + \delta(S_1[i], S_2[j])
\end{cases}
$$

The matrix can also be viewed as a graph in which each vertex (i − 1, j − 1) is connected to three other vertices, (i − 1, j), (i, j − 1) and (i, j), when i ≤ m and j ≤ n. The edit distance between strings S1 and S2 equals the length of a least weighted path in this graph from the source (0, 0) to the sink (m, n). An edge from (i − 1, j − 1) to (i, j − 1) represents a deletion edge and has a discrete integer cost assigned to it, represented by cost(delete). An edge from (i − 1, j − 1) to (i − 1, j) represents an insertion edge and has a discrete integer cost assigned to it, represented by cost(insert). An edge from (i − 1, j − 1) to (i, j) represents a replacement edge and its cost is represented by cost(change) = δ(S1[i], S2[j]).

Figure 2.2: Edit Distance Matrix for S1 = abba and S2 = aabaca

Figure 2.2 shows the edit distance matrix for strings S1 = abba and S2 = aabaca. We assume each edit operation (delete, insert and change) has a cost of 1. The minimum edit distance, δ(S1, S2), is represented by EDIT[4][6] = 3. This reflects the change operation b → a and the insertion of c and a. There are several paths from source to sink in this case. One possible path is EDIT[0][0] → EDIT[1][1] → EDIT[2][2] → EDIT[3][3] → EDIT[4][4] → EDIT[4][5] → EDIT[4][6].
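The construction of the edit distance matrix for this example can be sketched in a few lines. This is an illustrative implementation of the standard dynamic program under the cost assumptions stated in the text; the function name and the keyword parameters `cost_delete`, `cost_insert` and `cost_change` are hypothetical and default to the unit costs used here.

```python
def edit_distance_matrix(s1: str, s2: str,
                         cost_delete: int = 1,
                         cost_insert: int = 1,
                         cost_change: int = 1) -> list[list[int]]:
    """Return the (m+1) x (n+1) matrix whose cell [i][j] is the edit
    distance between the prefixes s1[:i] and s2[:j]."""
    m, n = len(s1), len(s2)
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary values: deleting i characters, or inserting j characters.
    for i in range(1, m + 1):
        edit[i][0] = edit[i - 1][0] + cost_delete
    for j in range(1, n + 1):
        edit[0][j] = edit[0][j - 1] + cost_insert
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # A change costs nothing when the two characters already agree.
            change = 0 if s1[i - 1] == s2[j - 1] else cost_change
            edit[i][j] = min(edit[i - 1][j] + cost_delete,
                             edit[i][j - 1] + cost_insert,
                             edit[i - 1][j - 1] + change)
    return edit


if __name__ == "__main__":
    matrix = edit_distance_matrix("abba", "aabaca")
    for row in matrix:
        print(row)
    print("distance:", matrix[4][6])
```

Reading the cells along the path quoted above shows where the cost accumulates: the value rises by one at the change step and at each of the two insertion steps.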

The naive exact matching algorithm in strings locates all occurrences in time O(nm). But hashing provides a simple method that avoids the quadratic number of symbol comparisons in most practical situations, and that runs in linear time under reasonable probabilistic assumptions (Harrison (1971)[25] and Karp and Rabin (1987)[31]). The two most famous exact string matching algorithms were devised by Morris and Pratt (1970)[30] and Boyer and Moore (1977)[9]. The first linear-time string-matching algorithm was developed by Morris and Pratt (1970). It was improved by Knuth, Morris, and Pratt (1977)[34]. The algorithm's preprocessing phase runs in O(m) space and time, and its searching phase runs in O(n + m) time (independent of the alphabet size). The Boyer and Moore algorithm (1977) is considered the most efficient string-matching algorithm in usual applications. It consists of a preprocessing phase that runs in O(m + σ) time and space and a searching phase that runs in O(mn) time in the worst case. A simplified version of it (or the entire algorithm) is often implemented in text editors for the “search” and “substitute” commands. Several variants of Boyer and Moore's algorithm avoid the quadratic behavior when searching for all occurrences of the pattern. The most efficient solutions in terms of number of symbol comparisons have been designed by Apostolico and Giancarlo (1986)[5] and Colussi (1994)[15].

Although the idea of approximate pattern matching is ubiquitous in information processing, it first clearly appeared in the earlier work on approximate string matching. Wagner and Fischer (1974)[65] define an edit distance between two strings and an algorithm for its computation. The distance between two strings is given by the minimum number of operations (insertion of a letter, deletion of a letter or change of a letter) required to transform one string into the other. The authors present an algorithm which runs in O(nm) time, where each cell of the matrix holds the minimum of three costs: a delete operation applied to the cell above it, an insert operation applied to the cell to its left, and a change operation applied to the cell diagonally above and to its left. The edit distance between the two strings is found in the final cell of the two-dimensional matrix. This is the unit edit distance; more generally, different operations can be assigned different weights.
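The matrix computation described above can be sketched as follows. This is a minimal illustration under unit costs, not the authors' original formulation; the function name `edit_distance` and the example strings are assumptions introduced here.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Unit-cost edit distance using a (len(s1)+1) x (len(s2)+1) matrix."""
    n, m = len(s1), len(s2)
    # edit[i][j] = distance between the first i symbols of s1
    # and the first j symbols of s2.
    edit = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        edit[i][0] = i  # delete all i symbols
    for j in range(1, m + 1):
        edit[0][j] = j  # insert all j symbols
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            change = 0 if s1[i - 1] == s2[j - 1] else 1
            edit[i][j] = min(
                edit[i - 1][j] + 1,           # delete (cell above)
                edit[i][j - 1] + 1,           # insert (cell to the left)
                edit[i - 1][j - 1] + change,  # change or match (diagonal)
            )
    return edit[n][m]  # the final cell holds the distance


print(edit_distance("kitten", "sitting"))  # 3
```

Weighted variants would replace the three constants by per-operation cost functions.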

The notion of a longest common subsequence (LCS) of two strings is widely used to compare files. The diff command of the UNIX system implements an algorithm based on this notion, in which the lines of the files are considered as symbols. Informally, the result of a comparison gives the minimum number of operations (insert a symbol, or delete a symbol) required to transform one string into the other. The comparison of molecular sequences is basically done with a closely related concept, the alignment of strings, which consists in aligning their symbols on vertical lines. This is related to an edit distance, called the Levenshtein distance, with the additional operation of substitution, and with weights associated with the operations. Hirschberg (1975)[27] presents the computation of the LCS in linear space. Aho, Hirschberg and Ullman (1976)[3] show that unless a bound on the total number of distinct symbols is assumed, every solution to the problem can consume an amount of time that is proportional to the product of the lengths of the two strings.
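The relation between the LCS and the insert/delete distance can be illustrated with a short sketch. The names `lcs_length` and `indel_distance` are assumptions made for this example. The code keeps only two rows of the table, which gives the length of the LCS in linear space; it does not recover the subsequence itself, which is what Hirschberg's method achieves.

```python
def lcs_length(a, b) -> int:
    """Length of a longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(a, b) -> int:
    """Minimum number of insertions and deletions turning a into b."""
    # Every symbol outside a longest common subsequence is either
    # deleted from a or inserted from b.
    return len(a) + len(b) - 2 * lcs_length(a, b)


print(lcs_length("ABCBDAB", "BDCABA"))      # 4
print(indel_distance("ABCBDAB", "BDCABA"))  # 5
```

Because the functions only compare elements for equality, they accept lists of lines as well as strings, which is how a file comparison would use them.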

2.1.4 Timeline

In this section we present a timeline of some of the major exact and approximate string matching algorithms.

[30] 1970 → [25] 1971 → [65] 1974 → [27] 1975 → [3] 1976 → [9, 34] 1977 → [5] 1986 → [31] 1987 → [15] 1994


2.2 Approximate Matching in Trees and Graphs

Approximate pattern matching of complex structures such as trees and graphs is a primitive operation essential to applications in information retrieval, information integration and mediation, and in many such domains that require evaluating or characterizing the similarity between structured and complex objects such as HTML documents, molecular compounds and XML data.

A tree is a special kind of graph: a connected graph which contains no cycles. We can visualize a tree by drawing it with a root at the top, with the vertices below leading to the leaves at the lowest level. If the vertices are placed on levels, higher level vertices are referred to as the parents of the vertices directly below them, while the lower vertices are similarly referred to as their children. A tree with n vertices has n − 1 edges. Although maybe not part of the widest definition of a tree, a common constraint is that no vertex can have more than one parent. Moreover, for some applications, it is necessary to consider a vertex's child vertices to be an ordered list, instead of merely a set. As a data structure in computer programs, trees are used in everything from B-trees in databases and file systems, to game trees in game theory, to syntax trees in human or computer languages.

An ordered tree is a tree in which the relative order of the subtrees meeting at each vertex must be preserved, i.e., the left-to-right order of the children of every vertex matters.

Figure 2.3: Ordered Tree Example

Figure 2.3 shows an ordered tree in which each vertex is labeled with a character, with the vertex's postorder number shown next to it.

An unordered tree is a tree in which the relative order of the subtrees meeting

at each vertex need not be preserved, i.e., the left-to-right order of the children of

every vertex does not matter.

Figure 2.4: Unordered Trees Example

Figure 2.4 shows two vertex-labeled trees which are exactly the same if they are considered to be unordered. The left-to-right ordering of the children does not matter in the unordered case.

Given a tree, it is usually convenient to use a numbering scheme to refer to the vertices of the tree. For an ordered tree T, the left-to-right postorder numbering or the left-to-right preorder numbering is often used to number the vertices of T from 1 to |T|, the size of the tree T. For an unordered tree, we can fix an arbitrary order for the children of each vertex in the tree and then use left-to-right postorder numbering or left-to-right preorder numbering. Suppose that we have a numbering for each tree. Let t[i] be the i-th vertex of tree T in the given numbering. We use T[i] to denote the subtree rooted at t[i]. Let t[i_1], t[i_2], ..., t[i_{n_i}] be the children of t[i]. The interesting property of the postorder numbering scheme is that the children of a parent are always assigned a number lower than that of the parent. This fits in perfectly, as in many dynamic programming algorithms it is crucial for the children to be processed before the parent.
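The postorder property can be checked on a small example. The sketch below assumes a tree encoded as a pair (label, list of children); this encoding, the function name `postorder_number` and the example tree are illustrative choices and are not taken from the figures.

```python
def postorder_number(tree):
    """Return rows (number, label, child_numbers) in left-to-right postorder.

    Numbers run from 1 to |T|.
    """
    rows = []

    def visit(node):
        label, children = node
        # Children are visited, and therefore numbered, before their parent.
        child_numbers = [visit(child) for child in children]
        number = len(rows) + 1
        rows.append((number, label, child_numbers))
        return number

    visit(tree)
    return rows


tree = ("a", [("b", [("c", []), ("d", [])]), ("e", [])])
for number, label, child_numbers in postorder_number(tree):
    print(number, label, child_numbers)
# 1 c []
# 2 d []
# 3 b [1, 2]
# 4 e []
# 5 a [3, 4]
```

Processing the rows in increasing order therefore handles every child before its parent, which is the order a bottom-up dynamic programming algorithm needs.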

Edit Operations in Trees and Graphs

There are three kinds of edit operations in trees and graphs:

1. relabel: Relabeling a vertex x means changing the label on x.

2. delete: Deleting a vertex x means making the neighbors of x (except an arbitrarily specified neighbor x′) become the neighbors of x′ and then removing x.

3. insert: Inserting a vertex x is the complement of deleting x.

Figure 2.5: Delete and Insert Edit operations on Trees, with panels (a) Delete vertex and (b) Insert vertex

The neighbors of a character s in a string are the characters s′ and s″ on either side of s, where s′ and/or s″ may be empty. A neighbor of a vertex x in a tree or a graph is any vertex x′ that is directly connected to x by a single edge.

Although we are only concerned with inserts and deletes for our algorithms, we describe the relabeling operation for the sake of completeness.
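As an illustration of the delete operation on a vertex, the following sketch stores an undirected graph as a dictionary mapping each vertex to the set of its neighbors. The representation, the function name `delete_vertex` and the example graph are assumptions made for this example only.

```python
def delete_vertex(adj, x, x_prime):
    """Delete x from an undirected graph stored as {vertex: set_of_neighbors}.

    Every neighbor of x other than x_prime becomes a neighbor of x_prime,
    and x is then removed.
    """
    if x_prime not in adj[x]:
        raise ValueError("x_prime must be a neighbor of x")
    for n in adj[x] - {x_prime}:
        adj[n].discard(x)      # drop the edge n - x
        adj[n].add(x_prime)    # add the edge n - x_prime
        adj[x_prime].add(n)
    adj[x_prime].discard(x)
    del adj[x]


# A small tree: a is adjacent to b and e, b is adjacent to c and d.
adj = {
    "a": {"b", "e"},
    "b": {"a", "c", "d"},
    "c": {"b"},
    "d": {"b"},
    "e": {"a"},
}
delete_vertex(adj, "b", "a")
print(sorted(adj["a"]))  # ['c', 'd', 'e']
```

In a rooted tree the natural choice for the specified neighbor is the parent of the deleted vertex, so that its children are promoted to the parent.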

Edit Distance between Trees/Graphs

Given two graphs G1 and G2, there are several methods of performing approximate pattern matching between the two structures. One way is to measure the edit distance, i.e., the minimum cost of transforming one structure into the other, quite often through a series of edit operations, i.e., deletion of a vertex in G1, insertion of a vertex in G2 and the relabeling of a vertex in G2 with the label of a vertex in G1. The edit distance between any two graphs G1 and G2 is denoted by δ(G1, G2). Each edit operation can be assigned a numeric cost (not necessarily distinct). The edit distance is in fact a distance metric as it satisfies the basic rules of symmetry and triangle inequality.


2.2.2 Problem Definition

Approximate tree matching is a generalization of approximate string matching. Given two trees, we view one tree as the pattern tree and the other as the data tree. The idea is to match approximately the pattern tree to the data tree. Given two trees T1 and T2, the task of transforming T1 into T2, or T2 into T1, via a sequence of edit operations is termed the problem of approximate pattern matching in trees.

Several definitions and algorithms have been given for the approximate matching of graphs and trees. They correspond to different data structures (ordered trees, unordered trees, graphs, etc.), different notions of similarity or distance, and different constraints. Tai (1979) [60] was one of the first authors to work on the topic of approximate pattern matching of trees. He gave the definition of the edit distance between ordered, labeled trees and the first non-exponential algorithm to compute it. Tai used a preorder numbering scheme to number the trees. The convenient aspect of this notation is that for any i, 1 ≤ i ≤ |T|, the vertices t[1] to t[i] form a tree rooted at t[1]. He incorporated the same approach as sequence editing and came up with an algorithm that runs in O(|T1| · |T2| · depth(T1)^2 · depth(T2)^2) time and space. Lu (1979) [40] presents another algorithm on ordered trees based on the edit operations presented by Tai. Let t1[i_1], t1[i_2], ..., t1[i_{n_i}] be the children of t1[i] and t2[j_1], t2[j_2], ..., t2[j_{n_j}] be the children of t2[j]. Lu considers the following three cases: (1) t1[i] is deleted, in which case the distance is obtained by matching T2[j] to one of the subtrees of t1[i] and then deleting all the rest of the subtrees; (2) t2[j] is inserted, in which case the distance is obtained by matching T1[i] to one of the subtrees of t2[j] and then inserting all the rest of the subtrees; (3) t1[i] matches t2[j], in which case the subtrees rooted at t1[i_1], t1[i_2], ..., t1[i_{n_i}] and at t2[j_1], t2[j_2], ..., t2[j_{n_j}] are considered as two sequences, with each individual subtree treated as a whole entity. He then uses the sequence edit distance to determine the distance between these two sequences of subtrees. This algorithm considers each subtree as a whole entity. It does not allow one subtree of T1 to map to more than one subtree of T2.
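The three cases can be turned into a small recursive sketch. It follows the case analysis as summarized above and is not a transcription of Lu's published procedure. Unit costs, the encoding of a tree as a hashable pair (label, tuple of children) and the names `size`, `forest_distance` and `tree_distance` are all assumptions made for this illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def size(tree):
    """Number of vertices of a tree given as (label, tuple_of_children)."""
    _, children = tree
    return 1 + sum(size(c) for c in children)


@lru_cache(maxsize=None)
def forest_distance(xs, ys):
    """Sequence edit distance between two tuples of subtrees.

    Each subtree is a whole entity: it is deleted, inserted, or matched
    against exactly one subtree of the other sequence.
    """
    n, m = len(xs), len(ys)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),   # delete a whole subtree
                d[i][j - 1] + size(ys[j - 1]),   # insert a whole subtree
                d[i - 1][j - 1] + tree_distance(xs[i - 1], ys[j - 1]),
            )
    return d[n][m]


@lru_cache(maxsize=None)
def tree_distance(t1, t2):
    (label1, kids1), (label2, kids2) = t1, t2
    # Case 3: the roots are matched (relabeled if the labels differ).
    best = (label1 != label2) + forest_distance(kids1, kids2)
    # Case 1: the root of t1 is deleted, t2 is matched to one subtree
    # of t1 and all the other subtrees are deleted.
    for k in kids1:
        best = min(best, 1 + tree_distance(k, t2) + size(t1) - 1 - size(k))
    # Case 2: the root of t2 is inserted (symmetric to case 1).
    for k in kids2:
        best = min(best, 1 + tree_distance(t1, k) + size(t2) - 1 - size(k))
    return best


leaf_b, leaf_c = ("b", ()), ("c", ())
t1 = ("a", (leaf_b, leaf_c))
t2 = ("a", (leaf_b,))
print(tree_distance(t1, t2))      # 1
print(tree_distance(leaf_b, t2))  # 1
```

The memoization through `lru_cache` plays the role of the dynamic programming table, so each pair of subtrees is evaluated once.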

Kilpelainen and Mannila (1995)[32] introduced the tree inclusion problem: given a pattern tree P and a target tree T, tree inclusion asks whether it is possible to obtain P by strictly deleting vertices of T. Both ordered and unordered trees are considered. Since there may be exponentially many ordered embeddings of P into T, their algorithm, assuming that the roots of P and T have the same label, tries to embed P into T by embedding the subtrees of P as deeply and as far to the left as possible. The time complexity of their algorithm is O(|P| · |T|), and they showed that the unordered inclusion problem is NP-Complete.

Shasha and Zhang (1990, 1991) [57, 71] define an edit distance for ordered labeled trees and propose algorithms for its computation. Jiang, Wang and Zhang (1994)[29] address the problem of approximate matching in ordered labeled trees by inserting empty vertices to align the structure of ordered labeled trees. The distance is defined as the sum of the scores of the opposing labels after the structurally similar graphs are overlaid. Shasha et al (1994) [56] propose several enumerative algorithms for the approximate matching of unordered labeled trees. The algorithms are based on probabilistic hill climbing and bipartite graph matching [10, 22]. Luccio and Pagli (1991) [41] present algorithms for the approximate matching of H-ary trees and arbitrary ordered trees. Wang, Zhang and Chirn (1994) [67] define an edit distance for labeled graphs and describe an algorithm they call inexact graph matching in terms of finding a minimal transformation cost between the graphs. Vilares, Ribadas and Grana (2001) [63] present a proposal intended to demonstrate the applicability of tabulation techniques for detecting approximately common patterns when dealing with structures sharing some common parts, based on approximate pattern matching of two ordered labeled trees. Finally, Zhang, Wang and Shasha (1995) [72] introduce the notion of degree-2 constraint to define an edit distance between undirected acyclic graphs and propose algorithms for its computation.

in a text file is a very common operation in many applications ranging from text editors and databases to applications in molecular biology.

Following [1], regular expressions and the strings they match can be summed up recursively as follows:


1. |, (, ) and * are metacharacters.

2. A non-metacharacter a is a regular expression that matches the string a.

3. If r1 and r2 are regular expressions, then (r1|r2) is a regular expression that matches any string matched by either r1 or r2.

4. If r1 and r2 are regular expressions, then (r1)(r2) is a regular expression that matches any string of the form xy, where r1 matches x and r2 matches y.

5. If r is a regular expression, then (r)* is a regular expression that matches any string of the form x_1 x_2 ... x_n, n ≥ 0, where r matches x_i for 1 ≤ i ≤ n. (r)* also matches the empty string, represented by ε. * is also known as the Kleene Closure operator.

6. If r is a regular expression, then (r) is a regular expression that matches the same string as r.

The notation of regular expressions arises naturally from the mathematical result of Kleene [33] that characterizes the regular sets as the smallest class of sets of strings which contains all finite sets of strings and which is closed under the operations of union, concatenation and “Kleene Closure”.

Given a string S and a regular expression R, the problem of approximate matching of a string and a regular expression is to find a string S_R ∈ L(R), where L(R) is the language defined by the regular expression R, such that the difference (edit distance) between S and S_R is the least.

Formally, ∆(R, S) = min_{S_R ∈ L(R)} δ(S, S_R).
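The definition can be made concrete with a brute-force sketch that enumerates the strings of L(R) up to a length bound and keeps the one nearest to S. It is meant only to illustrate the definition and is not one of the algorithms surveyed in this chapter. The tuple encoding of regular expressions ("sym", "alt", "cat", "star"), the function names and the length bound `max_len` are assumptions introduced for this example.

```python
def edit_distance(s, t):
    """Unit-cost edit distance, keeping two rows of the matrix."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def language(r, max_len):
    """All strings of length <= max_len matched by the expression r."""
    kind = r[0]
    if kind == "sym":                      # a non-metacharacter
        return {r[1]} if len(r[1]) <= max_len else set()
    if kind == "alt":                      # (r1|r2)
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "cat":                      # (r1)(r2)
        left, right = language(r[1], max_len), language(r[2], max_len)
        return {x + y for x in left for y in right if len(x) + len(y) <= max_len}
    if kind == "star":                     # (r)*
        base = language(r[1], max_len) - {""}
        words, frontier = {""}, {""}
        while frontier:                    # strings grow until max_len is reached
            frontier = {x + y for x in frontier for y in base
                        if len(x) + len(y) <= max_len} - words
            words |= frontier
        return words
    raise ValueError(f"unknown expression kind: {kind}")


def approximate_match(r, s, max_len):
    """Return (distance, nearest string) over the bounded language, or None."""
    return min(((edit_distance(s, w), w) for w in language(r, max_len)),
               default=None)


# R = (a|b)*c
R = ("cat", ("star", ("alt", ("sym", "a"), ("sym", "b"))), ("sym", "c"))
S = "abd"
print(approximate_match(R, S, max_len=2 * len(S) + 1))  # (1, 'abc')
```

The enumeration is exponential in the length bound, which is why practical algorithms work on an automaton for R instead of on an explicit list of its strings.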


2.3.3 Algorithms

The bulk of the research on regular expressions has been done on checking to see if an input string belongs to a regular expression and in some cases how “far” the string is from being a member of the regular expression. Waterman (1984)[68] reviews several mathematical methods for comparison of nucleic acid sequences. He discusses the problem of comparison of several sequences, which is a slight simplification of the regular expression matching problem. Thompson (1968) [61] describes a regular expression recognition technique where each character in the text to be searched is examined in sequence against a list of possible current characters. During the examination a new list of all possible next characters is built. When the end of the current list is reached, the new list becomes the current list, the next character is obtained and the process continues. Wagner (1974) [64] presents an error correction algorithm that acts as a preprocessor of sorts: it accepts the possibly illegal source string α and translates it into a guaranteed syntactically legal string, namely a string B belonging to a given regular language L which is “nearest” (in number of edit operations) to α. Knight et al (1995) [32] delve into the problem of approximate pattern matching of regular expressions with concave gap penalties and present an O(MP(log M + log^2 P)) algorithm for its computation, where M and P are the sizes of the input string and the regular expression respectively. The concave gap penalty scheme is a symbol-independent gap-cost model where the cost of the gap is solely a function of its length. Myers (1992) [44] presents an O(PN / log N) algorithm, where P is the length of a regular expression R and N is the length of the word A, to determine if A is in the language denoted by R. The algorithm is based on a log N speedup of the standard O(PN) time simulation of R's NDFA on A using a combination of node-listing and “Four-Russians” [6] paradigms. Eppstein et al (1993) [21] look into the problem of sequence alignment and the prediction of RNA secondary structure and present a common solution based on a common structure which can be expressed as a system of dynamic programming recurrence equations. Myers et al (1989)[45] present an algorithm to find a sequence matching a regular expression R whose optimal alignment with A is the highest scoring of all such sequences in O(MN) time, where M and N are the lengths of A and R respectively.

In this section, we present a timeline of some of the popular algorithms on approximate matching of regular expressions.

[61] 1968 → [64] 1974 → [68] 1984 → [45] 1989 → [44] 1992 → [21] 1993 → [32] 1995


Approximate matching of trees and graphs under the degree-1 constraint

Approximate pattern matching of complex structures such as trees and graphs is a primitive operation essential to applications in information retrieval, information integration and mediation, and in many such domains that require evaluating or characterizing the similarity between structured and complex objects such as HTML documents, molecular compounds and XML data. For example, as more and more autonomous organizations produce and exchange XML data, the XML documents interchanged would be prone to spelling errors, syntactic or structural discrepancies as well as other syntactic or semantic differences. RDF [50] descriptions can be represented as acyclic graphs. The ability to evaluate the similarity between two documents is the basis for search and integration. Approximate pattern matching can also be used in the area of schema matching. Automating schema matching remains one of the challenging tasks in semi-structured data research. Schema matching is a key operation for many applications including data integration, schema integration and semantic query processing.

Approximate pattern matching in trees plays a very important part in Information Extraction techniques. The traditional approach for extracting data from Web sources is to write specialized programs, called wrappers, that identify data of interest and map them to some suitable format, for instance, XML or relational tables. There are several existing approaches to web data extraction. One of the first initiatives for addressing the problem of wrapper generation was the development of languages specially designed to assist users in constructing wrappers. Some of the best known tools to adopt this approach are Minerva [17] and TSIMMIS [24]. Some tools, like W4F[53] and XWRAP[38], rely on the inherent structural features of HTML documents for accomplishing data extraction by converting an HTML document into a parsing tree, a representation that reflects its tag hierarchy. There exist tools, like RAPIER [11] and WHISK[59], which take advantage of Natural Language Processing (NLP) techniques such as filtering, part-of-speech tagging and lexical semantic tagging to build relationships between phrases and sentence elements so that extraction rules can be derived. Other tools rely solely on formatting features that implicitly depict the structure of the pieces of data found, which makes them more suitable for HTML documents. Examples of such tools are WIEN[36] and STALKER[43]. More information on the variety of wrappers available today can be found in the survey of web data extraction tools by Laender et al [37].

Several tools [8, 46, 26] are available to assist users in tracking when web pages have changed. Liu, Pu and Tang (2000) [39] present WebCQ, a prototype system for large-scale web information monitoring and delivery. The WebCQ system consists of four main components: a change detection robot that discovers and detects changes, a proxy cache service that reduces communication traffic to the original information servers, a personalized presentation tool that highlights changes detected by WebCQ sentinels, and a change notification service that delivers fresh information to the right users at the right time. The change detection and summarization phases make use of a scheme which merges the two documents (before and after change) by summarizing all the common, new and deleted materials in one document, as is done in HTMLDiff and the UNIX diff command [62].

Automatic schema matching has become one of the key areas of research in the field of computer science due to the rapidly increasing number of web data sources and E-businesses to integrate. Most work on schema matching has been motivated by schema integration [47, 58]: given a set of independently developed schemas, construct a global view. Schema matching is also useful in applications being considered for the semantic web [7], such as mapping messages between autonomous agents. A somewhat different scenario is semantic query processing [66, 52], a run-time scenario where a user specifies the output of a query and the system figures out how to produce that output.

A significant amount of work has been done on comparison of conceptual graphs representing knowledge elements. In [69, 70], the authors address the task of approximate matching of knowledge elements and present an algorithm for their comparison by measuring the similarity between two texts represented as conceptual graphs. Change detection and monitoring techniques for web pages on the internet have been around for some time now and are constantly evolving based on different constraints. There exist commercial tools [8] which inform users when web pages are changed. In [39], the authors present WebCQ, a prototype system for large-scale Web information monitoring and delivery. It is designed to discover and detect changes to web pages efficiently and to provide a personalized notification of what has changed in the web pages of interest and how. In [49] the authors show the feasibility of automatically extracting data from web pages by using approximate matching techniques. This can be applied to generate automatic wrappers or to notify/display web page differences, web page change monitoring, etc. In [51] the authors present an approach which collects a couple of example objects from the user and uses this information to extract new objects from semi-structured data from web sources. In each of the technologies mentioned above, approximate matching of complex structures such as trees and graphs forms an integral part.

There are several methods of performing approximate pattern matching between two or more structures. One way is to measure the edit distance, i.e., the cost of transforming one structure into the other, quite often through a series of edit operations.

Depending on the requirements of the application and the type of the distance measure required, various constraints can be placed on the calculation of edit distance. For instance, HTML or XML documents share the property that the actual values carrying the information are most often at the leaves of the tree, while inner vertices represent the structural component of the document. Therefore one could require that they only be modified at the leaves. This is the concept of the degree-1 constraint presented in this chapter.

In this regard, we shall focus on finding the edit distance between two complex structures such as ordered and unordered trees and acyclic graphs, under the degree-1 constraint. Under this constraint, edit operations can only be performed at the leaf level of the tree or at the periphery of a graph. The work in [72] addresses the problem of comparing connected, undirected, acyclic and labeled graphs (CUAL graphs). In view of the challenge associated with the problem of finding the edit distance between two CUAL graphs, proven to be NP-Complete, they propose a constrained distance metric, called the degree-2 distance, which requires that any node to be inserted or deleted have no more than two neighbors. Their algorithm runs in time O(N1 N2 D^2) and in O(N1 N2 D √D log D), where D = min{d1, d2} and d_i is the maximum degree of G_i. The degree-1 constraint we describe in this text also serves to simplify the problem of finding the edit distance between two CUAL graphs. We argue the relevance of such a constraint, which requires that edit operations can only be performed at the leaf level of a tree or at the periphery of a graph, in practical situations.

We describe the concept of edit distance under the degree-1 constraint in Section 3.1. In Section 3.2, we present three algorithms to evaluate the edit distance between the complex structures under the degree-1 constraint. In Section 3.3 we analyze the time complexity of the three algorithms. A simple example is presented in Section 3.4. We then conclude this chapter with a summary in Section 3.5.

3.1 The degree-1 Constraint

There are not only different notions of similarity, distance, and approximate matching corresponding to different data structures, but there are also different such notions for the same data structure corresponding to different needs. The respective efficiencies of the algorithms computing the unit weight edit distance, the edit distances defined under the degree-2 constraint, the edit distance we propose under the degree-1 constraint and others are not comparable, since the notions correspond to different needs for different applications. Their effectiveness can only be discussed in light of the requirements of the application.

The degree of a vertex x in a CUAL structure is defined to be the number of vertices directly connected to x by means of an edge. Since the algorithms presented in this text are primarily concerned with trees and acyclic graphs, the definition of degree does not allow for self loops and multiple edges between two vertices. Before describing the degree-1 constraint we first clarify the notion of edit distance and edit operations.
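The notion of degree, and the way a degree constraint restricts where edits may occur, can be illustrated as follows. The sketch assumes a graph stored as a dictionary mapping each vertex to the set of its neighbors, and it reads a degree-k constraint as "a vertex may be inserted or deleted only if it has at most k neighbors". Both the representation and the name `editable_vertices` are assumptions made for this example.

```python
def degree(adj, x):
    """Number of vertices directly connected to x by an edge."""
    return len(adj[x])


def editable_vertices(adj, k):
    """Vertices that may be deleted under a degree-k constraint,
    read here as: at most k neighbors."""
    return {x for x in adj if degree(adj, x) <= k}


# A small tree: a is adjacent to b and e, b is adjacent to c and d.
adj = {
    "a": {"b", "e"},
    "b": {"a", "c", "d"},
    "c": {"b"},
    "d": {"b"},
    "e": {"a"},
}
print(sorted(editable_vertices(adj, 1)))  # ['c', 'd', 'e']
print(sorted(editable_vertices(adj, 2)))  # ['a', 'c', 'd', 'e']
```

Under the degree-1 reading only the leaves c, d and e can be edited, whereas the degree-2 reading also admits the vertex a, which has two neighbors.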

Edit distance is defined to be the minimum number of edit operations required to transform one structure to another, be it a string, a tree or a graph. There are three kinds of edit operations in trees and graphs: