Efficient processing of multiple XML twig queries



EFFICIENT PROCESSING OF MULTIPLE XML TWIG QUERIES

LIU HUANZHANG

NATIONAL UNIVERSITY OF SINGAPORE

2007


Efficient Processing of Multiple XML Twig Queries

Liu Huanzhang

(B.Eng., Renmin University of China)

A THESIS SUBMITTED

FOR THE DEGREE OF MASTER OF SCIENCE

DEPARTMENT OF COMPUTER SCIENCE

NATIONAL UNIVERSITY OF SINGAPORE

2007


I would like to express my sincere gratitude to my supervisor, Prof. Ling Tok Wang, for his guidance, stimulating suggestions, and patience. His advice, insights and comments have helped me tremendously throughout my master years.

I would like to express my gratitude to all those who gave me the possibility to conduct this piece of research and complete this thesis. I also want to thank the Department of Computer Science of the National University of Singapore for the strong support for my research work.

Lastly, I would like to thank my family and all the friends in Singapore and China for their understanding and support for my research work.


List of tables

1.1 XML and XML query processing
1.2 Motivation and Objective
1.3 Contributions
1.4 Thesis Organization

2 Literature Review
2.1 Twig Pattern Query
2.2 XML Indexing and Labeling
2.3 XML Filtering
2.4 Multiple XML queries processing
2.5 Summary

3 Preliminaries
3.1 XML Data Model
3.2 Twig Pattern and Twig Pattern Matching
3.3 Holistic Twig Join
3.4 Problem Statement

4 Utilizing Commonalities for Multiple Twigs
4.1 Defining Super-twig
4.1.1 Definitions
4.1.2 The differences between normal twig and Super-twig
4.1.3 The properties of Super-twig pattern
4.2 Constructing Super-twig
4.2.1 Implementing the Super-twig Structure
4.2.2 Algorithm for Constructing Super-twig
4.3 Conclusion

5 Processing Super-Twig Queries
5.1 Overview of the Architecture of Multiple Queries Processing System
5.2 The Index Structure for Parsed XML Data
5.3 Multiple Twig Queries Matching
5.3.1 Data Structure and Notations
5.3.2 The MTwigStack Algorithm
5.4 Conclusion

6 Experimental Evaluation
6.1 Experimental Setup
6.1.1 XML Documents
6.1.2 Query Sets
6.1.3 Metrics
6.2 Experimental results
6.2.1 MTwigStack vs TwigStack


This thesis studies the problem of efficiently processing multiple XML twig queries. We propose a new structure to represent multiple twig patterns. We also design a novel algorithm to process multiple twig queries on an XML document simultaneously.

XML has emerged as the standard for representing and exchanging electronic data on the Internet. Recently, with more and more data being represented and exchanged as XML documents over the Internet, attention has turned to XML query processing. Queries in XML query languages typically specify patterns of selection predicates on multiple elements that have some specified tree-structured relationships, as the basis for matching XML documents. Finding all occurrences of a twig pattern in an XML document is a core operation for XML query processing. The emergence of XML as a common mark-up language for data interchange has also spawned great interest in techniques for filtering and content-based routing of XML data.

We find that multiple twig queries against an XML database usually have many similarities. This inspires us to process multiple twig patterns simultaneously by sharing the computation of their common structure.

We propose a new twig structure, called the super-twig, to represent multiple twig patterns. The super-twig is a combination of multiple twig queries and contains all nodes appearing in the queries. To distinguish it from a simple twig pattern, OptionalNode and OptionalLeafNode are defined. We also introduce optional parent-child and optional ancestor-descendant relationships. An algorithm is designed for constructing the super-twig. Our experimental results show that the construction cost is acceptable and grows linearly with the number of queries.

In this thesis, we use the region encoding scheme to label XML data. We also design a two-tier B+-tree index to store the labeled XML data. Using the index structure, we can process a super-twig with repeated tag names.

Based on the super-twig and index structure, we develop a new multiple twig queries processing algorithm, namely MTwigStack. With the algorithm, we can find all matches of multiple twig queries simultaneously. The experimental results show that our method is more efficient than other existing techniques when processing multiple twig queries with high similarities.

6.1 Characteristics of six XMark data sets
6.2 Characteristics of TreeBank data set
6.3 The time of computing the super-twig and processing it on 32K XMark with ratio intermediatePaths being 3

1.1 A fragment of an XML document
1.2 A twig pattern
1.3 Three twig queries (a,b,c) with high similarity and super twig query (d)
2.1 XPath queries and their prefix tree
2.2 XPath queries and their prefix tree
3.1 An example XML tree with region codes
3.2 A twig pattern p and its subpatterns sp_B and sp_C
4.1 Four twig patterns and their super-twig
4.2 An XML document fragment
4.3 An example for OptionalNode
4.4 Four twig patterns and their super-twig
4.5 The scenario of one node appearing as both OptionalNode and OptionalLeafNode
4.6 The super-twig structure for the twig queries in Figure 4.1
4.7 The scenarios in the construction of super-twig
5.1 Overview of a multiple queries processing system
5.2 An XML document and SAX example
5.3 The two-tier B+-tree index for the document shown in Figure 4.2
5.4 Cursors and stacks during execution
5.5 Possible scenarios in the execution of MTwigStack
5.6 Illustration to MTwigStack
6.1 The execution of constructing the super-twig
6.2 Execution time on 2M XMark data with 10 queries
6.3 MTwigStack vs TwigStack on XMark with 10 queries
6.6 MTwigStack vs TwigStack on TreeBank with different numbers of queries
6.7 MTwigStack vs Index-Filter on XMark with 10 queries
6.10 MTwigStack vs Index-Filter on TreeBank with different numbers of queries
6.11 MTwigStack vs Index-Filter on 2M XMark data with the ratio of intermediate paths being 3


XML is the abbreviation for eXtensible Markup Language. XML is a simple, very flexible text format derived from SGML (Standard Generalized Markup Language). It employs a tree-structured model to represent data. Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere [4].

Recently, with more and more data being represented and exchanged as XML documents over the Internet, attention has turned to XML query processing. XPath [10] is a simple but popular language for navigating XML documents and extracting information from them. XPath is also used as a sub-language of other XML query languages such as XQuery [11]. Because the language is so widely used, a lot of work has been done to speed up the evaluation of XPath queries, such as index techniques [16, 24, 42, 34], structural join algorithms [8, 13, 29, 39, 59] and minimization of XPath queries [23].

An XPath expression can be represented graphically by means of a twig pattern, with structural relationships between nodes and selection predicates on multiple elements for matching XML documents. Twig pattern matching has been identified as a core operation in querying tree-structured XML data. The traditional XML query processing scenario involves asking a single query against an XML document. The goal here is to identify all matches to the input query in the XML document.

Figure 1.1: A fragment of an XML document

For example, consider the document shown in Figure 1.1, containing some information about a collection of books, and the query “find the titles of all the books for which the author’s first name is ‘Jane’”. This query can be formulated with the XPath expression //book[//author/fn=‘Jane’]/title. This expression is equivalent to the twig pattern shown in Figure 1.2. The edge represented with a double line between book and author corresponds to the symbol ‘//’ in the original expression and is called an ancestor-descendant (A-D) edge, which indicates that author must appear as a descendant of book in the XML document; the edge represented with a single line between author and fn corresponds to the symbol ‘/’ in the original expression and is called a parent-child (P-C) edge, which indicates that fn must appear as a child of author in the XML document. The answer to an XPath query is built by matching the twig pattern representing the query.
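To make the A-D and P-C edges of this twig concrete, the following minimal sketch evaluates the same query by hand with Python's standard xml.etree.ElementTree. The sample document (including the allauthors wrapper and the ln elements) and the helper name titles_by_author_first_name are assumptions made for illustration; they are not the content of Figure 1.1.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the tags book, title, author and fn
# come from the running example.
DOC = """
<bib>
  <book>
    <title>XML</title>
    <allauthors>
      <author><fn>Jane</fn><ln>Poe</ln></author>
      <author><fn>John</fn><ln>Doe</ln></author>
    </allauthors>
  </book>
  <book>
    <title>Java</title>
    <allauthors>
      <author><fn>Jack</fn><ln>Lee</ln></author>
    </allauthors>
  </book>
</bib>
"""


def titles_by_author_first_name(root, first_name):
    """Evaluate the twig book[//author/fn = first_name]/title."""
    titles = []
    for book in root.iter("book"):
        # A-D edge: author may sit at any depth below book (iter).
        # P-C edge: fn must be a direct child of author (findall).
        has_author = any(
            fn.text == first_name
            for author in book.iter("author")
            for fn in author.findall("fn")
        )
        if has_author:
            # P-C edge: title must be a direct child of book.
            titles.extend(t.text for t in book.findall("title"))
    return titles


root = ET.fromstring(DOC)
result = titles_by_author_first_name(root, "Jane")
print(result)
```

The nested loops mirror the twig shape directly: one loop per query node, with the choice between iter and findall encoding whether the edge is A-D or P-C.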

Figure 1.2: A twig pattern

Moreover, the emergence of XML as a common markup language for data interchange has also spawned significant interest in techniques for filtering and content-based routing of XML data. In an XML filtering system, continuously arriving streams of XML documents are passed through a filtering engine that matches the documents against queries and routes, and the matched documents are distributed to the corresponding queries and routes. There have been a number of efforts to build efficient large-scale XML filtering systems, e.g., XFilter [9], XTrie [15], YFilter [20], and Index-Filter [12].


1.2 Motivation and Objective

In a large system, where many XML queries are issued against an XML database, we expect the queries to have many similarities. In traditional database systems, there are many studies on efficient processing of similar queries using batch-based processing. This inspires us to use a similar technique for twig pattern query processing. Since twig pattern matching is an expensive operation, it would save a lot in terms of both CPU cost and I/O cost if we could group hundreds of similar twig pattern queries together and access the data file only once to get all the results.

Figure 1.3: Three twig queries (a,b,c) with high similarity and super twig query (d)

For example, consider the three twig queries in Figure 1.3. The main structures of these three patterns are the same. They all query book elements which have a child element and a descendant author element. Figure 1.3 (a) identifies book elements which have a title with value “XML” and an author element as a descendant. Figure 1.3 (b) identifies book elements which have a title as a child and whose author’s first name (fn) is “Jane”. Figure 1.3 (c) is similar to (b), but it requires that the title value is “XML”. We can combine these three queries into one twig pattern by: (i) sharing their common prefixes (e.g., root node book, element nodes title and author); (ii) taking the union of their different parts (e.g., value “XML”, element fn, and value “Jane”), as shown in Figure 1.3 (d).

The twig pattern in Figure 1.3 (d) is a new structure that we propose to represent these twig queries; it will be introduced in Chapter 4. If we design a method that processes the twig pattern in Figure 1.3 (d) to obtain the results of the twig queries in Figure 1.3 (a), (b) and (c), then we only need to scan each of the book, title and author element lists once.

Furthermore, in a filtering system or content-based routing system, queries and user profiles are usually expressed as XPath expressions. These systems only identify the query expressions for which a match exists in the input XML document and disseminate the input XML data to the users who posted those queries. They do not find all matches for each query. Hence, users have to scan the incoming XML documents again to obtain the exact information.

The work we present in this thesis is motivated by batch query processing in relational databases and by the processing of multiple queries in XML filtering systems. We try to identify query commonalities and combine multiple similar queries into a single structure, which we call the super-twig. The results returned by the super-twig contain the results of all the given queries.

We observe that in the recent development of twig pattern queries, TwigStack [13] has been identified as an effective approach. We propose a new algorithm based on TwigStack, called MTwigStack, to find all occurrences of the super-twig pattern in an XML document. The matching fragments are then distributed to the corresponding twig queries. This algorithm ensures that super-twig matching scans each XML element at most once, and as few elements as possible, thus significantly reducing both CPU cost and I/O cost compared to the naïve approach, which invokes the TwigStack algorithm once for each individual twig query, i.e., scans each XML element N times if the element tag appears in N twig queries.
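The saving can be illustrated with a small counting sketch for the three queries of Figure 1.3. It only counts how many times a per-tag element stream would be opened, under the simplifying assumption that each query node corresponds to one stream; it is not a cost model of MTwigStack, and the function name stream_scans is hypothetical.

```python
from collections import Counter

# Node names of the three twig queries in Figure 1.3, written as sets.
queries = {
    "a": {"book", "title", "XML", "author"},
    "b": {"book", "title", "author", "fn", "Jane"},
    "c": {"book", "title", "XML", "author", "fn", "Jane"},
}


def stream_scans(queries):
    """Return (naive, shared) stream-scan counts.

    naive:  one scan per (query, node) pair, as when the queries
            are evaluated one after another.
    shared: one scan per distinct node name, as when the queries
            are combined into a single pattern.
    """
    per_tag = Counter(tag for nodes in queries.values() for tag in nodes)
    naive = sum(per_tag.values())
    shared = len(per_tag)
    return naive, shared


naive, shared = stream_scans(queries)
print(naive, shared)
```

For these three queries the book, title and author streams are each opened three times in the one-query-at-a-time setting but only once in the combined setting.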

1.3 Contributions

Motivated by the recent success in efficient processing of multiple XML queries, we present in this thesis a novel algorithm, called MTwigStack, to process multiple twig queries simultaneously. The contributions of this thesis can be summarized as follows:

• We review work on optimizing the evaluation of XPath queries, including index techniques, structural join algorithms and minimization of XPath queries; we also review XML filtering systems and multiple queries processing techniques.

• We introduce a new concept, called the super-twig, which combines multiple twig queries into just one twig pattern. The super-twig contains all nodes appearing in the queries, and the edges between any two nodes of the super-twig represent the original relationships between the two nodes in the queries.

• We give the properties of the super-twig and present the structure for implementing it. We design the algorithm for constructing the super-twig pattern.

• Based on the super-twig, we develop a new multiple twig queries processing algorithm. With the algorithm, we can find all matches of multiple twig queries simultaneously by scanning each element at most once, and as few elements as possible.

• We compare our method with TwigStack [13] and Index-Filter [12] for processing multiple twig queries. Our experimental results show the effectiveness, scalability and efficiency of our algorithm for multiple twig queries processing.

1.4 Thesis Organization

The rest of this thesis is organized as follows.

In Chapter 2, we review related work, including XML indexing and labeling, structural join matching, XML filtering, and multiple XPath queries processing.

In Chapter 3, we present the preliminaries of XML, including the XML data model, twig patterns and holistic twig matching. This background is used throughout the rest of the thesis.

In Chapter 4, we introduce the concept of the super-twig for integrating multiple twig patterns into one twig pattern. First, we define the super-twig, which is an extension of the normal twig pattern, and describe how to construct and represent it. Next, we design an algorithm for constructing the super-twig. It produces a unique formal expression for each XPath query, which expedites the construction of the super-twig.

In Chapter 5, we first describe our framework for processing multiple twig patterns. Then we introduce the index structure for storing XML data in our method. Based on the super-twig, we design a novel algorithm to match the super-twig against an XML document.

In Chapter 6, we compare our MTwigStack with TwigStack and Index-Filter on both real and synthetic data sets. We show the experimental results and analyze them.


Literature Review

Many algorithms have been proposed to match XML twig patterns. Zhang et al. [59] proposed a variation of the traditional merge join algorithm, the multi-predicate merge join (MPMGJN), based on two inverted list indexes: E-index (on elements) and T-index (on text). The positions of XML elements and string values are represented as (DocId, LeftPos:RightPos, LevelNum). Al-Khalifa et al. [8] proposed the tree-merge and stack-tree algorithms to improve I/O and CPU performance using the same representation of positions of XML elements. Both papers first decompose the twig pattern into binary structural relationships, then use structural join algorithms to match the binary structural relationships, and finally merge these matches. A limitation of these approaches is that intermediate result sizes may be very large, because the join results of individual binary relationships may not appear in the final results.

Later on, Bruno et al. [13] improved on these methods by proposing a holistic twig join algorithm, called TwigStack. In this algorithm, each query node q of a twig pattern has an element stream T_q, which contains the labels of all document nodes with tag q in an XML document. The elements in the stream are sorted by their start position (i.e., the start value of the region-based code). Each node q is also associated with a stack S_q, which helps the algorithm to generate intermediate partial results. The algorithm has two phases: phase one outputs intermediate root-to-leaf path matches, and phase two merges these path matches to get the final results. The algorithm can greatly reduce the size of the intermediate results compared with the previous algorithms. However, the method is suboptimal if there are parent-child relationships in the twig queries. That is, it may still generate useless intermediate results in the presence of P-C relationships in twig patterns.

Jiang et al. [30] proposed the TSGeneric algorithm, which uses the XR-tree [29] index to improve twig pattern matching. The method can skip elements and achieve sub-linear performance for twig queries. However, it still does not eliminate useless intermediate results in the presence of P-C relationships. Later on, an algorithm called TwigStackList [38] was proposed to answer twig queries that contain parent-child relationships. It makes use of a list data structure to cache elements that are potential answers to the twig query. Chen et al. [17] studied the properties of structural twig joins and the tradeoff between the increase in overhead to manage more element streams and the reduction in both I/O cost and intermediate result sizes caused by various XML streaming schemes. That paper proposed new Tag+Level and Prefix-Path streaming schemes and the iTwigJoin algorithm to improve the TwigStack algorithm of [13].

Jiang et al. [28] proposed the GTwigMerge algorithm based on [30]. It focuses on resolving OR-predicates in query twig patterns. PathStack¬ [31] and TwigStackList¬ [58] were proposed to answer queries with NOT-predicates. Lu et al. [40] proposed an algorithm, called OrderedTJ, to match ordered XML twig queries.

Tatarinov et al. [52] proposed an XML order encoding method called Dewey Order, based on the Dewey Decimal Classification developed for general knowledge classification [3]. Lu et al. [39] proposed a labeling scheme based on Dewey ID [52], called extended Dewey ID. Given the extended Dewey label of an element, the names of all its ancestors can be derived by a finite state transducer (FST). Hence the algorithm only scans the elements which appear as leaf nodes of the twig pattern query.
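As a rough illustration of why Dewey-style labels are convenient, the sketch below checks structural relationships on plain Dewey labels, where a label is the dot-separated path of sibling positions from the root. It covers only plain Dewey order; the extended Dewey scheme of [39] and its transducer are not modelled, and the function names are hypothetical.

```python
def parse_dewey(label):
    """Turn '1.2.3' into (1, 2, 3) so components compare as numbers."""
    return tuple(int(part) for part in label.split("."))


def is_ancestor(anc, desc):
    """anc is a proper ancestor of desc iff its label is a proper prefix."""
    a, d = parse_dewey(anc), parse_dewey(desc)
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(par, child):
    """par is the parent of child iff child extends par by one component."""
    p, c = parse_dewey(par), parse_dewey(child)
    return len(c) == len(p) + 1 and c[:-1] == p


print(is_ancestor("1.2", "1.2.3.1"), is_parent("1.2", "1.2.3"))
```

Parsing the label into integers rather than comparing strings avoids treating "1.2" as a prefix of "1.20.1".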

There are two main techniques, structural indexes and labeling schemes, to facilitate XML queries. Structural index approaches help to traverse the hierarchy of XML. Labeling scheme approaches can efficiently determine the ancestor-descendant and parent-child relationships between any two elements of an XML document.


DataGuides [24] derives and uses schema information to rewrite queries and guide the search. It records information on the existing paths in a database and uses this information as an index. DataGuides are restricted to a single regular expression and are not useful for more complex queries with several regular expressions. The 1-index [42] is an accurate structural summary that considers incoming paths up to the root of the whole graph. The method computes simulation and bisimulation sets of the graph to partition data nodes. Path expressions can be evaluated directly in the index graph and can retrieve label-matching nodes without referring to the original data graph. The A(k)-index [34] introduces the notion of k-bisimilarity to capture the local structures of a data graph. The A(k)-index can accurately support all path expressions of length up to k. However, path expressions longer than k must be validated in the data graph.

D(k)-index [16] is proposed to improve the 1-index and the A(k)-index. It can adapt its structure to the current query load. D(k)-index allows different index nodes to have different local similarity requirements that can be tailored to support a given set of frequently used path expressions. However, D(k)-index forces all index nodes with the same label to have the same local similarity, which is unnecessary and may increase the size of the index. Later, M(k)-index and M*(k)-index [27] were designed to improve D(k)-index. M(k)-index allows different k values for different nodes and is never over-refined for irrelevant index or data nodes; M*(k)-index maintains k-bisimilarity information for all k up to some desired maximum and can avoid over-refinement due to overqualified parents.

Kaushik et al. [32] proposed the Forward and Backward-Index (F&B-Index) to cover all branching path expression queries. It is the smallest covering index for Branching Path Queries (BPQ). Ramanan [48] defined Simulation, Bisimulation, and Quotient on an XML document to determine the smallest covering indexes for two subclasses of BPQ, namely BPQ+ and TPQ. Because the F&B-Index was proposed as a memory-based index while its size is usually large in practice, Wang et al. [55] presented a disk-based F&B-Index, which stores the index tree on disk, analyzes index access patterns, and places data that is frequently accessed together close to each other on disk.

The indexes above focus on covering all path expressions of an XML document. More recently, the XR-tree was proposed [29] for indexing XML data based on the region encoding, i.e., (start, end, level). An XR-tree is basically a B+-tree (built on the start field of all indexed elements) augmented with stab lists and bookkeeping information in internal nodes. Kaushik et al. [33] proposed a strategy that integrates structure indexes with information-retrieval style inverted lists. An algorithm for branching path expressions based on this strategy is introduced, and IR-style ranking is employed.

Some of the methods mentioned above build indexes on labeled XML data, and they mainly focus on static XML documents. Several approaches have been proposed to label dynamic XML data. Wu et al. [56] used prime numbers to label XML trees. Following a top-down approach, each node is given a unique prime number (its self label), and the label of each node is the product of its parent node’s label (parent label) and its own self label. O’Neil et al. [43] proposed the ORDPATH labeling method, which uses only odd numbers in the initial labeling. When the XML document is updated, it uses the even number between two odd numbers, concatenated with another odd number, to label an inserted node. However, this approach cannot completely avoid re-labeling, due to the overflow problem. Li and Ling [36] proposed a quaternary encoding approach (QED) for labeling schemes. Based on this encoding method, any existing labeling method can be improved so that existing nodes need not be re-labeled when an update is performed.
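The prime-number labeling idea described above can be sketched in a few lines: since every node contributes a distinct prime, one node is an ancestor of another exactly when its label divides the other's label. The tree, the node names and the choice to give the root a self label of 2 are assumptions made for this illustration and may differ from the details of [56].

```python
from itertools import count


def primes():
    """Yield 2, 3, 5, ... by trial division (fine for a small example)."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n


def label_tree(tree, root):
    """tree maps a node to its list of children; returns {node: label}."""
    gen = primes()
    labels = {}

    def visit(node, parent_label):
        # label = parent label * this node's own (unique) prime
        labels[node] = parent_label * next(gen)
        for child in tree.get(node, []):
            visit(child, labels[node])

    visit(root, 1)
    return labels


def is_ancestor(labels, anc, desc):
    """anc is a proper ancestor of desc iff anc's label divides desc's."""
    return anc != desc and labels[desc] % labels[anc] == 0


# Hypothetical tree, visited top-down in document order.
tree = {"bib": ["book1", "book2"], "book1": ["title1", "authors1"]}
labels = label_tree(tree, "bib")
print(labels)
```

Because a new node only needs an unused prime, inserting it does not change the labels of existing nodes, which is the property that makes this family of schemes attractive for dynamic documents.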

Some researchers have shown interest in sequence-based XML indexing, aiming to avoid expensive join operations in XML query processing. Wang et al. [54] proposed ViST, an index structure which consists of two parts, the D-Ancestor index and the S-Ancestor index, to index structure and content together. It uses one string sequence to represent the XML document and another string sequence to represent the query. It converts the query matching problem into subsequence matching between the document sequence and the query sequence. This method does not need to decompose the query twig pattern and join intermediate results.

Rao et al. [50] developed a system called PRIX for indexing XML documents and processing twig queries. PRIX transforms labeled XML documents into Prüfer [47] sequences and indexes the sequences with B+-trees. However, although these two methods avoid join operations in query processing, they resort to time-consuming operations to eliminate false alarms and false dismissals (post-processing for false alarms and processing of multiple isomorphic queries for false dismissals [53]).


2.3 XML Filtering

Recently, a large body of research has focused on publish-subscribe (pub-sub) systems based on XML document filtering [9, 20, 21, 22, 26, 35]. An XML filtering engine aims to provide fast matching of XML-encoded data against a large number of query specifications containing constraints on both structure and content.

XFilter [9] was the first such system proposed. It uses a Finite State Machine (FSM) to represent each path expression, with the location steps of the path expression mapped to machine states. Arriving XML documents are parsed with an event-based parser; the events raised during parsing are used to drive the FSMs through their various transitions. A query is said to match a document if, during parsing, an accepting state for that query is reached.
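The event-driven matching described above can be sketched as follows for a single query. This is a deliberately simplified illustration, not XFilter's implementation: it assumes an absolute path with child steps only (no ‘//’, wildcards or predicates), and the function name path_matches is hypothetical. The parser events come from the standard library's ElementTree.iterparse.

```python
import io
import xml.etree.ElementTree as ET


def path_matches(xml_text, steps):
    """Return True if the child-only path /steps[0]/steps[1]/... occurs."""
    # states[-1] is the number of query steps matched by the currently
    # open elements, or -1 once the open path has left the query path.
    states = [0]
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            state = states[-1]
            if 0 <= state < len(steps) and elem.tag == steps[state]:
                state += 1  # transition to the next machine state
                if state == len(steps):
                    return True  # accepting state reached
            else:
                state = -1  # dead state for this subtree
            states.append(state)
        else:
            states.pop()  # end tag: return to the parent's state
    return False


doc = "<a><b><c/></b><d/></a>"
print(path_matches(doc, ["a", "b", "c"]), path_matches(doc, ["a", "c"]))
```

Keeping one state per open element on a stack is what lets the machine resume from the parent's state when an end tag arrives.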

One problem with XFilter is that it creates a separate FSM for each individual query. In a large system where many queries are similar, this results in a huge amount of redundant processing, which slows down filtering and makes the system less scalable. Realizing that shared processing for structure matching is critical for high-performance XML filtering, a number of schemes have been proposed to improve on XFilter [15, 20, 44].

In particular, the YFilter system proposed by Diao et al. [20] combines all of the XPath queries into a single Nondeterministic Finite Automaton (NFA) that behaves as follows: (i) the NFA identifies the exact “language” defined by the union of all input path queries; (ii) when an output state is reached, the NFA outputs all matches for the queries accepted at that state. It exploits commonality among queries by merging common prefixes of the query paths so that they are processed at most once. The resulting shared processing provides tremendous improvements in structure matching performance. YFilter handles twig patterns by decomposing them into linear paths and then performing post-processing over the linear path matches. Hence, YFilter is not optimal for non-path queries such as twig queries.

FiST [35] is proposed to perform ordered holistic matching of twig patterns against incoming documents. It employs the Prüfer sequence [47] for an XML document. Its algorithm involves two phases: Progressive Subsequence Matching and Refinement for Branch Node Verification. A new data structure, the Runtime Global Stack, is introduced to store the tags along the path from the current tag being processed to the root of the document. Given a set of XPath expressions, FiST only identifies those XPath expressions that appear in a given XML document.

2.4 Multiple XML queries processing

Index-Filter [12] is proposed to answer multiple XML simple path queries. Unlike previous XML filtering systems, Index-Filter aims to find all matches of multiple single path queries in an XML document. Both index-based and navigation-based query processing strategies can be applied in its general scenario. That paper uses the representation of positions of XML elements introduced in [59]. In addition, a B-tree index is built on the tags to provide efficient access to the indexes of individual tags. To eliminate redundant processing, it identifies query commonalities and combines multiple queries into a single structure, called a prefix tree. It generalizes the PathStack algorithm of [13] and takes advantage of the prefix tree representation of the set of XML path queries to share computation during multiple query evaluation. Figure 2.1 shows four XPath queries and their prefix tree.

[Prefix-tree diagram. Queries recoverable from the figure: Q1 = /A//B/C/D, Q2 = /B/D, Q3 = /A//C//D]

Figure 2.1: XPath queries and their prefix tree
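The prefix-tree idea can be sketched with a small trie keyed on (axis, tag) steps. The three path expressions are the ones recoverable from Figure 2.1; the dictionary layout and the function names are assumptions made for this illustration and do not reproduce the data structures of Index-Filter.

```python
import re


def parse_path(query):
    """Split '/A//B/C' into [('/', 'A'), ('//', 'B'), ('/', 'C')]."""
    return re.findall(r"(//|/)([^/]+)", query)


def build_prefix_tree(queries):
    """queries maps a query name to a simple path expression."""
    root = {"children": {}, "queries": []}
    for name, query in queries.items():
        node = root
        for step in parse_path(query):
            # Queries that agree on a prefix walk the same trie nodes.
            node = node["children"].setdefault(
                step, {"children": {}, "queries": []}
            )
        node["queries"].append(name)  # the query ends at this node
    return root


def count_nodes(node):
    """Number of trie nodes, including the artificial root."""
    return 1 + sum(count_nodes(c) for c in node["children"].values())


queries = {"Q1": "/A//B/C/D", "Q2": "/B/D", "Q3": "/A//C//D"}
tree = build_prefix_tree(queries)
print(count_nodes(tree))
```

Here Q1 and Q3 share the step (/, A), so the nine steps of the three queries are stored in eight trie nodes below the root; the saving grows as more queries share longer prefixes.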

However, Index-Filter cannot process multiple twig queries efficiently. It has to decompose each twig pattern into several simple XPath queries, process them individually, and then merge their results to get the final results for the twig query. Given the two queries shown in Figure 2.2(a), Index-Filter has to decompose Q1 into two simple path queries, Q11 and Q12; it then combines the three queries into the prefix tree shown in Figure 2.2(c). Against the XML document shown in Figure 2.2(d), Q11, Q12 and Q2 are matched queries. In fact, Q1 does not match the document. Evidently, Index-Filter will identify many useless simple XPath queries when processing multiple twig queries.

[Figure 2.2: XPath queries and their prefix tree. Queries recoverable from the figure: Q11 = /A//B/C/D, Q12 = /A//B/E, Q2 = /A//E/F]

Based on the review above, much research has addressed how to index XML documents, how to match XML twig queries, and how to determine whether multiple XML twig patterns occur in an XML document, but none has focused on finding all occurrences of multiple XML twig queries in an XML document with a holistic approach.


We model XML documents as ordered trees, with each node corresponding to an element, an attribute, or a value, and the edges representing (direct) element-subelement, element-value or attribute-value relationships. Each node is assigned a label (start:end, level) based on its position in the data tree, and each text value is assigned a label that has the same start and end values [12, 13, 57]. Figure 3.1 shows an example XML data tree. The labeling model can be easily extended to multiple documents by introducing document ID information.

Structural relationships between tree nodes (elements, attributes or values) whose positions are labeled with this containment (region) encoding can be determined easily:

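[Tree diagram for Figure 3.1. Node labels recoverable from the figure, as (start:end, level) tag: (0:1000, 0) bib; (1:40, 1) book; (5:22, 2) authors; (6:13, 3) author; (14:21, 3) author; (15:17, 3) fn; (18:20, 3) ln; (3, 3) XML; (23:25, 2) year; (24, 3) 2004; (26:28, 3) title; (27, 4) Xml; (29:38, 3) section; (30:32, 4) title; (33, 5) XML index; (34:37, 4) keyword; (36, 5) index; (41:82, 1) book; (42:44, 2) title; (43, 3) Java; (45:54, 2) authors; (46:53, 3) author; (47:49, 3) fn; (48, 4) Jack; (50:52, 3) ln; (51, 4) Lee; (55:57, 2) year; (56, 3) 2003; (58:81, 2) chapter; (59:61, 3) title; (60, 4) Socket; (62:80, 3) section]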

Figure 3.1: An example XML tree with region codes

• ancestor-descendant (A-D): element u is an ancestor of element v if u.start < v.start and u.end > v.end;

• parent-child (P-C): element u is a parent of element v if u.start < v.start, u.end > v.end and u.level + 1 = v.level.
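The two conditions above translate directly into code. The sketch below uses four labels that appear in Figure 3.1 (bib, the first book, its authors element and one author); the Label type and the function names are assumptions made for this illustration.

```python
from typing import NamedTuple


class Label(NamedTuple):
    """Region code (start:end, level) of a node in the data tree."""
    start: int
    end: int
    level: int


def is_ancestor(u, v):
    """A-D: the region of u strictly contains the region of v."""
    return u.start < v.start and u.end > v.end


def is_parent(u, v):
    """P-C: containment plus a level difference of exactly one."""
    return is_ancestor(u, v) and u.level + 1 == v.level


# Labels taken from Figure 3.1.
bib = Label(0, 1000, 0)
book = Label(1, 40, 1)
authors = Label(5, 22, 2)
author = Label(6, 13, 3)

print(is_ancestor(book, author), is_parent(book, author))
```

Both tests need only the two labels being compared, which is why region codes let a join algorithm decide structural relationships without touching the document itself.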

3.2 Twig Pattern and Twig Pattern Matching

Queries in XML query languages make use of twig patterns to match relevant portions of data in an XML database. A twig pattern node may be an element tag, a text value or a wildcard “∗”. The query twig pattern edges are either parent-child edges (depicted using a single line) or ancestor-descendant edges (depicted using a double line). We now give some definitions about twig patterns.


Definition 1 A tree $t$ is a tuple $(r_t, N_t, E_t)$, where:

• $\aleph$ is an alphabet of nodes, $N_t \subseteq \aleph$ is the set of nodes of $t$;

• $N_{p'} \subseteq N_p$;

• the edge $(n_i, n_j)$ belongs to $PC_{p'}$ iff $n_i \in N_{p'}$, $n_j \in N_{p'}$ and $(n_i, n_j) \in PC_p$;

• the edge $(n_i, n_j)$ belongs to $AD_{p'}$ iff $n_i \in N_{p'}$, $n_j \in N_{p'}$ and $(n_i, n_j) \in AD_p$.


In our work, we only consider a fragment of XPath studied in [23], denoted $XP^{\{/,//,[\,]\}}$, consisting of the expressions that can be defined recursively by the following grammar:

exp → exp/exp | exp//exp | exp[exp] | σ

where σ is a symbol in an alphabet of node names. Given an $XP^{\{/,//,[\,]\}}$ expression e, a twig pattern p corresponding to e can be trivially defined.

For example, the XPath expression A[B/D//F]//C/E[//G/I]/H/J can be represented by the twig pattern p shown in Figure 3.2; $sp_B$ and $sp_C$ are two subpatterns of p.

Figure 3.2: A twig pattern $p$ and its subpatterns $sp_B$ and $sp_C$
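To make the correspondence between an expression and its twig pattern concrete, the following Python sketch parses an expression of the fragment into a tree whose edges are tagged PC or AD. It is a sketch under assumptions rather than the construction used in the thesis: `TwigNode`, `parse_twig` and `edges` are hypothetical names, a predicate without a leading axis is taken to hang off its node by a parent-child edge, and a predicate is allowed to begin with / or // so that the example E[//G/I] above can be parsed.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TwigNode:
    name: str
    # Each child is stored together with its incoming edge type: "PC" or "AD".
    children: List[Tuple[str, "TwigNode"]] = field(default_factory=list)


def parse_twig(expr: str) -> TwigNode:
    """Parse an expression built from /, // and [ ] into a twig pattern."""
    pos = 0

    def read_name() -> str:
        nonlocal pos
        start = pos
        while pos < len(expr) and expr[pos] not in "/[]":
            pos += 1
        if start == pos:
            raise ValueError(f"node name expected at offset {start}")
        return expr[start:pos]

    def read_axis() -> str:
        nonlocal pos
        if expr.startswith("//", pos):
            pos += 2
            return "AD"
        pos += 1  # a single "/"
        return "PC"

    def parse_path() -> TwigNode:
        nonlocal pos
        node = TwigNode(read_name())
        # Zero or more predicates, each contributing one branch.
        while pos < len(expr) and expr[pos] == "[":
            pos += 1
            leading_axis = pos < len(expr) and expr[pos] == "/"
            axis = read_axis() if leading_axis else "PC"
            node.children.append((axis, parse_path()))
            if pos >= len(expr) or expr[pos] != "]":
                raise ValueError(f"']' expected at offset {pos}")
            pos += 1
        # Optional continuation of the main path.
        if pos < len(expr) and expr[pos] == "/":
            axis = read_axis()
            node.children.append((axis, parse_path()))
        return node

    root = parse_path()
    if pos != len(expr):
        raise ValueError(f"unexpected character at offset {pos}")
    return root


def edges(node: TwigNode) -> List[Tuple[str, str, str]]:
    """List the edges as (parent, edge type, child) in pre-order."""
    out = []
    for axis, child in node.children:
        out.append((node.name, axis, child.name))
        out.extend(edges(child))
    return out


for edge in edges(parse_twig("A[B/D//F]//C/E[//G/I]/H/J")):
    print(edge)
```

For the example expression, the sketch yields a root A with a parent-child branch to B and an ancestor-descendant branch to C, which is the shape the text describes for p.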

For convenience, we distinguish between query and data nodes by using the term

node to refer to a query node and the term element to refer to an element, an attribute,

or content value in an XML document.


Given a twig pattern p and an XML document D, a match of p in D is identified

by a mapping from the nodes in p to the elements in D, such that:

(i) the query nodes are satisfied by the corresponding elements, attributes,

or values in the XML document;

(ii) the parent-child and ancestor-descendant relationships between query

nodes are satisfied by the corresponding database elements, attributes, and

values.
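Conditions (i) and (ii) can be checked directly by a brute-force enumeration, sketched below in Python. This is a naive baseline for illustration only, not the holistic algorithm discussed next. The query encoding `(tag, [(axis, child), ...])` and the names `Element`, `edge_holds`, `matches_at` and `find_matches` are assumptions of this sketch; the sample elements use region codes from the second book of Figure 3.1.

```python
from typing import List, NamedTuple, Tuple


class Element(NamedTuple):
    tag: str
    start: int
    end: int
    level: int


# Hypothetical query encoding: (tag, [(axis, child_query), ...]),
# where axis is "PC" (parent-child) or "AD" (ancestor-descendant).
Query = Tuple[str, list]


def edge_holds(u: Element, v: Element, axis: str) -> bool:
    """Condition (ii): the structural relationship required by one edge."""
    contains = u.start < v.start and u.end > v.end
    return contains and (axis == "AD" or u.level + 1 == v.level)


def matches_at(q: Query, e: Element, elements: List[Element]) -> List[tuple]:
    """All matches of the subpattern rooted at q when q is mapped to e.

    A match is a tuple of elements listed in pre-order of the query nodes.
    """
    tag, children = q
    if tag not in ("*", e.tag):  # condition (i): the node predicate
        return []
    partial = [(e,)]
    for axis, child in children:
        extended = []
        for c in elements:
            if edge_holds(e, c, axis):
                for sub in matches_at(child, c, elements):
                    extended.extend(m + sub for m in partial)
        partial = extended
    return partial


def find_matches(q: Query, elements: List[Element]) -> List[tuple]:
    return [m for e in elements for m in matches_at(q, e, elements)]


doc = [
    Element("bib", 0, 1000, 0),
    Element("book", 41, 82, 1),
    Element("title", 42, 44, 2),
    Element("chapter", 58, 81, 2),
    Element("title", 59, 61, 3),
    Element("section", 62, 80, 3),
]
q1 = ("book", [("PC", ("title", [])), ("AD", ("section", []))])  # book[title]//section
q2 = ("book", [("AD", ("title", []))])                           # book//title

print(len(find_matches(q1, doc)), len(find_matches(q2, doc)))
```

The enumeration compares every candidate pair of elements for every edge, which is the kind of repeated work that stream-based holistic joins are designed to avoid.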

3.3 Holistic Twig Join

The holistic method TwigStack, proposed by Bruno et al. [13], is CPU and I/O optimal for all path patterns and A-D only twig patterns. It associates each node q in the twig query with a stack $S_q$ and a stream $T_q$ containing all labels of tag q in document order. Each stream has an imaginary cursor which can either move to the next label or read the label under it. The algorithm operates in two main phases:

(i) TwigJoin: in this phase, a list of labels is output as an intermediate result for each root-to-leaf path of the twig query;

(ii) Merge: in this phase, the lists of label paths are merged to produce the final output.

When all the edges in the twig query are ancestor-descendant edges, TwigStack ensures that each path output in the first phase not only matches one path of the twig pattern but also is part of a match to the entire twig query. However, in the presence of parent-child edges in twig patterns, the TwigStack method is no longer optimal, since the first phase may output paths that do not contribute to any final match.
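The per-node bookkeeping described in this section, a stream with a cursor and a stack per query node, can be sketched as follows. This shows only the data structures, not the TwigStack algorithm itself. The names `Stream`, `build_streams` and `clean_stack` are hypothetical, and the stack-cleaning rule (pop entries whose region ends before the next label starts) is the one commonly described for stack-based structural joins, assumed here rather than taken from this text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Label(NamedTuple):
    start: int
    end: int
    level: int


class Stream:
    """Stream T_q: all labels of one tag in document order, with a cursor."""

    def __init__(self, labels: Iterable[Label]):
        self._labels = sorted(labels)  # document order = ascending start
        self._pos = 0

    def eof(self) -> bool:
        return self._pos >= len(self._labels)

    def head(self) -> Optional[Label]:
        """Read the label under the cursor without moving it."""
        return None if self.eof() else self._labels[self._pos]

    def advance(self) -> None:
        """Move the cursor to the next label."""
        if not self.eof():
            self._pos += 1


def build_streams(elements: Iterable[Tuple[str, Label]]) -> Dict[str, Stream]:
    """Group (tag, label) pairs into one stream per tag."""
    by_tag = defaultdict(list)
    for tag, label in elements:
        by_tag[tag].append(label)
    return {tag: Stream(labels) for tag, labels in by_tag.items()}


def clean_stack(stack: List[Label], nxt: Label) -> None:
    """Pop entries of S_q that end before nxt starts.

    Such entries cannot be ancestors of nxt or of any later label.
    """
    while stack and stack[-1].end < nxt.start:
        stack.pop()


streams = build_streams([
    ("book", Label(41, 82, 1)),
    ("title", Label(59, 61, 3)),
    ("title", Label(42, 44, 2)),
    ("book", Label(1, 40, 1)),
])

stack = [Label(1, 40, 1)]                    # first book is on the stack
clean_stack(stack, streams["title"].head())  # title 42:44 starts after 40
print(stack)
```

Because each cursor only moves forward, every label in a stream is read once, which is the property the optimality claims for TwigStack rely on.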

3.4 Problem Statement

In this thesis, we consider the scenario of matching multiple highly similar XML twig queries, all belonging to $XP^{\{/,//,[\,]\}}$, against an XML document, and focus on the following problem:

Multiple XML Twig Query Processing: Given an XML document D and a set of twig queries $Q = \{q_1, \ldots, q_n\}$, return the set $R = \{R_1, \ldots, R_n\}$, where $R_i$ is the answer (all matches) to $q_i$ on D.

We identify query commonalities and combine multiple queries into a single structure, which is an extension of the twig pattern. The results returned by the structure contain the results of all participating queries.


4 Utilizing Commonalities for Multiple Twigs

4.1 Defining Super-twig

When multiple twig queries are processed simultaneously, it is likely that significant commonalities between queries exist. To eliminate unnecessary processing while answering multiple queries, we identify query commonalities and combine multiple twig patterns into a single twig pattern, which we call a super-twig. The super-twig can significantly reduce the bookkeeping required to answer input queries, thus reducing the execution time of query processing.


4.1.1 Definitions

We will use n (and its variants such as $n_i$) to denote a node in the query, or the subtree rooted at n when there is no ambiguity. We extend twig patterns to super-twig patterns by introducing the concepts OptionalNode and OptionalLeafNode to distinguish a super-twig from general twig patterns.

In this thesis, we only consider the twig patterns belonging to the XPath fragment $XP^{\{/,//,[\,]\}}$.

Definition 4 Given a set of twig queries against an XML document, $Q = \{q_1, \ldots, q_k\}$, with $q_i \in XP^{\{/,//,[\,]\}}$ for $i = 1, 2, \ldots, k$, each query $q_i$ can be represented by a twig pattern $p_i = \langle t_{p_i}, \emptyset \rangle$, where $t_{p_i} = (r_{p_i}, N_{p_i}, E_{p_i})$ is a tree. We combine all the twig patterns into a single twig pattern, called the super-twig, which is represented as $p_s = \langle t_{p_s}, \emptyset \rangle$, where $t_{p_s} = (r_{p_s}, N_{p_s}, E_{p_s})$, such that:

• If there exist two patterns $p_i$ and $p_j$ such that $r_{p_i}$ is not the same as $r_{p_j}$, we rewrite the queries whose root nodes are not the root of the XML document by adding the document’s root as their root node. The root node of the super-twig pattern is then the same as the document’s root. That is, $r_{p_s} = r_{p_1} = r_{p_2} = \cdots = r_{p_k}$, or $r_{p_s}$ equals the document’s root;

• Each twig pattern $p_i$ is a subpattern of $p_s$;

• Suppose n is a query node of $p_i$ ($n \in N_{p_i}$) and also a query node of $p_j$ ($n \in N_{p_j}$). We give n an alias $n_i$ in $p_i$ and an alias $n_j$ in $p_j$. We process all the repeated nodes in the patterns $p_1, \ldots, p_k$ following this rule, and denote the new sets of nodes for $p_1, \ldots, p_k$ as $N_{p'_1}, \ldots, N_{p'_k}$. Then $N_{p_s} = N_{p'_1} \cup N_{p'_2} \cup \cdots \cup N_{p'_k}$;

• Repeated nodes may exist in the super-twig, but they must not appear as siblings;

• Suppose n is a query node which appears in some twig patterns, $p_i$ and $p_j$, where $i \neq j$, and the path nodes from the root node $r_{p_i}$ to n in $p_i$ are $(n_{i1}, \ldots, n_{ix}, n)$, and the path nodes from the root to n in $p_j$ are $(n_{j1}, \ldots, n_{jx}, n)$, where $n_{i1} = n_{j1}, n_{i2} = n_{j2}, \ldots, n_{ix} = n_{jx}$. Let the parent node of n be m (that is, $n_{ix}$ in $p_i$ and $n_{jx}$ in $p_j$). We denote the edge between m and n as $e_{mn}$. If $e_{mn} \in PC_{p_i}$ and $e_{mn} \in AD_{p_j}$, then $e_{mn} \in AD_{p_s}$ and the constraint is relaxed; otherwise,

• Following the same situations of point 7, if all the relationships between m and n
