Join operations in temporal databases
Dengfeng Gao1, Christian S. Jensen2, Richard T. Snodgrass1, Michael D. Soo3

1 Computer Science Department, P.O. Box 210077, University of Arizona, Tucson, AZ 85721-0077, USA
e-mail: {dgao,rts}@cs.arizona.edu
2 Department of Computer Science, Aalborg University, Fredrik Bajers Vej 7E, 9220 Aalborg Ø, Denmark
e-mail: csj@cs.auc.dk
3 Amazon.com, Seattle; e-mail: soo@amazon.com

Edited by T. Sellis. Received: October 17, 2002 / Accepted: July 26, 2003
Published online: October 28, 2003 – © Springer-Verlag 2003
Abstract. Joins are arguably the most important relational operators. Poor implementations are tantamount to computing the Cartesian product of the input relations. In a temporal database, the problem is more acute for two reasons. First, conventional techniques are designed for the evaluation of joins with equality predicates rather than the inequality predicates prevalent in valid-time queries. Second, the presence of temporally varying data dramatically increases the size of a database. These factors indicate that specialized techniques are needed to efficiently evaluate temporal joins.
We address this need for efficient join evaluation in temporal databases. Our purpose is twofold. We first survey all previously proposed temporal join operators. While many temporal join operators have been defined in previous work, this work has been done largely in isolation from competing proposals, with little, if any, comparison of the various operators. We then address evaluation algorithms, comparing the applicability of various algorithms to the temporal join operators and describing a performance study involving algorithms for one important operator, the temporal equijoin. Our focus, with respect to implementation, is on non-index-based join algorithms. Such algorithms do not rely on auxiliary access paths but may exploit sort orderings to achieve efficiency.
Keywords: Attribute skew – Interval join – Partition join –
Sort-merge join – Temporal Cartesian product – Temporal join
– Timestamp skew
1 Introduction
Time is an attribute of all real-world phenomena. Consequently, efforts to incorporate the temporal domain into database management systems (DBMSs) have been ongoing for more than a decade [39,55]. The potential benefits of this research include enhanced data modeling capabilities and more conveniently expressed and efficiently processed queries over time.
Whereas most work in temporal databases has concentrated on conceptual issues such as data modeling and query languages, recent attention has been on related implementation issues, most notably indexing and query processing strategies. In this paper, we consider an important subproblem of temporal query processing, the evaluation of ad hoc temporal join operations, i.e., join operations for which indexing or secondary access paths are not available or appropriate. Temporal indexing, which has been a prolific research area in its own right [44], and query evaluation algorithms that exploit such temporal indexes are beyond the scope of this paper.
Joins are arguably the most important relational operators. This is so because efficient join processing is essential for the overall efficiency of a query processor. Joins occur frequently due to database normalization and are potentially expensive to compute [35]. Poor implementations are tantamount to computing the Cartesian product of the input relations. In a temporal database, the problem is more acute. Conventional techniques are aimed at the optimization of joins with equality predicates, rather than the inequality predicates prevalent in temporal queries [27]. Moreover, the introduction of a time dimension may significantly increase the size of the database. These factors indicate that new techniques are required to efficiently evaluate joins over temporal relations.
This paper aims to present a comprehensive and systematic study of join operations in temporal databases, including both semantics and implementation. Many temporal join operators have been proposed in previous research, but little comparison has been performed with respect to the semantics of these operators. Similarly, many evaluation algorithms supporting these operators have been proposed, but little analysis has appeared with respect to their relative performance, especially in terms of empirical study.
The main contributions of this paper are the following:
• To provide a systematic classification of temporal join operators as natural extensions of conventional join operators
• To provide a systematic classification of temporal join evaluation algorithms as extensions of common relational query evaluation paradigms
• To empirically quantify the performance of the temporal join algorithms for one important, frequently occurring, and potentially expensive temporal operator
Our intention is for DBMS vendors to use the contributions of this paper as part of a migration path toward incorporating temporal support into their products. Specifically, we show that nearly all temporal query evaluation work to date has extended well-accepted conventional operators and evaluation algorithms. In many cases, these operators and techniques can be implemented with small changes to an existing code base and with acceptable, though perhaps not optimal, performance.
Research has identified two orthogonal dimensions of time in databases – valid time, modeling changes in the real world, and transaction time, modeling the update activity of the database [23,51]. A database may support none, one, or both of these time dimensions. In this paper, we consider only single-dimension temporal databases, so-called valid-time and transaction-time databases. Databases supporting both time dimensions, so-called bitemporal databases, are beyond the scope of this paper, though many of the described techniques extend readily to bitemporal databases. We will use the terms snapshot, relational, or conventional database to refer to databases that provide no integrated support for time.
The remainder of the paper is organized as follows. We propose a taxonomy of temporal join operators in Sect. 2. This taxonomy extends well-established relational operators to the temporal context and classifies all previously defined temporal operators. In Sect. 3, we develop a corresponding taxonomy of temporal join evaluation algorithms, all of which are non-index-based algorithms. The next section focuses on engineering the algorithms. It turns out that getting the details right is essential for good performance. In Sect. 5, we empirically investigate the performance of the evaluation algorithms with respect to one particular, and important, valid-time join operator. The algorithms are tested under a variety of resource constraints and database parameters. Finally, conclusions and directions for future work are offered in Sect. 6.
2 Temporal join operators
In the past, temporal join operators were defined in different temporal data models; at times, essentially the same operators were even given different names when defined in different models. Further, the existing join algorithms have also been constructed within the contexts of different data models. This section enables the comparison of join definitions and implementations across data models. We thus proceed to propose a taxonomy of temporal joins and then use this taxonomy to classify all previously defined temporal joins.
We take as our point of departure the core set of conventional relational joins that have long been accepted as "standard" [35]: Cartesian product (whose "join predicate" is the constant expression TRUE), theta join, equijoin, natural join, left and right outerjoin, and full outerjoin. For each of these, we define a temporal counterpart that is a natural, temporal generalization of it. This generalization hinges on the notion of snapshot equivalence [26], which states that two temporal relations are equivalent if they consist of the same sequence of time-indexed snapshots. We note that some other join operators do exist, including semijoin, antisemijoin, and difference. Their temporal counterparts have been explored elsewhere [11] and are not considered here.
Having defined this set of temporal joins, we show how all previously defined operators are related to this taxonomy of temporal joins. The previous operators considered include Cartesian product, Θ-JOIN, EQUIJOIN, NATURAL JOIN, TIME JOIN [6,7], TE JOIN, TE OUTERJOIN, and EVENT JOIN [20,46,47,52], as well as those based on Allen's [1] interval relations [27,28,36]. We show that many of these operators incorporate less restrictive predicates or use specialized attribute semantics and thus are variants of one of the taxonomic joins.
2.1 Temporal join definitions
To be specific, we base the definitions on a single data model. We choose the model that is used most widely in temporal data management implementations, namely, the one that timestamps each tuple with an interval. We assume that the timeline is partitioned into minimal-duration intervals, termed chronons [12], and we denote intervals by inclusive starting and ending chronons.

We define two temporal relational schemas, R and S, as follows.

R = (A1, ..., An, Ts, Te)
S = (B1, ..., Bm, Ts, Te)

The Ai, 1 ≤ i ≤ n, and Bi, 1 ≤ i ≤ m, are the explicit attributes, and Ts and Te are the timestamp start and end attributes, recording when the information recorded by the explicit attributes holds (or held or will hold) true. We will use T as shorthand for the interval [Ts, Te], A as shorthand for {A1, ..., An}, and B as shorthand for {B1, ..., Bm}. We will use r and s to denote instances of R and S, respectively.
Example 1. Consider the following two temporal relations. The relations show the canonical example of employees, the departments they work for, and the managers who supervise those departments.

[Employee relation with schema (EmpName, Dept, T); Manager relation with schema (Dept, MgrName, T)]
2.2 Cartesian product
The temporal Cartesian product is a conventional Cartesian product with a predicate on the timestamp attributes. To define it, we need two auxiliary definitions.

First, intersect(U, V), where U and V are intervals, returns TRUE if there exists a chronon t such that t ∈ U ∧ t ∈ V. Second, overlap(U, V) returns the maximum interval contained in its two argument intervals; if no nonempty interval exists, the function returns ∅. To state this more precisely, let first and last return the smallest and largest, respectively, of two argument chronons. Also, let Us and Ue denote, respectively, the starting and ending chronons of U, and similarly for V.
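These two functions are straightforward to render in code. The following Python sketch is illustrative only; it assumes closed intervals represented as pairs of integer chronons, which is our representation, not the paper's.

from typing import Optional, Tuple

Interval = Tuple[int, int]  # inclusive [start, end], in chronons

def intersect(u: Interval, v: Interval) -> bool:
    # TRUE iff some chronon t lies in both U and V
    return u[0] <= v[1] and v[0] <= u[1]

def overlap(u: Interval, v: Interval) -> Optional[Interval]:
    # The maximum interval contained in both arguments; None plays the
    # role of the empty interval. "Last of the starts, first of the ends"
    # mirrors the first/last functions in the text.
    s, e = max(u[0], v[0]), min(u[1], v[1])
    return (s, e) if s <= e else None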
Definition 1. The temporal Cartesian product, r ×ᵀ s, of two temporal relations r and s is defined as follows.

r ×ᵀ s = {z^(n+m+2) | ∃x ∈ r ∃y ∈ s (
    z[A] = x[A] ∧ z[B] = y[B] ∧
    z[T] = overlap(x[T], y[T]) ∧ z[T] ≠ ∅)}

The second line of the definition sets the explicit attribute values of the result tuple z to the concatenation of the explicit attribute values of x and y. The third line computes the timestamp of z and ensures that it is nonempty.
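A direct, executable reading of Definition 1 may also help. The Python sketch below (reusing the overlap helper above) enumerates all pairs of tuples, each represented as a pair of an explicit-attribute tuple and an interval; it illustrates the semantics only and is not one of the evaluation algorithms of Sect. 3.

def temporal_cartesian_product(r, s):
    # r and s: lists of (explicit_attrs, interval) pairs
    result = []
    for a, t1 in r:
        for b, t2 in s:
            t = overlap(t1, t2)   # timestamp of the result tuple z
            if t is not None:     # keep z only if z[T] is nonempty
                result.append((a + b, t))
    return result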
Example 2. Consider the query "Show the names of employees and managers where the employee worked for the company while the manager managed some department in the company." This can be satisfied using the temporal Cartesian product.

Employee ×ᵀ Manager
The overlap function is necessary and sufficient to ensure snapshot reducibility, as will be discussed in detail in Sect. 2.7. Basically, we want the temporal Cartesian product to act as though it is a conventional Cartesian product applied independently at each point in time. When operating on interval-stamped data, this semantics corresponds to an intersection: the result will be valid during those times when contributing tuples from both input relations are valid.
The temporal Cartesian product was first defined by Segev and Gunadhi [20,47]. This operator was termed the time join, and the abbreviation T-join was used. Clifford and Croker [7] defined a Cartesian product operator that is a combination of the temporal Cartesian product and the temporal outerjoin, to be defined shortly. The interval join is a building block of the (spatial) rectangle join [2]. The interval join is a one-dimensional spatial join that can thus be used to implement the temporal Cartesian product.
2.3 Theta join
Like the conventional theta join, the temporal theta join supports an unrestricted predicate P on the explicit attributes of its input arguments. Let σ denote the standard selection operator.

Definition 2. The temporal theta join, r ⋈ᵀ_P s, of two temporal relations r and s selects those tuples from r ×ᵀ s that satisfy predicate P(r[A], s[B]). It is defined as follows.

r ⋈ᵀ_P s = σ_P(r[A],s[B])(r ×ᵀ s)

A form of this operator, the Θ-JOIN, was proposed by Clifford and Croker [6]. This operator was later extended to allow computations more general than overlap on the timestamps of result tuples [53].
2.4 Equijoin
Like the snapshot equijoin, the temporal equijoin operator enforces equality matching among specified subsets of the explicit attributes of the input relations.

Definition 3. The temporal equijoin on two temporal relations r and s on attributes A′ ⊆ A and B′ ⊆ B is defined as the temporal theta join with predicate P ≡ r[A′] = s[B′].

Like the temporal theta join, the temporal equijoin was first defined by Clifford and Croker [6]. A specialized operator, the TE-join, was developed independently by Segev and Gunadhi [47]. The TE-join requires the explicit join attribute to be a surrogate attribute of both input relations. Essentially, a surrogate attribute would be a key attribute of a corresponding nontemporal schema. In a temporal context, a surrogate attribute value represents a time-invariant object identifier. If we augment schemas R and S with surrogate attributes ID, then the TE-join can be expressed using the temporal equijoin as follows.

r ⋈ᵀ_{r[ID]=s[ID]} s

The temporal equijoin was also generalized by Zhang et al. to yield the generalized TE-join, termed the GTE-join, which specifies that the joined tuples must have their keys in a specified range while their intervals should intersect a specified interval [56]. The objective was to focus on tuples within interesting rectangles in the key-time space.

2.5 Natural join
The temporal natural join and the temporal equijoin bear the same relationship to one another as their snapshot counterparts. That is, the temporal natural join is simply a temporal equijoin on identically named explicit attributes, followed by a subsequent projection operation.

To define this join, we augment our relation schemas with explicit join attributes, Ci, 1 ≤ i ≤ k, which we abbreviate C.

Definition 4. The temporal natural join of r and s, r ⋈ᵀ s, is defined as follows.

r ⋈ᵀ s = {z^(n+m+k+2) | ∃x ∈ r ∃y ∈ s (x[C] = y[C] ∧
    z[A] = x[A] ∧ z[B] = y[B] ∧ z[C] = y[C] ∧
    z[T] = overlap(x[T], y[T]) ∧ z[T] ≠ ∅)}

The first two lines ensure that tuples x and y agree on the values of the join attributes C and set the explicit attributes of the result tuple z to the concatenation of the nonjoin attributes A and B and a single copy of the join attributes, C. The third line computes the timestamp of z as the overlap of the timestamps of x and y and ensures that x[T] and y[T] actually overlap.
This operator was first defined by Clifford and Croker [6], who named it the natural time join. We showed in earlier work that the temporal natural join plays the same important role in reconstructing normalized temporal relations as the snapshot natural join for normalized snapshot relations [25]. Most previous work in temporal join evaluation has addressed, either implicitly or explicitly, the implementation of the temporal natural join or the closely related temporal equijoin.
2.6 Outerjoins and outer Cartesian products
Like the snapshot outerjoin, temporal outerjoins and Cartesian products retain dangling tuples, i.e., tuples that do not participate in the join. However, in a temporal database, a tuple may dangle over a portion of its time interval and be covered over others; this situation must be accounted for in a temporal outerjoin or Cartesian product.

We may define the temporal outerjoin as the union of two subjoins, like the snapshot outerjoin. The two subjoins are the temporal left outerjoin and the temporal right outerjoin. As the left and right outerjoins are symmetric, we define only the left outerjoin.
We need two auxiliary functions. The coalesce function collapses value-equivalent tuples – tuples with mutually equal nontimestamp attribute values [23] – in a temporal relation into a single tuple with the same nontimestamp attribute values and a timestamp that is the finite union of intervals that precisely contains the chronons in the timestamps of the value-equivalent tuples. (A finite union of time intervals is termed a temporal element [15], which we represent in this paper as a set of chronons.) The definition of coalesce uses the function chronons that returns the set of chronons contained in the argument interval.

coalesce(r) = {z^(n+1) |
    ∃x ∈ r (z[A] = x[A] ∧
    ∀x′ ∈ r (x′[A] = x[A] ⇒ chronons(x′[T]) ⊆ z[T]) ∧
    ∀t ∈ z[T] ∃x″ ∈ r (x″[A] = x[A] ∧ t ∈ chronons(x″[T])))}

The second and third lines of the definition coalesce all value-equivalent tuples in relation r. The last line ensures that no spurious chronons are generated.
We now define a function expand that returns the set of maximal intervals contained in an argument temporal element T.

expand(T) = {[ts, te] |
    ts ∈ T ∧ te ∈ T ∧ ∀t (ts ≤ t ≤ te ⇒ t ∈ T) ∧
    ts − 1 ∉ T ∧
    te + 1 ∉ T}

The second line ensures that a member of the result is an interval contained in T. The last two lines ensure that the interval is indeed maximal.
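Both auxiliary functions can be sketched in Python under the assumption that a temporal element is represented as a set of integer chronons and a tuple as an (explicit attributes, interval) pair; the grouping strategy is ours, chosen for brevity.

from collections import defaultdict

def chronons(interval):
    # the set of chronons contained in an inclusive interval
    return set(range(interval[0], interval[1] + 1))

def coalesce(r):
    # one tuple per distinct explicit value, timestamped with the union
    # of the chronons of all value-equivalent tuples (a temporal element)
    groups = defaultdict(set)
    for a, t in r:
        groups[a] |= chronons(t)
    return list(groups.items())

def expand(elem):
    # the maximal intervals contained in a temporal element
    out, ts = [], sorted(elem)
    i = 0
    while i < len(ts):
        j = i
        while j + 1 < len(ts) and ts[j + 1] == ts[j] + 1:
            j += 1
        out.append((ts[i], ts[j]))  # ts[i]-1 and ts[j]+1 are not in elem
        i = j + 1
    return out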
We are now ready to define the temporal left outerjoin. Let R and S be defined as for the temporal equijoin, and let A′ ⊆ A and B′ ⊆ B be the explicit join attributes.

Definition 5. The temporal left outerjoin, r ⟕ᵀ_{r[A′]=s[B′]} s, of two temporal relations r and s is defined as follows.

r ⟕ᵀ_{r[A′]=s[B′]} s = {z^(n+m+2) |
    ∃x ∈ coalesce(r) ∃y ∈ coalesce(s) (
        x[A′] = y[B′] ∧ z[A] = x[A] ∧ z[T] ≠ ∅ ∧
        ((z[B] = y[B] ∧ z[T] ∈ expand(x[T] ∩ y[T])) ∨
        (z[B] = null ∧ z[T] ∈ expand(x[T] − y[T])))) ∨
    ∃x ∈ coalesce(r) ∀y ∈ coalesce(s) (
        x[A′] ≠ y[B′] ∧ z[A] = x[A] ∧ z[B] = null ∧
        z[T] ∈ expand(x[T]) ∧ z[T] ≠ ∅)}
The first five lines of the definition handle the case where, for a tuple x deriving from the left argument, a tuple y with matching explicit join attribute values is found. For those time intervals of x that are not shared with y, we generate tuples with null values in the attributes of y. The final three lines of the definition handle the case where no matching tuple y is found; tuples with null values in the attributes of y are generated.
The temporal outerjoin may be defined as simply the union of the temporal left and the temporal right outerjoins (the union operator eliminates the duplicate equijoin tuples). Similarly, a temporal outer Cartesian product is a temporal outerjoin without the equijoin condition (A′ = B′ = ∅).

Gunadhi and Segev were the first researchers to investigate outerjoins over time. They defined a specialized version of the temporal outerjoin called the EVENT JOIN [47]. This operator, of which the temporal left and right outerjoins were components, used a surrogate attribute as its explicit join attribute. This definition was later extended to allow any attributes to serve as the explicit join attributes [53]. A specialized version of the left and right outerjoins, called the TE-outerjoin, was also defined. The TE-outerjoin incorporated the TE-join, i.e., the temporal equijoin, as a component.

Clifford and Croker [7] defined a temporal outer Cartesian product, which they termed simply Cartesian product.
2.7 Reducibility
We proceed to show how the temporal operators reduce to snapshot operators. Reducibility guarantees that the semantics of the snapshot operator is preserved in its more complex temporal counterpart.

For example, the semantics of the temporal natural join reduces to the semantics of the snapshot natural join in that the result of first joining two temporal relations and then transforming the result to a snapshot relation yields a result that is the same as that obtained by first transforming the arguments to snapshot relations and then joining the snapshot relations. This commutativity diagram is shown in Fig. 1 and stated formally in the first equality of the following theorem.

Fig. 1. Reducibility of temporal natural join to snapshot natural join
The timeslice operation τᵀ_t takes a temporal relation r as argument and a chronon t as parameter. It returns the corresponding snapshot relation, i.e., with the schema of r but without the timestamp attributes, that contains (the nontimestamp portion of) all tuples x from r for which t belongs to x[T]. It follows from the theorem below that the temporal joins defined here reduce to their snapshot counterparts.
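To make the reduction concrete, the following Python sketch implements the timeslice operator over the tuple representation of the earlier sketches and checks, for one chronon, the reducibility of the temporal Cartesian product; the sorting is merely a way to compare the two sides as multisets.

def timeslice(rel, t):
    # tau_t: the snapshot at chronon t, with timestamps dropped
    return sorted(attrs for attrs, iv in rel if iv[0] <= t <= iv[1])

def snapshot_product(r_snap, s_snap):
    return sorted(a + b for a in r_snap for b in s_snap)

def reduces_at(r, s, t):
    # timeslice(r x^T s, t) should equal timeslice(r, t) x timeslice(s, t)
    lhs = timeslice(temporal_cartesian_product(r, s), t)
    rhs = snapshot_product(timeslice(r, t), timeslice(s, t))
    return lhs == rhs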
Theorem 1. Let t denote a chronon and let r and s be relation instances of the proper types for the operators they are applied to. Then the following holds for all t; analogous equivalences hold for the other temporal joins defined above.

τᵀ_t(r ⋈ᵀ s) = τᵀ_t(r) ⋈ τᵀ_t(s)

Proof: An equivalence is shown by proving its two inclusions separately. The nontimestamp attributes of r and s are AC and BC, respectively, where A, B, and C are sets of attributes and C denotes the join attribute(s) (cf. the definition of the temporal natural join). We prove one inclusion of the first equivalence, that is, τᵀ_t(r ⋈ᵀ s) ⊆ τᵀ_t(r) ⋈ τᵀ_t(s). Let x ∈ τᵀ_t(r ⋈ᵀ s) (the left-hand side of the equivalence to be proved). Then there is a tuple x′ ∈ r ⋈ᵀ s such that x′[ABC] = x and t ∈ x′[T]. By the definition of ⋈ᵀ, there exist tuples x1 ∈ r and x2 ∈ s such that x1[C] = x2[C], x′[A] = x1[A], x′[B] = x2[B], x′[C] = x1[C], and x′[T] = overlap(x1[T], x2[T]). Because t ∈ x′[T], we have t ∈ x1[T] and t ∈ x2[T], so x1[AC] ∈ τᵀ_t(r) and x2[BC] ∈ τᵀ_t(s). These two tuples agree on C and thus join, yielding x ∈ τᵀ_t(r) ⋈ τᵀ_t(s). The remaining inclusions are shown similarly.

2.8 Summary
We have defined a taxonomy for temporal join operators. The taxonomy was constructed as a natural extension of corresponding snapshot database operators. We also briefly described how previously defined temporal operators are accommodated in the taxonomy.

Table 1 summarizes how previous work is represented in the taxonomy. For each operator defined in previous work, the table lists the defining publication, researchers, the corresponding taxonomy operator, and any restrictions assumed by the original operators. In early work, Clifford [8] indicated that an INTERSECTION JOIN should be defined that represents the categorized nonouter joins and Cartesian products, and he proposed that a UNION JOIN be defined for the outer variants.
3 Evaluation algorithms
In the previous section, we described the semantics of all previously proposed temporal join operators. We now turn our attention to implementation algorithms for these operators. As before, our purpose is to enumerate the space of algorithms applicable to the temporal join operators, thereby providing a consistent framework within which existing temporal join evaluation algorithms can be placed.

Our approach is to extend well-understood paradigms from conventional query evaluation to temporal databases. Algorithms for temporal join evaluation are necessarily more complex than their snapshot counterparts. Whereas snapshot evaluation algorithms match input tuples based on their explicit join attributes, temporal join evaluation algorithms typically must additionally ensure that temporal restrictions are met. Furthermore, this problem is exacerbated in two ways. Timestamps are typically complex data types, e.g., intervals requiring inequality predicates, which conventional query processors are not optimized to handle. Also, a temporal database is usually larger than a corresponding snapshot database due to the versioning of tuples.
We consider non-index-based algorithms. Index-based algorithms use an auxiliary access path, i.e., a data structure that identifies tuples or their locations using a join attribute value. Non-index-based algorithms do not employ auxiliary access paths. While some attention has been focused on index-based temporal join algorithms, the large number of temporal indexes that have been proposed in the literature [44] precludes a thorough investigation in this paper.

We first provide a taxonomy of temporal join algorithms. This taxonomy, like the operator taxonomy of Table 1, is based on well-established relational concepts. Sections 3.2 and 3.3 describe the algorithms in the taxonomy and place existing work within the given framework. Finally, conclusions are offered in Sect. 3.4.
3.1 Evaluation taxonomy
All binary relational query evaluation algorithms, including those computing conventional joins, are derived from four basic paradigms: nested-loop, partitioning, sort-merge, and index-based [18].

Table 1. Temporal join operators

Operator | Initial citation | Taxonomy operator | Restrictions
Cartesian product | [7] | Outer Cartesian product | None
⋮

Restrictions:
1 = restricts also the valid time of the result tuples
2 = matching only on surrogate attributes
3 = includes also intersection predicates with an argument surrogate range and a time range
Partition-based join evaluation divides the input tuples into buckets using the join attributes of the input relations as key values. Corresponding buckets of the input relations contain all tuples that could possibly match with one another, and the buckets are constructed to best utilize the available main memory buffer space. The result is produced by performing an in-memory join of each pair of corresponding buckets from the input relations.

Sort-merge join evaluation also divides the input relation but uses physical memory loads as the units of division. The memory loads are sorted, producing sorted runs, and written to disk. The result is produced by merging the sorted runs, where qualifying tuples are matched and output tuples generated.

Index-based join evaluation utilizes indexes defined on the join attributes of the input relations to locate joining tuples efficiently. The index could be preexisting or built on the fly.
Elmasri et al. presented a temporal join algorithm that utilizes a two-level time index, which used a B+-tree to index the explicit attribute in the upper level, with the leaves referencing other B+-trees indexing time points [13]. Son and Elmasri revised the time index to require less space and used this modified index to determine the partitioning intervals in a partition-based timestamp algorithm [52]. Bercken and Seeger proposed several temporal join algorithms based on a multiversion B+-tree (MVBT) [4]. Later, Zhang et al. described several algorithms based on B+-trees, R∗-trees [3], and the MVBT for the related GTE-join. This operation requires that joined tuples have key values that belong to a specified range and have time intervals that intersect a specified interval [56]. The MVBT assumes that updates arrive in increasing time order, which is not the case for valid-time data. We focus on non-index-based join algorithms that apply to both valid-time and transaction-time relations, and we do not discuss these index-based joins further.
We adapt the basic non-index-based algorithms (nested-loop, partitioning, and sort-merge) to support temporal joins. To enumerate the space of temporal join algorithms, we exploit the duality of partitioning and sort-merge [19]. In particular, the division step of partitioning, where tuples are separated based on key values, is analogous to the merging step of sort-merge, where tuples are matched based on key values. In the following, we consider the characteristics of sort-merge algorithms and apply duality to derive corresponding characteristics of partition-based algorithms.

For a conventional relation, sort-based join algorithms order the input relation on the input relations' explicit join attributes. For a temporal relation, which includes timestamp attributes in addition to explicit attributes, there are four possibilities for ordering the relation. First, the relation can be sorted by the explicit attributes exclusively. Second, the relation can be ordered by time, using either the starting or ending timestamp [29,46]. The choice of starting or ending timestamp dictates an ascending or descending sort order, respectively. Third, the relation can be ordered primarily by the explicit attributes and secondarily by time [36]. Finally, the relation can be ordered primarily by time and secondarily by the explicit attributes.

By duality, the division step of partition-based algorithms can partition using any of these options [29,46]. Hence four choices exist for the dual steps of merging in sort-merge or partitioning in partition-based methods.
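For concreteness, the four orderings can be written as sort keys; the following Python fragment assumes tuples of the form (explicit attributes, (start, end)) and is ours, not taken from any of the cited algorithms.

key_explicit      = lambda x: x[0]             # 1: explicit attributes only
key_time          = lambda x: x[1][0]          # 2: ascending start time
key_explicit_time = lambda x: (x[0], x[1][0])  # 3: explicit, then time
key_time_explicit = lambda x: (x[1][0], x[0])  # 4: time, then explicit

# e.g., sorted(r, key=key_explicit_time) produces the ordering assumed
# by the explicit/timestamp sort algorithms of Table 2.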
We use this distinction to categorize the different approaches to temporal join evaluation. The first approach above, using the explicit attributes as the primary matching attributes, we term explicit algorithms. Similarly, we term the second approach timestamp algorithms. We retain the generic term temporal algorithm to mean any algorithm to evaluate a temporal operator.

Finally, it has been recognized that the choice of buffer allocation strategy, GRACE or hybrid [9], is independent of whether a sort-based or partition-based approach is used [18]. Hybrid policies retain most of the last run of the outer relation in main memory and so minimize the flushing of intermediate buffers to disk, thereby potentially decreasing the I/O cost.

Figure 2 lists the choices of sort-merge vs. partitioning, the possible sorting/partitioning attributes, and the possible buffer allocation strategies.

Fig. 2. Space of possible evaluation algorithms

Combining all possibilities yields 16 possible evaluation algorithms. Including the basic nested-loop algorithm and the GRACE and hybrid variants of the sort-based interval join mentioned in Sect. 2.2 results in a total of 19 possible algorithms. The 19 algorithms are named and described in Table 2.
We noted previously that time intervals lack a natural order. From this point of view, spatial join is similar because there is no natural order preserving spatial closeness. Previous work on spatial join may be categorized into three approaches. Early work [37,38] used a transformation approach based on space-filling curves, performing a sort-merge join along the curve to solve the join problem. Most of the work falls into index-based approaches, utilizing spatial index structures such as the R-tree [21], R+-tree [48], R∗-tree [3], Quad-tree [45], or seeded tree [31]. While some algorithms use preexisting indexes, others build the indexes on the fly.
In recent years, some work has focused on non-index-based spatial join approaches. Two partition-based spatial join algorithms have been proposed. One of them [32] partitions the input relations into overlapping buckets and uses an indexed nested-loop join to perform the join within each bucket. The other [40] partitions the input relations into disjoint partitions and uses a computational-geometry-based plane-sweep algorithm that can be thought of as the spatial equivalent of the sort-merge algorithm. Arge et al. [2] introduced a highly optimized implementation of the sweeping-based algorithm that first sorts the data along the vertical axis and then partitions the input into a number of vertical strips. Data in each strip can then be joined by an internal plane-sweep algorithm. All the above non-index-based spatial join algorithms use a sort- or partition-based approach or combine these two approaches in one algorithm, which is the approach we adopt in some of our temporal join algorithms (Sect. 4.3.2).
In the next two sections, we examine the space of explicit algorithms and timestamp algorithms, respectively, and classify existing approaches using the taxonomy developed in this section. We will see that most previous work in temporal join evaluation has centered on timestamp algorithms. However, for expository purposes, we first examine those algorithms based on manipulation of the nontimestamp columns, which we term "explicit" algorithms.
3.2 Explicit algorithms
Previous work has largely ignored the fact that conventional query evaluation algorithms can be easily modified to evaluate temporal joins. In this section, we show how the three paradigms of query evaluation can support temporal join evaluation. To make the discussion concrete, we develop an algorithm to evaluate the valid-time natural join, defined in Sect. 2, for each of the three paradigms. We begin with the simplest paradigm, nested-loop evaluation.

3.2.1 Nested-loop-based algorithms

for each block b_r ∈ r
    for each block b_s ∈ s
        for each tuple x ∈ b_r
            for each tuple y ∈ b_s
                if x[C] = y[C] ∧ intersect(x[T], y[T]) then
                    add the join of x and y, timestamped with overlap(x[T], y[T]), to the result

Fig. 3. Temporal nested-loop join
The algorithm operates as follows. One relation is designated the outer relation, the other the inner relation [35,18]. The outer relation is scanned once. For each block of the outer relation, the inner relation is scanned. When a block of the inner relation is read into memory, the tuples in that "inner block" are joined with the tuples in the "outer block."

The temporal nested-loop join is easily constructed from this basic algorithm. All that is required is that the timestamp predicate be evaluated at the same time as the predicate on the explicit attributes. Figure 3 shows the temporal algorithm. (In the figure, r is the outer relation and s is the inner relation. We assume their schemas are as defined in Sect. 2.)

While conceptually simple, nested-loop-based evaluation is often not competitive due to its quadratic cost. We now describe temporal variants of the sort-merge and partition-based algorithms, which usually exhibit better performance.
3.2.2 Sort-merge-based algorithms
Sort-merge join algorithms consist of two phases. In the first phase, the input relations r and s are sorted by their join attributes. In the second phase, the result is produced by simultaneously scanning r and s, merging tuples with identical values for their join attributes.

Complications arise if the join attributes are not key attributes of the input relations. In this case, multiple tuples in r and in s may have identical join attribute values. Hence a given r tuple may join with many s tuples, and vice versa. (This is termed skew [30].)

As before, we designate one relation as the outer relation and the other as the inner relation.
Table 2. Algorithm taxonomy

Algorithm | Acronym | Description
Nested loop | NL | Block-oriented nested loop
Explicit sort | ES | GRACE sort-merge by explicit attributes
Hybrid explicit sort | ES-H | Hybrid sort-merge by explicit attributes
Timestamp sort | TS | GRACE sort-merge by timestamps
Hybrid timestamp sort | TS-H | Hybrid sort-merge by timestamps
Explicit/timestamp sort | ETS | GRACE sort-merge by explicit attributes/time
Hybrid explicit/timestamp sort | ETS-H | Hybrid sort-merge by explicit attributes/time
Timestamp/explicit sort | TES | GRACE sort-merge by time/explicit attributes
Hybrid timestamp/explicit sort | TES-H | Hybrid sort-merge by time/explicit attributes
Interval join | TSI | GRACE sort-merge by timestamps
Hybrid interval join | TSI-H | Hybrid sort-merge by timestamps
Explicit partitioning | EP | GRACE partitioning by explicit attributes
Hybrid explicit partitioning | EP-H | Hybrid partitioning by explicit attributes
Timestamp partitioning | TP | Range partitioning by time
Hybrid timestamp partitioning | TP-H | Hybrid range partitioning by time
Explicit/timestamp partitioning | ETP | GRACE partitioning by explicit attributes/time
Hybrid explicit/timestamp partitioning | ETP-H | Hybrid partitioning by explicit attributes/time
Timestamp/explicit partitioning | TEP | GRACE partitioning by time/explicit attributes
Hybrid timestamp/explicit partitioning | TEP-H | Hybrid partitioning by time/explicit attributes
structure state
integer current block;
integer current tuple;
integer first block;
integer first tuple;
block tuples;
Fig. 4. State structure for merge scanning
When consecutive tuples in the outer relation have identical values for their explicit join attributes, i.e., their nontimestamp join attributes, the scan of the inner relation is "backed up" to ensure that all possible matches are found. Prior to showing the explicitSortMerge algorithm, we define a suite of algorithms that manage the scans of the input relations. For each scan, we maintain the state structure shown in Fig. 4. The fields current block and current tuple together indicate the current tuple in the scan by recording the number of the current block and the index of the current tuple within that block. The fields first block and first tuple are used to record the state at the beginning of a scan of the inner relation in order to back up the scan later if needed. Finally, tuples stores the block of the relation currently in memory. For convenience, we treat the block as an array of tuples.
The initState algorithm shown in Fig. 5 initializes the state of a scan. Essentially, counters are set to guarantee that the first block read and the first tuple scanned are the first block and the first tuple within that block in the input relation. We assume that a seek operation is available that repositions the file pointer associated with a relation to a given block number.

The advance algorithm advances the scan of the argument relation and state to the next tuple in the sorted relation. If the current block has been exhausted, then the next block of the relation is read. Otherwise, the state is updated to mark the next tuple in the current block as the next tuple in the scan.
initState(relation, state):
state.current block ← 1;
state.current tuple ← 0;
state.first block ←⊥;
state.first tuple ←⊥;
seek(relation, state.current block);
state.tuples ← read block(relation);
advance(relation, state):
if (state.current tuple = MAX TUPLES)
state.tuples ← read block(relation);
state.current block ← state.current block + 1;
state.current tuple ← 1;
else
state.current tuple ← state.current tuple + 1;
currentTuple(state):
return state.tuples[state.current tuple]
backUp(relation, state):
if (state.current block ≠ state.first block)
state.current block ← state.first block;
seek(relation, state.current block);
state.tuples ← read block(relation);
state.current tuple ← state.first tuple;
markScanStart(state):
state.first block ← state.current block;
state.first tuple ← state.current tuple;
Fig. 5. Merge algorithms
Fig. 6. The explicitSortMerge algorithm
The currentTuple algorithm merely returns the next tuple in the scan, as indicated by the scan state. Finally, the backUp and markScanStart algorithms manage the backing up of the inner relation scan. The backUp algorithm reverts the current block and tuple counters to their last values. These values are stored in the state at the beginning of a scan by the markScanStart algorithm.
We are now ready to exhibit the explicitSortMerge algorithm, shown in Fig. 6. The algorithm accepts three parameters, the input relations r and s and the join attributes C. We assume that the schemas of r and s are as given in Sect. 2. Tuples from the outer relation are scanned in order. For each outer tuple, if the tuple matches the previous outer tuple, the scan of the inner relation is backed up to the first matching inner tuple. The starting location of the scan is recorded in case backing up is needed by the next outer tuple, and the scan proceeds forward as normal. The complexity of the algorithm, as well as its performance degradation as compared with conventional sort-merge, is due largely to the bookkeeping required to back up the inner relation scan. We consider this performance hit in more detail in Sect. 4.2.2.
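The essential logic of explicitSortMerge can be conveyed by the following Python sketch, under simplifying assumptions: both inputs are in-memory lists of (C, A, T) triples already sorted on the explicit join attribute C, so index arithmetic replaces the block-oriented scan state of Figs. 4 and 5.

def explicit_sort_merge(r, s):
    # r: list of (c, a, t) triples sorted on c; s: list of (c, b, t) triples
    result = []
    j = mark = 0
    for i, (c_r, a, t_r) in enumerate(r):
        if i > 0 and c_r == r[i - 1][0]:
            j = mark                    # backUp: revisit the value packet
        else:
            while j < len(s) and s[j][0] < c_r:
                j += 1                  # advance the inner scan
            mark = j                    # markScanStart
        k = j
        while k < len(s) and s[k][0] == c_r:
            _, b, t_s = s[k]
            t = overlap(t_r, t_s)       # temporal predicate and timestamp
            if t is not None:
                result.append((c_r, a, b, t))
            k += 1
    return result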
Segev and Gunadhi developed three algorithms based on explicit sorting, differing primarily by the code in the inner loop and by whether backup is necessary. Two of the algorithms, TEJ-1 and TEJ-2, support the temporal equijoin [46]; the remaining algorithm, EJ-1, evaluates the temporal outerjoin [46].

TEJ-1 is applicable if the equijoin condition is on the surrogate attributes of the input relations. The surrogate attributes are essentially key attributes of a corresponding snapshot schema. TEJ-1 assumes that the input relations are sorted primarily by their surrogate attributes and secondarily by their starting timestamps. The surrogate matching, sort ordering, and 1TNF assumption described in Sect. 3.3.1 allow the result to be produced with a single scan of both input relations, with no backup.
The second equijoin algorithm, TEJ-2, is applicable when the equijoin condition involves any explicit attributes, surrogate or not. TEJ-2 assumes that the input relations are sorted primarily by their explicit join attribute(s) and secondarily by their starting timestamps. Note that since the join attribute can be a nonsurrogate attribute, tuples sharing the same join attribute value may overlap in valid time. Consequently, TEJ-2 requires the scan of the inner relation to be backed up in order to find all tuples with matching explicit attributes.

For the EVENT JOIN, Segev and Gunadhi described the sort-merge-based algorithm EJ-1. EJ-1 assumes that the input relations are sorted primarily by their surrogate attributes and secondarily by their starting timestamps. Like TEJ-1, the result is produced by a single scan of both input relations.
3.2.3 Partition-based algorithms
As in sort-merge-based algorithms, partition-based algorithms have two distinct phases. In the first phase, the input relations are partitioned based on their join attribute values. The partitioning is performed so that a given bucket produced from one input relation contains tuples that can only match with tuples contained in the corresponding bucket of the other input relation. Each produced bucket is also intended to fill the allotted main memory. Typically, a hash function is used as the partitioning agent. Both relations are filtered through the same hash function, producing two parallel sets of buckets. In the second phase, the join is computed by comparing tuples in corresponding buckets of the input relations. Partition-based algorithms have been shown to have superior performance when the relative sizes of the input relations differ [18].

A partitioning algorithm for the temporal natural join is shown in Fig. 7. The algorithm accepts as input two relations r and s and the names of the explicit join attributes C. We assume that the schemas of r and s are as given in Sect. 2.

As can be seen, the explicit partition-based join algorithm is conceptually very simple. One relation is designated the outer relation, the other the inner relation. After partitioning, each bucket of the outer relation is read in turn. For a given "outer bucket," each page of the corresponding "inner bucket" is read, and tuples in the buffers are joined.

The partitioning step in Fig. 7 is performed by the partition algorithm. This algorithm takes as its first argument an input relation. The resulting n partitions are returned in the remaining parameters. Algorithm partition assumes that a hash function hash is available that accepts the join attribute values x[C] as input and returns an integer, the index of the target bucket, as its result.
for i ← 1 to n
    outer bucket ← read partition(r_i);
    for each page p ∈ s_i
        p ← read page(s_i);
        for each tuple x ∈ outer bucket
            for each tuple y ∈ p
                if x[C] = y[C] ∧ intersect(x[T], y[T]) then
                    add the join of x and y, timestamped with overlap(x[T], y[T]), to the result

Fig. 7. Partition-based evaluation of the temporal natural join
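An in-memory Python rendering of this scheme follows; bucket spooling to disk and memory-sized buckets are elided, and the (C, A, T) tuple format matches the earlier sketches.

def partition(rel, n):
    # divide tuples into n buckets by hashing the join attribute value
    buckets = [[] for _ in range(n)]
    for tup in rel:
        buckets[hash(tup[0]) % n].append(tup)
    return buckets

def explicit_partition_join(r, s, n=16):
    result = []
    # join only corresponding buckets: tuples in bucket i of r can
    # match only tuples in bucket i of s
    for r_i, s_i in zip(partition(r, n), partition(s, n)):
        for c_r, a, t_r in r_i:
            for c_s, b, t_s in s_i:
                if c_r == c_s:
                    t = overlap(t_r, t_s)
                    if t is not None:
                        result.append((c_r, a, b, t))
    return result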
3.3 Timestamp algorithms

In contrast to the algorithms of the previous section, timestamp algorithms perform their primary matching on the timestamps associated with tuples.

In this section, we enumerate, to the best of our knowledge, all existing timestamp-based evaluation algorithms for the temporal join operators described in Sect. 2. Many of these algorithms assume sort ordering of the input by either their starting or ending timestamps. While such assumptions are valid for many applications, they are not valid in the general case, as valid-time semantics allows correction and deletion of previously stored data. (Of course, in such cases one could re-sort within the join.) As before, all of the algorithms described here are derived from nested-loop, sort-merge, or partitioning; we do not consider index-based temporal joins.
3.3.1 Nested-loop-based timestamp algorithms
One timestamp nested-loop-based algorithm has been proposed for temporal join evaluation. Like the EJ-1 algorithm described in the previous section, Segev and Gunadhi developed their algorithm, EJ-2, for the EVENT JOIN [47,20] (Table 1).

EJ-2 does not assume any ordering of the input relations. It does assume that the explicit join attribute is a distinguished surrogate attribute and that the input relations are in Temporal First Normal Form (1TNF). Essentially, 1TNF ensures that tuples within a single relation that have the same surrogate value may not overlap in time.

EJ-2 simultaneously produces the natural join and left outerjoin in an initial phase and then computes the right outerjoin in a subsequent phase.

For the first phase, the inner relation is scanned once from front to back for each outer relation tuple. For a given outer relation tuple, the scan of the inner relation is terminated when the inner relation is exhausted or the outer tuple's timestamp has been completely overlapped by matching inner tuples. The outer tuple's natural join is produced as the scan progresses. The outer tuple's left outerjoin is produced by tracking the subintervals of the outer tuple's timestamp that are not overlapped by any inner tuples. An output tuple is produced for each subinterval remaining at the end of the scan. Note that main memory buffer space must be allocated to contain the nonoverlapped subintervals of the outer tuple.
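The subinterval bookkeeping can be sketched as follows in Python; the function name and representation are ours, not from the original description of EJ-2. Each matching inner interval is subtracted from the pieces of the outer tuple's timestamp that remain uncovered, and whatever pieces survive the scan yield the null-padded output tuples.

def subtract(pieces, iv):
    # remove the closed interval iv from every uncovered piece
    out = []
    for ps, pe in pieces:
        if iv[1] < ps or iv[0] > pe:     # disjoint: piece survives intact
            out.append((ps, pe))
            continue
        if iv[0] > ps:
            out.append((ps, iv[0] - 1))  # fragment to the left of iv
        if iv[1] < pe:
            out.append((iv[1] + 1, pe))  # fragment to the right of iv
    return out

# e.g., subtract([(1, 10)], (4, 6)) == [(1, 3), (7, 10)]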
In the second phase, the roles of the inner and outer relations are reversed. Now, since the natural join was produced during the first phase, only the right outerjoin needs to be computed. The right outerjoin tuples are produced in the same manner as above, with one small optimization: if it is known that a tuple of the (current) outer relation did not join with any tuples during the first phase, then no scanning of the inner relation is required, and the corresponding outerjoin tuple is produced immediately.

Incidentally, Zurek proposed several algorithms for evaluating the temporal Cartesian product on multiprocessors based on nested loops [57].
3.3.2 Sort-merge-based timestamp algorithms
To date, four sets of researchers – Segev and Gunadhi, Leung and Muntz, Pfoser and Jensen, and Rana and Fotouhi – have developed timestamp sort-merge algorithms. Additionally, a one-dimensional spatial join algorithm proposed by Arge et al. can be used to implement a temporal Cartesian product.

Segev and Gunadhi modified the traditional merge-join algorithm to support the T-join and the temporal equijoin [47,20]. We describe the algorithms for each of these operators in turn.

For the T-join, the relations are sorted in ascending order of starting timestamp. The result is produced by a single scan of the input relations.

For the temporal equijoin, two timestamp sorting algorithms, named TEJ-3 and TEJ-4, are presented. Both TEJ-3 and TEJ-4 assume that their input relations are sorted by starting timestamp only. TEJ-4 is applicable only if the equijoin condition is on the surrogate attribute. In addition to assuming that the input relations are sorted by their starting timestamps, TEJ-4 assumes that all tuples with the same surrogate value are linked, thereby allowing all tuples with the same surrogate to be retrieved when the first is found. The result is produced with a linear scan of both relations, with random access needed to traverse the surrogate chains.
Like TEJ-2, TEJ-3 is applicable for temporal equijoins on both surrogate and explicit attribute values. TEJ-3 assumes that the input relations are sorted in ascending order of their starting timestamps, but no sort order is assumed on the explicit join attributes. Hence TEJ-3 requires that the inner relation scan be backed up should consecutive tuples in the outer relation have overlapping interval timestamps.
Leung and Muntz developed a series of algorithms based on the sort-merge algorithm to support temporal join predicates such as "contains" and "intersect" [1]. Although their algorithms do not explicitly support predicates on nontemporal attribute values, their techniques are easily modified to support more complex join operators such as the temporal equijoin. Like Segev and Gunadhi, this work describes evaluation algorithms appropriate for different sorting assumptions and access paths.

Leung and Muntz use a stream-processing approach. Abstractly, the input relations are considered as sequences of time-sorted tuples where only the tuples at the front of the streams may be read. The ordering of the tuples is a tradeoff with the amount of main memory needed to compute the join. For example, Leung and Muntz show how a contain join [1] can be computed if the input streams are sorted in ascending order of their starting timestamp. They summarize, for various sort orders on the starting and ending timestamps, what tuples must be retained in main memory during the join computation. A family of algorithms is developed assuming different orderings (ascending/descending) of the starting and ending timestamps.
Leung and Muntz also show how checkpoints, essentially the set of tuples valid during some chronon, can be used to evaluate temporal joins where the join predicate implies some overlap between the participating tuples. Here, the checkpoints actually contain tuple identifiers (TIDs) for the tuples valid during the specified chronon and the TIDs of the next tuples in the input streams. Suppose a checkpoint exists at time t. Using this checkpoint, the set of tuples participating in a join over a time interval containing t can be computed by using the cached TIDs and "rolling forward" using the TIDs of the next tuples in the streams.
Rana and Fotouhi proposed several techniques to improve the performance of time-join algorithms in which they claimed to use a nested-loop approach [43]. Since they assumed the input relations were sorted by start time and/or end time, these algorithms are more like the second phase of sort-merge-based timestamp algorithms. The algorithms are very similar to the sort-merge-based algorithms developed by Segev and Gunadhi.
Arge et al. described the interval join, a one-dimensional spatial join algorithm, which is a building block of a two-dimensional rectangle join [2]. Each interval is defined by a lower boundary and an upper boundary. The problem is to report all intersections between an interval in the outer relation and an interval in the inner relation. If the interval is a time interval instead of a spatial interval, this problem is equivalent to the temporal Cartesian product. They assumed the two input relations were first sorted by the algorithm into one list by their lower boundaries. The algorithm maintains two initially empty lists of tuples with "active" intervals, one for each input relation. When the sorted list is scanned, the current tuple is put into the active list of the relation it belongs to and joins only with the tuples in the active list of the other relation. Tuples becoming inactive during scanning are removed from the active list.
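The following Python sketch renders this active-list scheme under the assumptions of the earlier sketches (each tuple carries an inclusive (start, end) interval); eagerly pruning expired intervals stands in for the original's list maintenance.

def interval_join(r, s):
    # merge both inputs into one event list ordered by lower boundary
    events = [(x[1][0], 0, x) for x in r] + [(y[1][0], 1, y) for y in s]
    events.sort(key=lambda e: e[0])
    active = ([], [])            # active tuples from r (side 0) and s (side 1)
    result = []
    for start, side, tup in events:
        other = 1 - side
        # intervals that ended before `start` can no longer intersect anything
        active[other][:] = [z for z in active[other] if z[1][1] >= start]
        active[side].append(tup)
        for z in active[other]:  # every surviving active tuple intersects tup
            result.append((tup, z) if side == 0 else (z, tup))
    return result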
Most recently, Pfoser and Jensen [41] applied the sort-merge approach to the temporal theta join in a setting where each argument relation consists of a noncurrent and a current partition. Tuples in the former all have intervals that end before the current time, while all tuples of the latter have intervals that end at the current time. They assume that updates arrive in time order, so that tuples in noncurrent partitions are ordered by their interval end times and tuples in current partitions are ordered by their interval start times. A join then consists of three different kinds of subjoins. They develop two join algorithms for this setting and subsequently use these algorithms for incremental join computation.

As can be seen from the above discussion, a large number of timestamp-based sort-merge algorithms have been proposed, some for specific join operators. However, each of these proposals has been developed largely in isolation from other work, with little or no cross comparison. Furthermore, published performance figures have been derived mainly from analytical models rather than from empirical observations. An empirical comparison, as provided in Sect. 5, is needed to truly evaluate the different proposals.

3.3.3 Partition-based timestamp algorithms
Partitioning a relation over explicit attributes is relatively straightforward if the partitioning attributes have discrete values. Partitioning over time is more difficult since our timestamps are intervals, i.e., range data, rather than discrete values. Previous timestamp partitioning algorithms therefore developed various means of range partitioning the time intervals associated with tuples.

In previous work, we described a valid-time join algorithm using partitioning [54]. This algorithm was presented in the context of evaluating the valid-time natural join, though it is easily adapted to other temporal joins. The range partitioning used by this algorithm mapped tuples to singular buckets and dynamically migrated the tuples to other buckets as needed during the join computation. This approach avoided data redundancy, and the associated I/O overhead, at the expense of more complex buffer management.
Sitzmann and Stuckey extended this algorithm by using histograms to decide the partition boundaries [49]. Their algorithm takes the number of long-lived tuples into consideration, which renders its performance insensitive to the number of long-lived tuples. However, it relies on a preexisting temporal histogram.
Lu et al. described another range-partitioning algorithm for computing temporal joins [33]. This algorithm is applicable to theta joins, where a result tuple is produced for each pair of input tuples with overlapping valid-time intervals. Their approach is to map intervals to a two-dimensional plane, which is then partitioned into regions. The join result is produced by computing the subjoins of pairs of partitions corresponding to adjacent regions in the plane. This method applies to a restricted temporal model where future time is not allowed. They utilize a spatial index to speed up the joining phase.
Table 3. Existing algorithms and taxonomy counterparts

Algorithm | Researchers | Taxonomy counterpart | Restrictions
TEJ-1 | Segev and Gunadhi | Explicit/timestamp sort | Surrogate attribute and 1TNF
TEJ-2 | Segev and Gunadhi | Explicit/timestamp sort | None
EJ-2 | Segev and Gunadhi | Nested-loop | Surrogate attribute and 1TNF
EJ-1 | Segev and Gunadhi | Explicit/timestamp sort | Surrogate attribute and 1TNF
Time-join | Segev and Gunadhi | Timestamp sort | None
TEJ-3 | Segev and Gunadhi | Timestamp sort | None
TEJ-4 | Segev and Gunadhi | Timestamp sort | Surrogate attribute/access chain
Several | Leung and Muntz | Timestamp sort | None
Two | Pfoser and Jensen | Timestamp sort | Partitioned relation; time-ordered updates
– | Sitzmann and Stuckey | Timestamp partition | Requires preexisting temporal histogram
– | Lu et al. | Timestamp partition | Disallows future time; uses spatial index
3.4 Summary
We have surveyed temporal join algorithms and proposed a taxonomy of such algorithms. The taxonomy was developed by adapting well-established relational query evaluation paradigms to the temporal operations.

Table 3 summarizes how each temporal join algorithm proposed in previous work is classified in the taxonomy. We believe that the framework is complete since, disregarding data-model-specific considerations, all previous work naturally fits into one of the proposed categories.
One important property of an algorithm is whether it delivers a partial answer before the entire input is read. Among the algorithms listed in Table 3, only the nested-loop algorithm has this property. Partition-based algorithms have to scan the whole input relation to produce the partitions. Similarly, sort-based algorithms have to read the entire input to sort the relation. We note, however, that it is possible to modify the temporal sort-based algorithms to be nonblocking, using the approach of progressive merge join [10].
4 Engineering the algorithms
As noted in the previous section, an adequate empirical investigation of the performance of temporal join algorithms has not been performed. We concentrate on the temporal equijoin, defined in Sect. 2.4. This join and the related temporal natural join are needed to reconstruct normalized temporal relations [25]. To perform a study of implementations of this join, we must first provide state-of-the-art implementations of the 19 different types of algorithms outlined for this join. In this section, we discuss our implementation choices.
4.1 Nested-loop algorithm
We implemented a simple block-oriented nested-loop algorithm. Each block of the outer relation is read in turn into memory. The outer block is sorted by the explicit joining attribute (actually, pointers are sorted to avoid copying of tuples). Each block of the inner relation is then brought into memory. For a given inner block, each tuple in that block is joined by binary searching the sorted tuples.

This algorithm is simpler than the nested-loop algorithm, EJ-2, described in Sect. 3.3.1 [20,47]. In particular, our algorithm computes only the valid-time equijoin, while EJ-2 computes the valid-time outerjoin, which includes the equijoin in the form of the valid-time natural join. However, our algorithm supports a more general equijoin condition than EJ-2 in that we support matching on any explicit attribute rather than solely on a designated surrogate attribute.
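A minimal in-memory Python sketch of this variant follows, with blocks modeled as lists and the binary search provided by the standard bisect module; buffering and I/O are elided, and the tuple format matches the earlier sketches.

from bisect import bisect_left

def block_nested_loop_join(r_blocks, s_blocks):
    # r_blocks, s_blocks: lists of blocks, each a list of (c, attrs, interval)
    result = []
    for br in r_blocks:
        outer = sorted(br, key=lambda x: x[0])  # "sort the pointers"
        keys = [x[0] for x in outer]
        for bs in s_blocks:
            for c, b, t_s in bs:
                i = bisect_left(keys, c)        # binary search for c
                while i < len(outer) and keys[i] == c:
                    _, a, t_r = outer[i]
                    t = overlap(t_r, t_s)
                    if t is not None:
                        result.append((c, a, b, t))
                    i += 1
    return result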
4.2 Sort-merge algorithms

4.2.1 Basic algorithm

The sort phase proceeds by generating many small, fully sorted runs and then repeatedly merges these into increasingly longer runs until a single run is obtained (this is done for the left-hand side and the right-hand side independently). Each step of the sort phase reads and writes the entire relation. The merge phase then scans the fully sorted left-hand and right-hand relations to produce the output relation. A common optimization is to stop the sorting phase one step early, when there is a small number of fully sorted runs; the final merge step is done in parallel with the merge phase of the join, thereby avoiding one read and one write scan. Our sort-merge algorithms implemented for the performance analysis are based on this optimization. We generated initial runs using an in-memory quicksort on the explicit attributes (ES and ES-H), the timestamp attributes (TS and TS-H), or both (ETS and ETS-H) and then merged the two relations on multiple runs.
4.2.2 Efficient skew handling
As noted in Sect. 3.2.2, sort-merge join algorithms become complicated when the join attributes are not key attributes. Our previous work on conventional joins [30] shows that intrinsic skew is generally present in this situation. Even a small amount of intrinsic skew can result in a significant performance hit because the naive approach to handling skew is to reread the previous tuples in the same value packet (the tuples containing identical values for the equijoin attribute); this rereading involves additional I/O operations. We previously proposed several techniques to handle skew efficiently [30]. Among them, SC-n (spooled cache on multiple runs) was recommended due to its strikingly better performance in the presence of skew for both conventional and band joins; it also exhibits virtually identical performance to a traditional sort-merge join in the absence of skew. SC-n uses a small cache to hold the skewed tuples from the right-hand relation that satisfy the join condition. At the cache's overflow point, the cache data are spooled to disk.
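The essence of the spooled cache is easy to state in code. The sketch below is a minimal Python rendition under assumptions of our own (a tuple-count capacity and pickle-based spooling); the actual SC-n algorithm additionally coordinates the cache with the multiple sorted runs.

    import pickle
    import tempfile

    class SpooledCache:
        """A small in-memory cache that spools to disk on overflow (sketch)."""

        def __init__(self, capacity):
            self.capacity = capacity          # maximum number of cached tuples
            self.tuples = []
            self.spool = None                 # overflow file, created lazily

        def add(self, t):
            if len(self.tuples) >= self.capacity:
                if self.spool is None:
                    self.spool = tempfile.TemporaryFile()
                pickle.dump(self.tuples, self.spool)   # spool the full batch to disk
                self.tuples = []
            self.tuples.append(t)

        def scan(self):
            """Yield spooled tuples first, then the in-memory ones."""
            if self.spool is not None:
                self.spool.seek(0)
                while True:
                    try:
                        batch = pickle.load(self.spool)
                    except EOFError:
                        break
                    yield from batch
                self.spool.seek(0, 2)          # restore the append position
            yield from self.tuples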
Skew is prevalent in temporal joins. SC-n can be adapted for temporal joins by adding a supplemental predicate (requiring that the tuples overlap) and calculating the resulting timestamps by intersection. We adopt this spooled cache in ES instead of rereading the previous tuples. The advantage of using a spooled cache is shown in Fig. 8; ES Reread is the multirun version of the explicitSortMerge algorithm exhibited in Sect. 3.2.2, which backs up the right-hand relation when a duplicate value is found in the left-hand relation.
The two algorithms were executed in the TimeIT system. The parameters are the same as those that will be used in Sect. 5.1. In this experiment, the memory size was fixed at 8 MB and the cache size at 32 KB. The relations were generated with different percentages of smooth skew on the explicit attribute. A relation has 1% smooth skew when 1% of the tuples in the relation have one duplicate value on the join attribute and the remaining 98% of the tuples have no duplicates. Since the cache can hold the skewed tuples in memory, no additional I/O is caused by backing up the relation. The performance improvement of using a cache is approximately 25% when the data have 50% smooth skew. We thus use a spooled cache to handle skew. Spooling will generally not occur but is available in case a large value packet is present.
4.2.3 Time-varying value packets and optimized prediction rule
ES utilizes a prediction rule to judge if skew is present. (Recall that skew occurs if two tuples have the same join attribute value.) The prediction rule works as follows: when the last tuple in the right-hand relation (RHR) buffer is visited, the last tuple in the left-hand relation (LHR) buffer is checked to determine if skew is present and the current RHR value packet needs to be put into the cache.
We also implemented an algorithm (TS) that sorts the input relations by start time rather than by the explicit join attribute. Here the RHR value packet associated with a specific LHR tuple is not composed of those RHR tuples with the same start time but rather of those RHR tuples that overlap with the interval of the LHR tuple. Hence value packets are not disjoint, and they grow and shrink as one scans the LHR. In particular, TS puts into the cache only those tuples that could overlap in the future: the tuples that do not stop too early, that is, before subsequent LHR tuples start. For an individual LHR tuple, the RHR value packet starts with the first tuple that stops sometime during the LHR tuple's interval and goes through the first RHR tuple that starts after the LHR tuple stops. Value packets are also not totally ordered when sorting by start time.

These considerations suggest that we change the prediction rule in TS. When the RHR reaches a block boundary, the maximum stop time in the current value packet is compared with the start time of the last tuple in the LHR buffer. If the maximum stop time of the RHR value packet is less than the last start time of the LHR, none of the tuples in the value packet will overlap with the subsequent LHR tuples, so there is no need to put them in the cache. Otherwise, the value packet is scanned and only those tuples with a stop time greater than the last start time of the LHR are put into the cache, thereby minimizing the utilization of the cache and thus the possibility of having to spool a value packet to disk.
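A sketch of this optimized prediction rule follows, assuming tuples are represented as (start, stop, payload) triples and a cache object like the one sketched in Sect. 4.2.2; all names are illustrative.

    def cache_surviving_tuples(value_packet, lhr_buffer, cache):
        """At an RHR block boundary, cache only the value-packet tuples that
        can still overlap later LHR tuples (sketch)."""
        last_lhr_start = lhr_buffer[-1][0]       # start time of last LHR tuple in buffer
        max_stop = max(stop for _, stop, _ in value_packet)
        if max_stop < last_lhr_start:
            return                               # nothing can overlap subsequent LHR tuples
        for t in value_packet:
            if t[1] > last_lhr_start:            # stop time reaches upcoming LHR tuples
                cache.add(t)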
To make our work complete, we also implemented TES, which sorts the input relations primarily by start time and secondarily by the explicit attribute. The logic of TES is exactly the same as that of TS for the joining phase. We expect that the extra sorting by the explicit attribute will not help to optimize the algorithm but rather will simply increase the CPU time.
4.2.4 Specialized cache purging

Since the cache size is small, it can fill up if a value packet is very large or if several value packets accumulate in the cache. For the former, nothing but spooling the cache can be done. However, purging the cache periodically can avoid unnecessary cache spooling for the latter and may result in fewer I/O operations.

Purging the cache costs more in TS, since the RHR value packets are not disjoint, whereas in ES they are disjoint both in each run and in the cache. The cache purging process in ES scans the cache from the beginning and stops as soon as the first tuple that belongs to the current value packet is met. But in TS, this purging stage cannot stop until the whole cache has been scanned, because the tuples belonging to the current value packet are spread across the cache. An inner long-lived tuple can be kept in the cache for a long time because its time interval may intersect with many LHR tuples.
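The asymmetry between the two purge procedures can be sketched as follows, again under illustrative assumptions about the tuple layout (ES caches are ordered by the explicit attribute; TS caches hold (start, stop, payload) triples).

    def purge_cache_es(cache, current_key, key):
        """ES purge (sketch): the cache is ordered by the explicit attribute,
        so the scan stops at the first tuple of the current value packet."""
        keep = []
        for i, t in enumerate(cache):
            if key(t) == current_key:
                keep = cache[i:]                 # this and later tuples are still needed
                break
        cache[:] = keep

    def purge_cache_ts(cache, last_lhr_start):
        """TS purge (sketch): current value-packet tuples are spread across
        the cache, so the whole cache must be scanned."""
        cache[:] = [t for t in cache if t[1] > last_lhr_start]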
Fig. 9 Performance improvement of using a heap in ES

4.2.5 Using a heap
As stated in Sect. 4.2.1, the final step of sorting is done in parallel with the merging stage. Assuming the two relations are sorted in ascending order, in the merging stage the algorithm first has to find the smallest value from the multiple sorted runs of each relation and then compare the two values to see if they can be joined. The simplest way to find the smallest value is to scan the current value of each run; if the relation is divided into m runs, the cost of selecting the smallest value is O(m). A more efficient way is to use a heap to select the smallest value, at a cost of O(log2 m).

At the beginning of the merging step, the heap is built from the value of the first tuple in each run. Whenever advance is called, the run currently on top of the heap advances its reading pointer to the next tuple. Since the key value of this tuple is no less than that of the current tuple, it is propagated down to maintain the heap structure. When a run is backed up, its reading pointer is restored to point to a previously visited tuple, which has a smaller key value and thus is propagated up the heap.
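A minimal Python sketch of the heap-based selection follows, using the standard heapq module; real runs are disk-resident and support backing up, which the plain iterators used here do not.

    import heapq

    class RunMerger:
        """Select the smallest current tuple among m sorted runs in O(log2 m)
        time (sketch); each run is modeled as an iterator."""

        def __init__(self, runs, key):
            self.key = key
            self.heap = []
            for rid, run in enumerate(runs):     # build the heap from each run's first tuple
                it = iter(run)
                first = next(it, None)
                if first is not None:
                    heapq.heappush(self.heap, (key(first), rid, first, it))

        def advance(self):
            """Pop the overall smallest tuple and refill from the same run."""
            if not self.heap:
                return None
            _, rid, t, it = heapq.heappop(self.heap)
            nxt = next(it, None)
            if nxt is not None:                  # the successor sifts into place
                heapq.heappush(self.heap, (self.key(nxt), rid, nxt, it))
            return t

With m runs, each advance costs O(log2 m) comparisons instead of the O(m) scan of the simple approach.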
When the memory size is relatively small, which implies that the size of each run is small and therefore that a relation has to be divided into more runs (the number of runs m is large), the performance with a heap will be much better than that without a heap. However, using a heap incurs some pointer swaps when sifting a tuple down or propagating it up, which are not needed in the simple algorithm. When the memory size is sufficiently large, the performance with a heap will be close to, or even worse than, that of the simple algorithm.
Figure 9 shows the total CPU time of ES when using and not using a heap. The data used in Fig. 9 are two 64-MB relations, joined with different sizes of memory; note that the CPU time includes both the sorting step and the merging step. As expected, the performance with a heap is better than that without a heap when the memory is small: the improvement is roughly 40% when the memory size is 2 MB. The performance difference decreases as the memory increases, and when the memory size is greater than 32 MB, one half of the relation size, using a heap has no benefit. Since using a heap significantly improves performance when memory is relatively small and barely degrades it when memory is large, we use a heap in all sort-based algorithms.

4.2.6 GRACE and hybrid variants
We implemented both GRACE and hybrid versions of each sort-based algorithm. In the GRACE variants, all the sorted runs of a relation are written to disk before the merging stage. The hybrid variants keep most of the last run of the outer relation in memory, which guarantees that one (multiblock) disk read and one disk write of the memory-resident part are saved. When the available memory is only slightly smaller than the dataset, the hybrid algorithms thus require relatively fewer I/O operations.
4.2.7 Adapting the interval join
We consider the interval join a variant of the timestamp sort-merge algorithm (TS); in this paper, we call it TSI and its hybrid variant TSI-H. To be fair, we do not assume the input relations are sorted into one list. Instead, TSI begins with sorting as its first step and then combines the last step of the sort with the merge step. The two active lists are essentially two spooled caches, one for each relation; each cache has the same size as that in TS. This differs from the strategy of keeping a single block of each list in the original paper. A small cache can save more memory for the input buffer, thus reducing the random reads; however, it causes more cache spools when skew is present. Since timestamp algorithms tend to encounter skew, we chose a cache size the same as that in TS rather than one block.
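The merge step of the interval join can be sketched as follows; left and right are iterators over (start, stop, payload) tuples sorted by start time, the two Python lists stand in for the spooled caches, and the equijoin's explicit-attribute test is folded into the matches predicate. This is a sketch under our own assumptions, not the paper's TSI code.

    def interval_join_merge(left, right, matches, intersect):
        """Merge step of the interval join (TSI) on start-time-sorted streams
        (sketch).  matches(a, b) is the join predicate; intersect(a, b) is the
        intersection of the two valid-time intervals."""
        out, lcache, rcache = [], [], []
        li, ri = next(left, None), next(right, None)
        while li is not None or ri is not None:
            # Consume the stream whose next tuple starts earlier.
            if ri is None or (li is not None and li[0] <= ri[0]):
                t, li = li, next(left, None)
                rcache[:] = [c for c in rcache if c[1] >= t[0]]   # purge expired tuples
                out.extend((t, c, intersect(t, c)) for c in rcache if matches(t, c))
                lcache.append(t)
            else:
                t, ri = ri, next(right, None)
                lcache[:] = [c for c in lcache if c[1] >= t[0]]
                out.extend((c, t, intersect(c, t)) for c in lcache if matches(c, t))
                rcache.append(t)
        return out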
4.3 Partition-based algorithms
Several engineering considerations also arise when implementing the partition-based algorithms.

4.3.1 Partitioning details

The details of algorithm TP are described elsewhere [54]. We changed TP to use a slightly larger input buffer (32 KB) and a cache for the inner relation (also 32 KB) instead of using a one-page buffer and cache. The rest of the available main memory is used for the outer relation. There is a tradeoff between a large outer input buffer and a large inner input buffer and cache. A large outer input buffer implies a large partition size, which results in fewer seeks for both relations, but the cache is more likely to spool. On the other hand, allocating a large cache and a large inner input buffer results in a smaller outer input buffer and thus a smaller partition size, which increases random I/O. We chose 32 KB instead of 1 KB (the page size) as a compromise. The identification of the best cache size is given in Sect. 6 as a direction of future research.
The algorithms ETP and TEP partition the input relations in two steps. ETP partitions the relations by explicit attribute first. For each pair of buckets to be joined, if neither fits in memory, a further partition by timestamp attribute is applied to these buckets to increase the possibility that the resulting buckets do not overflow the available buffer space. TEP is similar to ETP except that it partitions the relations in the reverse order, first by timestamp and then, if necessary, by explicit attribute.
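A sketch of the two-step ETP idea for a single relation follows; the bucket counts, capacities, and the start-time-based second step are illustrative assumptions (in the full algorithm the same scheme is applied to both inputs, and a bucket pair is repartitioned only when neither bucket fits in memory).

    def etp_partition(relation, key, start_of, n_buckets,
                      mem_capacity, n_slices, lifespan):
        """Two-step ETP partitioning (sketch): hash on the explicit attribute,
        then repartition any overflowing bucket by timestamp."""
        buckets = [[] for _ in range(n_buckets)]
        for t in relation:                        # step 1: explicit-attribute hash
            buckets[hash(key(t)) % n_buckets].append(t)

        final = []
        width = max(1, lifespan // n_slices)
        for b in buckets:
            if len(b) <= mem_capacity:
                final.append(b)
            else:                                 # step 2: repartition by start time
                slices = [[] for _ in range(n_slices)]
                for t in b:
                    slices[min(start_of(t) // width, n_slices - 1)].append(t)
                final.extend(slices)
        return final

TEP would simply apply the same two steps in the opposite order.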
4.3.2 Joining the partitions
The partition-based algorithms perform their second phase, the joining of corresponding partitions of the outer and inner relations, as follows. The outer partition is fetched into memory, assuming that it will not overflow the available buffer space, and pointers to the outer tuples are sorted using an in-memory quicksort. The inner partition is then scanned, using all memory pages not occupied by the outer partition. For each inner tuple, matching outer tuples are found by binary search. If the outer partitions overflow the available buffer space, the algorithms default to an explicit-attribute sort-merge join of the corresponding partitions.
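The joining phase then reduces to the following pattern; this is a sketch, and the parameter names and the fallback hook are illustrative.

    import bisect

    def join_partition_pair(outer_part, inner_part, key, overlaps, intersect,
                            mem_capacity, sort_merge_fallback):
        """Join one pair of corresponding partitions (sketch)."""
        if len(outer_part) > mem_capacity:        # overflow: default to sort-merge
            return sort_merge_fallback(outer_part, inner_part)
        idx = sorted(range(len(outer_part)), key=lambda i: key(outer_part[i]))
        keys = [key(outer_part[i]) for i in idx]  # sorted pointers to outer tuples
        out = []
        for s in inner_part:                      # linear scan of the inner partition
            j = bisect.bisect_left(keys, key(s))  # binary search for matches
            while j < len(keys) and keys[j] == key(s):
                r = outer_part[idx[j]]
                if overlaps(r, s):
                    out.append((r, s, intersect(r, s)))
                j += 1
        return out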
4.3.3 GRACE and hybrid variants
In addition to the conventional GRACE algorithms, we implemented hybrid buffer management for each partition-based algorithm. In the hybrid algorithms, one outer bucket is designated as memory-resident, and its buffer space is increased accordingly to hold the whole bucket in memory. When the inner relation is partitioned, the inner tuples that map to the corresponding bucket are joined immediately with the tuples in the memory-resident bucket. This eliminates the I/O operations needed to write and read one bucket of tuples for both the inner and the outer relation. As with the hybrid sort-based algorithms, the hybrid partition-based algorithms are expected to perform better when the input relation is only slightly larger than the available memory.
4.4 Supporting the iterator interface
Most commercial systems implement the relational operators as iterators [18]. In this model, each operator is realized by three procedures called open, next, and close. The algorithms we investigate in this paper can be redesigned to support the iterator interface.
The nested-loop algorithm and the explicit partitioning algorithms are essentially the corresponding snapshot join algorithms, except that a supplemental predicate (requiring that the tuples overlap) and the calculation of the resulting timestamps are added in the next procedure.
The timestamp partitioning algorithms determine the periods for the partitions by sampling the outer relation and partition the input relations in the open procedure. The next procedure calls the next procedure of nested-loop join for each pair of partitions; an additional predicate is added in the next procedure to determine if a tuple should be put into the cache.

The sort-based algorithms generate the initial sorted runs for the input relations and merge runs until only the final merge step is left in the open procedure. In the next procedure, the inner runs, the cache, and the outer runs are scanned to find a match; at the same time, the inner tuple is examined to decide whether to put it in the cache. The close procedure destroys the input runs and deallocates the cache. The open and close procedures of the interval join algorithms are the same as those of the other sort-based algorithms; the next procedure gets the next tuple from the sorted runs, scans the cache to find the matching tuples, and purges the cache at the same time.
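In outline, any of the joins above can be wrapped in the iterator model as follows; the class and its hooks are a sketch of ours, not the actual TimeIT interface.

    class TemporalJoinIterator:
        """open/next/close wrapper for a temporal join operator (sketch)."""

        def __init__(self, setup, produce, teardown):
            self.setup, self.produce, self.teardown = setup, produce, teardown
            self.stream = None

        def open(self):
            # e.g., sort-based variants generate and pre-merge their runs here;
            # timestamp partitioning samples the outer relation and partitions.
            state = self.setup()
            self.stream = self.produce(state)     # a generator over result tuples

        def next(self):
            # Deliver one result tuple at a time; None signals exhaustion.
            return next(self.stream, None)

        def close(self):
            # Destroy runs and deallocate caches in a real implementation.
            self.teardown()
            self.stream = None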
5 Performance
We implemented all 19 algorithms enumerated in Table 2 and tested their performance under a variety of conditions, including skewed explicit and timestamp distributions, varying timestamp durations, memory allocations, and database sizes. We ensured that all algorithms generated exactly the same output tuples in all of the experiments (though the ordering of the tuples may differ).
The remainder of this section is organized as follows. We first give details on the join algorithms used in the experiments and then describe the parameters used in the experiments. Sections 5.2 to 5.9 contain the actual results of the experiments, and Sect. 5.10 summarizes the results.
5.1 Experimental setup
The experiments were developed and executed using the TimeIT [17] system, a software package supporting the prototyping of temporal database components. Using TimeIT, we fixed several parameters describing all test relations used in the experiments; these parameters and their values are shown in Table 4. In all experiments, tuples were 16 bytes long and consisted of two explicit attributes, both integers occupying 4 bytes each, and two integer timestamps, each also requiring 4 bytes. Only one of the explicit attributes was used as the joining attribute. This yields result tuples that are 24 bytes long, consisting of the explicit attributes of both input tuples (16 bytes) and 8 bytes for the timestamps.
We fixed the relation size at 64 MB, giving four million tuples per relation. We were less interested in the absolute relation size than in the ratio of input size to available main memory. Similarly, the ratio of the page size to the main memory size and the relation size is more relevant than the absolute page size; a scaling of these factors would provide similar results.
In all cases, the generated relations were randomly ordered with respect to both their explicit and timestamp attributes. The cost metrics used for all experiments are listed in Table 5.

Table 4 System characteristics

Tuples per relation           4 million
Timestamp size ([s,e])        8 bytes
Explicit attribute size       8 bytes
Relation lifespan             1,000,000 chronons
Output buffer size            32 KB
Cache size in sort-merge      64 KB
Cache size in partitioning    32 KB

Table 5 Cost metrics

Sequential I/O cost       1 ms
Random I/O cost           10 ms
Join attribute compare    20 ns
Timestamp compare         20 ns
Pointer compare           20 ns
Pointer swap              60 ns

In a modern computer system, a random disk access takes about 10 ms, whereas accessing a main memory location typically takes less than 60 ns [42]; it is reasonable to assume that a sequential I/O takes about one tenth the time of a random I/O. Modern computer systems usually have a hardware data cache, which reduces the CPU time on a cache hit. We therefore chose the join attribute compare time to be 20 ns, slightly less than, but of the same order of magnitude as, the memory access time; the cost metric we used is the average memory access time given a high cache hit ratio (>90%). The CPU cache may have a lower hit ratio when running some of the algorithms, but the order of magnitude of the memory access time will not change. We assumed that the sizes of both a timestamp and a pointer are the same as the size of an integer; thus their compare times equal that of the join attribute. A pointer swap takes three times as long as a pointer compare because it needs to access three pointers, and a tuple move takes four times as long as an integer compare since the size of a tuple is four times that of an integer.
We measured both main memory operations and disk I/O operations. To eliminate any undesired system effects from the results, all operations were counted using facilities provided by TimeIT. For disk operations, random and sequential accesses were measured separately. We included the cost of writing the output relation in the experiments since the sort-based and partition-based algorithms exhibit dual random and sequential I/O patterns when sorting/coalescing and partitioning/merging. The total time was then computed by weighting each operation count by the time values listed in Table 5.
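Concretely, the weighting amounts to a dot product of the operation counts with the Table 5 constants; a minimal sketch follows (the operation names and the counts mapping are illustrative, not TimeIT's actual interface).

    # Cost weights from Table 5, expressed in seconds.
    COSTS = {
        "seq_io":     1e-3,    # sequential I/O: 1 ms
        "rand_io":    10e-3,   # random I/O: 10 ms
        "compare":    20e-9,   # attribute/timestamp/pointer compare: 20 ns
        "ptr_swap":   60e-9,   # pointer swap: three pointer accesses
        "tuple_move": 80e-9,   # tuple move: four integer compares
    }

    def total_time(counts):
        """counts maps an operation name to how often it occurred."""
        return sum(COSTS[op] * n for op, n in counts.items())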
Table 6 summarizes the values of the system parameters that varied among the different experiments; each row of the table identifies the figures that illustrate the results of the experiments run with those parameters. The reader may have the impression that the intervals are so small that they behave almost like standard equijoin attributes: do tuples overlap each other at all? In many cases, we performed a self-join, which guaranteed for each tuple in one relation that there is at least one matching tuple in the other relation. Long-duration timestamps (100 chronons) were used in two experiments, guaranteeing that on average four tuples were valid in each chronon. Two other experiments examine the case where one relation has short-duration timestamps and the other has long-duration timestamps. Our experiments thus examined several different degrees of overlap.

5.2 Simple experiments
In this section, we perform three "base case" experiments, where the join selectivity is low, i.e., for an equijoin of valid-time relations r and s, a given tuple x ∈ r joins with one, or few, tuples y ∈ s. The experiments incorporate random data distributions in the explicit join attributes and short and long time intervals in the timestamp attributes.
5.2.1 Low explicit selectivity with short timestamps
In this experiment, we generated a relation with little explicit matching and little overlap and joined the relation with itself. This mimics a foreign key-primary key natural join in that the cardinality of the result is the same as that of one of the input relations. The relation size was fixed at 64 MB, corresponding to four million tuples. The explicit joining attribute values were integers drawn from a space of 2^31 − 1 values; for the given cardinality, a particular explicit attribute value appeared on average in only one tuple in the relation. The starting timestamp attribute values were randomly distributed over the relation lifespan, and the duration of the interval associated with each tuple was set to one chronon. We ran each of the 19 algorithms using the generated relation, increasing the main memory allocation from 2 MB, a 1:32 memory-to-input-size ratio, to 64 MB, a 1:1 ratio.
The results of the experiment are shown in Fig. 10. In each panel, the ordering of the legend corresponds to the order of either the rightmost or the leftmost points of each curve. The actual values of each curve in all the figures may be found in the appendix of the associated technical report [16]. Note that both the x-axis and the y-axis are log-scaled. As suspected, nested loop is clearly not competitive. The general nested-loop algorithm performs very poorly in all cases but the highest memory allocation; at the smallest memory allocation, the least expensive algorithm, EP, enjoys an 88% performance advantage. Only at the highest memory allocation, that is, when the entire left-hand side relation fits in main memory, does the nested-loop algorithm have performance comparable to the other algorithms. Given the disparity in performance, and given that various characteristics, such as skew or the presence of long-duration tuples, do not impact the performance of the nested-loop algorithm, we will not consider this algorithm in the remainder of this section.
To get a better picture of the performance of the remaining algorithms, we plot them separately in Fig. 11. From this figure on, we eliminate the noncompetitive nested loop. We group the algorithms that have similar performance and retain only a representative curve for each group in the figures. In this figure, TES-H and TSI-H have performance very similar to that of TS-H; ETS-H has performance very similar to that of ES-H; ETP-H, TP-H, and TEP-H have performance similar to that of EP-H; the remaining algorithms all have performance similar to that of EP.
Table 6 Experiment parameters

Fig. 10 Low explicit selectivity, low timestamp selectivity

In this graph, only the x-axis is log-scaled. The sort-based and partition-based algorithms exhibit largely the same performance, and the hybrid algorithms outperform their GRACE counterparts at high memory allocations, in this case when the ratio of main memory to input size reaches approximately 1:8 (2 MB of main memory) or 1:4 (4 MB of main memory). The poor performance of the hybrid algorithms at lower allocations stems from reserving buffer space to hold the resident run/partition, which takes buffer space away from the remaining runs/partitions, causing the algorithms to incur more random I/O. At small memory allocations, the problem is acute. Therefore, the hybrid group starts from a higher position and ends in a lower position, while the GRACE group behaves in the opposite way.
The performance differences between the sort-based algorithms and their partitioning counterparts are small, and there is no absolute winner. TES, the sort-merge algorithm that sorts the input relation primarily by start time and secondarily by explicit attribute, performs slightly worse than TS, which sorts the input relation by start time only. Since the order of the start times is not the order of the time intervals, the extra sorting by explicit attribute does not help in the merging step; the program logic is the same as for TS, except for the extra sorting. We expect TES to always perform a little worse than TS. Therefore, neither TES nor TES-H will be considered in the remainder of this section.
5.2.2 Long-duration timestamps
In the experiment described in the previous section, the join selectivity was low since explicit attribute values were shared among few tuples and tuples were timestamped with intervals of short duration. We repeated the experiment using long-duration timestamps. The duration of each tuple timestamp was fixed at 100 chronons, and the starting timestamps were randomly distributed throughout the relation lifespan. As before, the explicit join attribute values were randomly distributed integers; thus the size of the result was just slightly larger due to the long-duration timestamps.

Fig. 11 Low explicit selectivity, low timestamp selectivity (without NL)

Fig. 12 Low explicit selectivity (long-duration timestamps)
The results are shown in Fig. 12, where the x-axis is log-scaled. In this figure, the group of ES-H and ETS-H is represented by ES-H; the group of ETP-H, EP-H, TEP-H, and TP-H by TP-H; the group of TP, TEP, ES, ETS, EP, and ETP by ES; and the rest are retained. The timestamp sorting algorithms, TS and TS-H, suffer badly. Here, the long duration of the tuple lifespans did not cause overflow of the tuple cache used in these algorithms. To see this, recall that our input relation cardinality was four million tuples. For a 1,000,000-chronon relation lifespan, this implies that 4,000,000/1,000,000 = 4 tuples arrive per chronon. Since tuple lifespans were fixed at 100 chronons, it follows that 4 × 100 = 400 tuples must be scanned before any purging of the tuple cache can occur. However, a 64-KB tuple cache, capable of holding 4000 tuples, does not tend to overflow; detailed examination verified that the cache never overflowed in these experiments. The poor performance of TS and TS-H is instead caused by the repeated in-memory processing of the long-lived tuples.
TSI and TSI-H also suffer in the case of long durations but fare better than TS and TS-H when the main memory size is small. TSI improves on the performance of TS by 32% at the smallest memory allocation, while TSI-H improves on TS-H by 13%. Our detailed results show that TS had slightly less I/O time than TSI; TS also saved some time in tuple moving since it did not move every tuple into the cache. However, it spent much more time in timestamp comparing and pointer moving. In TSI, each tuple joined only with the tuples in the cache of the other relation, and the caches in TSI were purged during the join process; thus the number of timestamp comparisons needed for the next tuple was reduced. In TS, an outer tuple joined with both cache tuples and tuples in the input buffer of the inner relation, and the input buffer was never purged; therefore, TS had to compare more timestamps. Pointer moving is needed in the heap maintenance, which is used to sort the current tuples in each run. TS frequently backed up the inner runs inside the inner buffer and scanned the tuples in the value packets multiple times; in each scan, the heap for the inner runs had to sort the current inner tuples again. In TSI, the tuples are sorted once and kept in order in the caches, so the heap overhead is small. When the main memory size is small, the number of runs is large, as are the heap size and the heap overhead.
The timestamp partitioning algorithms, TP and TP-H, have performance very similar to that described in Sect. 5.2.1. There are two main causes of the good performance of TP and TP-H. The first is that TP does not replicate long-lived tuples that overlap with multiple partition intervals; otherwise, TP would need more I/O for the replicated tuples. The second is that TP sorts each partition by the explicit attribute, so the long durations do not have any effect on the performance of the in-memory joining. All the other algorithms sort or partition the relations by explicit attributes; therefore, their performance is not affected by the long durations.
We may conclude from this experiment that the timestamp sort-based algorithms are quite sensitive to the durations of the input tuple intervals. When tuple durations are long, the in-memory join in TS and TS-H performs poorly due to the need to repeatedly back up the tuple pointers.
5.2.3 Short- and long-duration timestamps
In the experiments described in the previous two sections, the timestamps are either short or long for both relations. We examined the case where the durations for the two input relations differ. The duration of each tuple timestamp in the outer relation was fixed at 1 chronon, while that in the inner relation was fixed at 100 chronons. We carefully generated the two relations so that the outer relation and the inner relation had a one-to-one relationship: for each tuple in the outer relation, there is one tuple in the inner relation that has the same value of the explicit attributes and the same start time as the outer tuple, but with a long duration instead of a short one. This guaranteed that the selectivity was between that of the two previous experiments. As before, the explicit join attribute values and the start times were randomly distributed.
Fig. 13 Low explicit selectivity (short-duration timestamps join long-duration timestamps)

The results are shown in Fig. 13, where the x-axis is log-scaled. The groups of the curves are the same as in Fig. 12, and the relative positions of the curves are similar to those in the long-duration experiment. The performance of the timestamp sorting algorithms was again worse than that of the others, but better than in the experiment where long-duration tuples were in both input relations. Long-duration tuples reduce the size of the value packets for each tuple on only one side and therefore result in fewer timestamp comparisons in all four timestamp sorting algorithms and fewer backups in TS and TS-H.

We also exchanged the outer and inner relations for this experiment and observed results identical to those in Fig. 13. This indicates that whether the long-duration tuples exist in the outer relation or the inner relation has little impact on the performance of any algorithm.
5.3 Varying relation sizes
It has been shown for snapshot join algorithms that the relative sizes of the input relations can greatly affect which sort- or partition-based strategy is best [18]. We investigated this phenomenon in the context of valid-time databases.
We generated a series of relations, increasing in size from 4 MB to 64 MB, and joined them with a 64-MB relation. The memory allocation used in all trials was 16 MB, the size at which all algorithms performed most closely in Fig. 11. As in the previous experiments, the explicit join attribute values in all relations were randomly distributed integers. Short-duration timestamps were used to mitigate the in-memory effects on TS and TS-H seen in Fig. 12; as before, starting timestamps were randomly distributed over the relation lifespan. Since the nested-loop algorithm is expected to be a competitor when one of the relations fits in memory, we incorporated this algorithm into this experiment. The results of the experiment are shown in Fig. 14. In this figure, ES represents all the GRACE sorting algorithms, ES-H all the hybrid sorting algorithms, EP all the GRACE partitioning algorithms, TP-H the hybrid timestamp partitioning algorithms, and EP-H the hybrid explicit partitioning algorithms; NL is retained.
Fig. 14 Different relation sizes (short-duration timestamps)

The impact of a differential in relation sizes on the partition-based algorithms is clear. When an input relation is small relative to the available main memory, the partition-based algorithms use this relation as the outer relation and build an in-memory partition table from it. The inner relation is then linearly scanned, and for each inner tuple the in-memory partition table is probed for matching outer tuples. The benefit of this approach is that each relation is read only once, i.e., no intermediate writing and reading of generated partitions occurs. Indeed, the inner relation is not partitioned at all, further reducing main memory costs in addition to the I/O savings.

The nested-loop algorithm has the same I/O costs as the partition-based algorithms when one of the input relations fits in the main memory. When the size of the smaller input relation is twice as large as the memory size, the performance of the nested-loop algorithm is worse than that of any other algorithm; this is consistent with the results shown in Fig. 10.

An important point to note is that this strategy is beneficial regardless of the distribution of either the explicit join attributes and/or the timestamp attributes, i.e., it is unaffected by either explicit or timestamp skew. Furthermore, no similar optimization is available for sort-based algorithms: since each input relation must be sorted, both relations must be read and written once to generate sorted runs and subsequently read once to scan and match joining tuples.
To further investigate the effectiveness of this strategy, we repeated the experiment of Fig. 14 with long-duration timestamps, i.e., tuples were timestamped with intervals 100 chronons in duration. We did not include the nested-loop algorithm because we did not expect the long-duration tuples to have any impact on it. The results are shown in Fig. 15. The grouping of the curves in this figure is slightly different from that in Fig. 14 in that the timestamp sorting algorithms are separated instead of grouped together.

As expected, long-duration timestamps adversely affect the performance of all the timestamp sorting algorithms, for the reasons stated in Sect. 5.2.2. The performance of TSI and TSI-H is slightly better than that of TS and TS-H, respectively; this is consistent with the results at the 16-MB memory size in Fig. 12. Replotting the remaining algorithms in Fig. 16 shows that the long-duration timestamps do not significantly impact the efficiency of the other algorithms.
In both the short-duration and the long-duration cases, the hybrid partitioning algorithms show the best performance. They save about half of the I/O operations of their GRACE counterparts when the size of the outer relation is 16 MB. This is due to the hybrid strategy.
Fig. 15 Different relation sizes (long-duration timestamps)

Fig. 16 Different relation sizes (long-duration timestamps, without TS/TS-H)
We further changed the input relations so that the tuples in the outer relation have a fixed short duration of 1 chronon and those in the inner relation have a fixed long duration of 100 chronons; the other features of the input relations remain the same. The results, shown in Fig. 17, are very similar to those of the long-duration case, though the performance of the timestamp sorting algorithms is slightly better than in Fig. 15. Again, we regenerated the relations such that the tuples in the outer relation have the long duration fixed at 100 chronons and those in the inner relation have the short duration fixed at 1 chronon; the results are almost identical to those shown in Fig. 17.
The graph shows that partition-based algorithms should be chosen whenever the size of one or both of the input relations is small relative to the available buffer space. We conjecture that the choice between explicit partitioning and timestamp partitioning is largely dependent on the presence or absence of skew in the explicit and/or timestamp attributes. Explicit and timestamp skew may or may not increase I/O cost; however, they will increase main memory searching costs for the corresponding algorithms, as we now investigate.
Fig. 17 Different relation sizes (short- and long-duration timestamps)
5.4 Explicit attribute skew
As in the experiments described in Sect. 5.3, we fixed the main memory allocation at 16 MB to place all algorithms on a nearly even footing. The inner and outer relation sizes were fixed at 64 MB each. We generated a series of outer relations with increasing explicit attribute skew, from 0% to 100% in 20% increments. Here we generated data with chunky skew: an explicit attribute has, e.g., 20% chunky skew when 20% of the tuples in the relation have the same explicit attribute value. Explicit skew was ensured by generating tuples with the same explicit join attribute value. Short-duration timestamps, randomly distributed over the relation lifespan, were used to mitigate the long-duration timestamp effect on the timestamp sorting algorithms. The results are shown in Fig. 18. In this figure, TSI, TS, TEP, and TP are represented by TS and their hybrid counterparts by TS-H; the other algorithms are retained.
There are three points to emphasize in this graph. First, the explicit partitioning algorithms, i.e., EP, EP-H, ETP, and ETP-H, show increasing costs as the explicit skew increases. The performance of EP and EP-H degrades dramatically with increasing explicit skew. This is due to the overflowing of main memory partitions, causing subsequent buffer thrashing. The effect, while pronounced, is relatively small since only one of the input relations is skewed; encountering skew in both relations would exaggerate the effect. Although the performance of ETP and ETP-H also degrades, the changes are much less pronounced because these algorithms employ time partitioning to reduce the effect of explicit attribute skew.
As expected, the group of algorithms that perform sorting or partitioning on timestamps, i.e., TS, TS-H, TP, TP-H, TEP, and TEP-H, have relatively flat performance. By ordering or partitioning on time, these algorithms avoid effects due to the explicit attribute distribution.

The explicit sorting algorithms, ES, ES-H, ETS, and ETS-H, perform very well. In fact, the performance of ES and ES-H improves as the skew increases: as the skew increases, the relations become increasingly sorted by default, so ES and ES-H expend less effort during run generation.

Fig. 18 (x-axis: explicit skew in outer relation, percentage)

We conclude from this experiment that if high explicit skew is present in one input relation, then explicit sorting, timestamp partitioning, and timestamp sorting appear to be the better alternatives. The choice among these is then dependent on the distribution and the length of the tuple timestamps, which can increase the amount of timestamp skew present in the input, as we will see in the next experiment.
5.5 Timestamp skew
Like the explicit attribute distribution, the distribution of timestamp attribute values can greatly impact the efficiency of the different algorithms. We now describe a study of this effect.
As in the experiments described in Sect. 5.3, we fixed the main memory allocation at 16 MB and the sizes of all input relations at 64 MB. We fixed one relation with randomly distributed explicit attributes and randomly distributed tuple timestamps, and we generated a series of relations with increasing timestamp attribute chunky skew, from 0% to 100% in 20% increments. A timestamp attribute has, e.g., 20% chunky skew when 20% of the tuples in the relation fall in one value packet. The skew was created by generating tuples with the same interval timestamp. Short-duration timestamps were used in all relations to mitigate the long-duration timestamp effect on the timestamp sorting algorithms, and explicit join attribute values were distributed randomly. The results of the experiment are shown in Fig. 19. In this figure, all the GRACE explicit algorithms are represented by EP, the hybrid explicit sorting algorithms by ES-H, and the hybrid explicit partitioning algorithms by EP-H; the remaining algorithms are retained.
Four interesting observations may be made. First, as expected, the timestamp partitioning algorithms, i.e., TP, TEP, TP-H, and TEP-H, suffered increasingly poorer performance as the amount of timestamp skew increased; this skew causes overflowing partitions. The performance of all four of these algorithms is good when the skew is 100% because TP and TP-H then become explicit sort-merge joins and TEP and TEP-H become explicit partition joins. Second, TSI and TSI-H also exhibited poor performance as the timestamp skew increased because 20% skew in the outer relation caused the outer cache to overflow. Third, TS and TS-H show improved performance at the highest skew percentage; this is due to the sortedness of the input, analogous to the behavior of ES and ES-H in the previous experiment. Finally, as expected, the remaining algorithms have flat performance across all trials.
When timestamp skew is present, timestamp partitioning is a poor choice. We expected this result, as it is analogous to the behavior of partition-based algorithms in conventional databases, and similar results have been reported for temporal coalescing. The interval join algorithms are also bad choices when the amount of timestamp skew is large; a small amount of timestamp skew can be handled efficiently by increasing the cache size in the interval join algorithms. We will discuss this issue again in Sect. 5.8. Therefore, the two main dangers to good performance are explicit attribute skew and/or timestamp attribute skew. We investigate the effects of simultaneous skew next.
5.6 Combined explicit/timestamp attribute skew
Again, we fixed the main memory allocation at 16 MB and set the input relation sizes at 64 MB. Timestamp durations were set to 1 chronon to mitigate the long-duration timestamp effect on the timestamp sorting algorithms. We then generated a series of relations with increasing explicit and timestamp chunky skew, from 0% to 100% in 20% increments. Skew was created by generating tuples with the same explicit joining attribute value and tuple timestamp; the explicit skew and the timestamp skew are orthogonal.

Fig. 19 (x-axis: timestamp skew in outer relation, percentage)

Fig. 20 (x-axis: explicit and timestamp skew in outer relation, percentage)

The results are shown in Fig. 20. In this figure, ETS, ES, and TS are represented by ES; ETS-H, ES-H, and TS-H by ES-H; the other algorithms are retained.
The algorithms are divided into three groups in terms of performance. As expected, most of the partition-based algorithms and the interval join algorithms, i.e., TEP, TEP-H, TP, TP-H, EP, EP-H, TSI, and TSI-H, show increasingly poorer performance as the explicit and timestamp skew increases. The remaining explicit/timestamp sorting algorithms show relatively flat performance across all trials, and the explicit sorting and timestamp sorting algorithms exhibit improving performance as the skew increases, analogous to their performance in the experiments described in Sects. 5.4 and 5.5. While the elapsed time of ETP and ETP-H increases slowly with increasing skew, these two algorithms perform very well; this is analogous to their performance in the experiments described in Sect. 5.4.
5.7 Explicit attribute skew in both relations
In previous work [30], we studied the effect of data skew on the performance of sort-merge joins. There are three types of skew: outer relation skew, inner relation skew, and dual skew. Outer skew occurs when value packets in the outer relation cross buffer boundaries. Similarly, inner skew occurs when value packets in the inner relation cross buffer boundaries. Dual skew indicates that outer skew occurs in conjunction with inner skew. While outer skew does not cause any problems for TS and TS-H, it degrades the performance of TSI and TSI-H; dual skew degrades the performance of the TS and TS-H joins. In this section, we compare the performance of the join algorithms in the presence of dual skew in the explicit attribute.

Fig. 21 (x-axis: explicit skew in both relations, percentage)

Fig. 22 Explicit attribute skew in both relations
The main memory allocation was fixed at 16 MB and the size of all input relations at 64 MB. We generated a series of relations with increasing explicit attribute chunky skew, from 0% to 4% in 1% increments. To ensure dual skew, we performed a self-join on these relations. Short-duration timestamps, randomly distributed over the relation lifespan, were used to mitigate the long-duration timestamp effect on the timestamp sorting algorithms. The results are shown in Fig. 21. In this figure, all the explicit partitioning algorithms are represented by EP, all the timestamp partitioning algorithms by TP, and all the sort-merge algorithms except ES and ES-H by TS; ES and ES-H are retained.
There are three points to discuss regarding this graph. First, the explicit algorithms, i.e., ES, ES-H, EP, EP-H, ETP, and ETP-H, suffer when the skew increases. Although the numbers of I/O operations of these algorithms increase along with the increasing skew, the I/O-incurred difference between the highest and the lowest skew is only 2 s, and the difference in the output relation size between the highest and the lowest skew is only 460 KB, which leads to about a 4.6-s performance difference. What, then, is the real reason for the performance hit of these algorithms? Detailed examination revealed that it is the in-memory operations that cause the poor performance: when data skew is present, these algorithms have to do substantial in-memory work to perform the join. This is illustrated in Fig. 22, which shows the CPU time used by each algorithm. To present the difference clearly, we do not use a log-scale y-axis. Note that six algorithms, i.e., ETS, TSI, TS, ETS-H, TSI-H, and TS-H, have very low CPU cost (less than 30 s) in all cases, so their performance does not degrade when the degree of skew increases.

Second, the performance of the timestamp partitioning algorithms, i.e., TP, TP-H, TEP, and TEP-H, degrades with increasing skew, but not as badly as that of the explicit algorithms. Although the timestamp partitioning algorithms sort each partition by the explicit attribute, the explicit attribute inside each partition is not highly skewed. For example, if n tuples have the same value of the explicit attribute, they will be put into one partition after being hashed in EP; in the join phase, there will be an n × n loop within the join. In TP, this value packet will be distributed evenly across partitions. Assuming there are m partitions, each partition will have n/m of these tuples, which leads to an (n/m)² loop within the join per partition. Summed over the m partitions, the total number of join operations in TP is n²/m, which is 1/m of that of EP. This factor can be seen in Fig. 22.
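In symbols, with a value packet of n tuples spread evenly over m partitions, the counting argument reads:

    \underbrace{n \cdot n}_{\text{EP: one partition}} = n^2,
    \qquad
    \underbrace{m \cdot \left(\frac{n}{m}\right)^{2}}_{\text{TP: } m\ \text{partitions}}
      = \frac{n^2}{m} = \frac{1}{m} \cdot n^2 .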
Finally, the timestamp sorting algorithms, i.e., TS, TS-H, TSI, TSI-H, ETS, and ETS-H, perform very well under explicit skew. TS and TS-H use the timestamp only to determine if a backup is needed; TSI and TSI-H use the timestamp only to determine if the cached tuples should be removed. We see the benefit of the secondary sorting on the timestamp in the algorithms ETS and ETS-H: since these two algorithms define the value packet by both the explicit attribute and the timestamp, the big loop in the join phase is avoided.

From this experiment, we conclude that when explicit dual skew is present, all the explicit algorithms are poor choices except for ETS and ETS-H. The effects of timestamp dual skew are examined next.
5.8 Timestamp dual skew
Like explicit dual skew, timestamp dual skew can affect the performance of the timestamp sort-merge join algorithms. We now examine this effect.

We fixed main memory at 16 MB and the input relations at 64 MB. We generated a series of relations with increasing timestamp chunky skew, from 0% to 4% in 1% increments. To ensure dual skew, we performed a self-join on these relations. Short-duration timestamps, randomly distributed over the relation lifespan, were used to mitigate the long-duration timestamp effect on the timestamp sorting algorithms. The explicit attribute values were also distributed randomly. The results are shown in Fig. 23. In this figure, the GRACE explicit sort-merge algorithms are represented by ES; all the hybrid partitioning algorithms by EP-H; TSI and TSI-H by TSI; and TEP, TP, ETP, EP, ETS-H, and ES-H by EP; the remaining algorithms are retained.

Fig. 23 (x-axis: timestamp skew in both relations, percentage)

Fig. 24 Timestamp attribute skew in both relations
The algorithms fall into three groups. All the timestamp sort-merge algorithms exhibit poor performance; however, the performance of TS and TS-H is much better than that of TSI and TSI-H. At the highest skew, the performance of TS is 174 times better than that of TSI. This is due to cache overflow in TSI: one percent of 64 MB is 640 KB, which is ten times the cache size. The interval join algorithm scans and purges the cache once for every tuple to be joined, and cache thrashing occurs when the cache overflows. As before, there is no cache overflow in TS and TS-H. The performance gap between these two algorithms and the group with flat curves is caused by in-memory join operations. The CPU time used by each algorithm is plotted separately in Fig. 24. In this figure, all the explicit sort-merge algorithms are represented by ES, all the explicit partitioning algorithms by EP, all the timestamp partitioning algorithms by TP, and TSI and TSI-H by TSI; the remaining algorithms are retained. Since all but the timestamp sort-merge algorithms perform the in-memory join by sorting the relations or the partitions on the explicit attribute, their performance is not at all affected by dual skew.
Fig. 25 Explicit/timestamp attribute skew in both relations
It is interesting that the CPU time spent by TSI is less than that spent by TS. The poor overall performance of TSI due to cache overflow can be improved by increasing the cache size of TSI. TSI actually performs the join operation in the two caches rather than in the input buffers; therefore, a large cache size can be chosen when dual skew is present to avoid cache thrashing. In this case, a 1-MB cache size for TSI would result in performance similar to that of TS.
5.9 Explicit/timestamp dual skew
In this section, we investigate the simultaneous effect of dual skew in both the explicit attribute and the timestamp. This is a challenging situation for any temporal join algorithm.

The memory size is 16 MB, and we generated a series of 64-MB relations with increasing explicit and timestamp chunky skew, from 0% to 4% in 1% increments. Dual skew was guaranteed by performing a self-join on these relations. The results are shown in Fig. 25. In this figure, TSI and TSI-H are represented by TSI; TS and ES by ES; TS-H and ES-H by TS-H; all the explicit partitioning algorithms by EP; and the remaining algorithms by TP.
The interesting point is that all the algorithms are affected by the simultaneous dual skew in both the explicit and timestamp attributes, but they fall into two groups. The algorithms that are sensitive to dual skew in either the explicit attribute or the timestamp attribute perform as badly as they do in the experiments described in Sects. 5.7 and 5.8. The performance of the algorithms not affected by dual skew in either the explicit attribute or the timestamp attribute degrades with increasing skew; however, their performance is better than that of the algorithms in the first group. This is due to the orthogonality of the explicit skew and the timestamp skew.

5.10 Summary
The performance study described in this section is the first comprehensive empirical analysis of temporal join algorithms. We investigated the performance of 19 non-index-based join algorithms for the temporal equijoin, namely, nested loop (NL), explicit partitioning (EP and EP-H), explicit sorting (ES and ES-H), timestamp sorting (TS and TS-H), interval join (TSI and TSI-H), timestamp partitioning (TP and TP-H), combined explicit/timestamp sorting (ETS and ETS-H) and timestamp/explicit sorting (TES and TES-H), and combined explicit/timestamp partitioning (ETP and ETP-H) and timestamp/explicit partitioning (TEP and TEP-H). We varied the following main aspects in the experiments: the presence of long-duration timestamps, the relative sizes of the input relations, and the explicit-join and timestamp attribute distributions.

The findings of this empirical analysis can be summarized as follows.
• The algorithms need to be engineered well to avoid performance hits. Care needs to be taken in sorting, in purging the cache, in selecting the next tuple in the merge step, in allocating memory, and in handling intrinsic skew.
• Nested loop is not competitive.
• The timestamp sorting algorithms, TS, TS-H, TES, TES-H, TSI, and TSI-H, were also not competitive. They were quite sensitive to the duration of the input tuple timestamps, and TSI and TSI-H had very poor performance in the presence of large amounts of skew due to cache overflow.
• The GRACE variants were competitive only when there was low selectivity and a large memory size relative to the size of the input relations. In all other cases, the hybrid variants performed better.
• In the absence of explicit and timestamp skew, our results parallel those from conventional query evaluation. In particular, when attribute distributions are random, all sorting and partitioning algorithms (other than those already eliminated as noncompetitive) have nearly equivalent performance, irrespective of the particular attribute type used for sorting or partitioning.
• In contrast with previous results on temporal coalescing [5], the binary nature of the valid-time equijoin allows an important optimization for partition-based algorithms: when one input relation is small relative to the available main memory buffer space, the partitioning algorithms have uniformly better performance than their sort-based counterparts.
• The choice of timestamp or explicit partitioning depends on the presence or absence of skew in either attribute dimension. Interestingly, the performance differences are dominated by main memory effects; the timestamp partitioning algorithms were less affected by increasing skew.
• ES and ES-H were sensitive to explicit dual skew.
• The performance of the partition-based algorithms EP and EP-H was affected by both outer and dual explicit attribute skew.
• The performance of TP and TP-H degraded when outer skew was present. Except for this one situation, these partition-based algorithms are generally more efficient than their sort-based counterparts since sorting, and the associated main memory operations, are avoided.
• It is interesting that the combined explicit/timestamp-based algorithms can mitigate the effect of either explicit attribute skew or timestamp skew. However, when dual skew was present in the explicit attribute and the timestamp simultaneously, the performance of all the algorithms degraded, though again less so for timestamp partitioning.

6 Conclusions and research directions
As a prelude to investigating non-index-based temporal join evaluation, this paper initially surveyed previous work, first describing the different temporal join operations proposed in the past and then describing the join algorithms proposed in previous work. The paper then developed evaluation strategies for the valid-time equijoin and compared these strategies in a sequence of empirical performance studies. The specific contributions are as follows.
• We defined a taxonomy of all temporal join operators proposed in previous research. The taxonomy is a natural one in the sense that it classifies the temporal join operators as extensions of conventional operators, irrespective of special joining attributes or other model-specific restrictions. The taxonomy is thus model independent and assigns a name to each temporal operator consistent with its extension of a conventional operator.
• We extended the three main paradigms of query evaluation algorithms to temporal databases, thereby defining the space of possible temporal evaluation algorithms.
• Using the taxonomy of temporal join algorithms, we defined 19 temporal equijoin algorithms, representing the space of all such possible algorithms, and placed all existing work into this framework.
• We defined the space of database parameters that affect the performance of the various join algorithms. This space is characterized by the distribution of the explicit and timestamp attributes in the input relations, the duration of timestamps in the input relations, the amount of main memory available to the join algorithm, the relative sizes of the input relations, and the amount of dual attribute and/or timestamp skew for each of the relations.
• We empirically compared the performance of the algorithms over this parameter space.
Our empirical study showed that some algorithms can be eliminated from further consideration: NL, TS, TS-H, TES, TES-H, ES, ES-H, EP, and EP-H. Hybrid variants generally dominated GRACE variants, eliminating ETP, TEP, and TP. When the relation sizes were different, explicit sorting (ETS, ETS-H, ES, ES-H) performed poorly.
This leaves three algorithms, all partitioning ones: ETP-H, TEP-H, and TP-H. Each dominates the other two in certain circumstances, but TP-H performs poorly in the presence of timestamp and attribute skew and is significantly more complicated to implement. Of the other two, ETP-H came out ahead more often than TEP-H. Thus we recommend ETP-H, a hybrid variant of explicit partitioning that partitions primarily by the explicit attribute. If this attribute is skewed so that some buckets do not fit in memory, a further partition on the timestamp attribute increases the possibility that the resulting buckets will fit in the available buffer space.
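To make the recommendation concrete, the following Python sketch illustrates the two-level bucket-splitting idea behind ETP-H. It is a simplified illustration rather than the implementation evaluated in this paper; the names (partition_etp_h, bucket_capacity) are ours, and a full implementation would also have to replicate tuples whose validity intervals cross timestamp-range boundaries.

# A minimal sketch of the bucket splitting behind ETP-H (hypothetical
# names). Tuples are (explicit_attr, start, end, payload).
def partition_etp_h(relation, num_buckets, bucket_capacity):
    # Level 1: hash-partition on the explicit join attribute.
    buckets = [[] for _ in range(num_buckets)]
    for tup in relation:
        buckets[hash(tup[0]) % num_buckets].append(tup)
    partitions = []
    for bucket in buckets:
        if len(bucket) <= bucket_capacity:
            partitions.append(bucket)        # fits in the buffer: keep as is
        else:
            # Level 2: explicit-attribute skew; split the oversized bucket
            # further by ranges of the start timestamp.
            bucket.sort(key=lambda t: t[1])
            partitions.extend(bucket[i:i + bucket_capacity]
                              for i in range(0, len(bucket), bucket_capacity))
    return partitions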
The salient point of this study is that simple modifications to an existing conventional evaluation algorithm (EP) can be used to effect temporal joins with acceptable performance and at relatively small development cost. While novel algorithms (such as TP-H) may have better performance in certain circumstances, well-understood technology can be easily adapted and will perform acceptably in many situations. Hence database vendors wishing to implement temporal join may do so at a relatively low development cost and still achieve acceptable performance.
The above conclusion focuses on independent join operations rather than a query consisting of several algebraic operations. Given the correlation between various operations, the latter is more complex. For example, one advantage of sort-merge algorithms is that the output is also sorted, which can be exploited in subsequent operations. This interesting order is used in traditional query optimization to reduce the cost of the whole query. We believe temporal query optimization can also take advantage of this [50]. Among the sort-merge algorithms we have examined, the output of the explicit algorithms (ES, ES-H, ETS, ETS-H) is sorted by the explicit join attribute; the interval join algorithms produce output sorted by the start timestamp. Of these six algorithms, we recommend ETS-H due to its higher efficiency.
Several directions for future work exist. Important problems remain to be addressed in temporal query processing, in particular with respect to temporal query optimization. While several researchers have investigated algebraic query optimization, little research has appeared with respect to cost-based temporal query optimization.
In relation to query evaluation, additional investigation of the algorithm space described in Sect. 5 is needed. Many optimizations originally developed for conventional databases, such as read-ahead and write-behind buffering, forecasting, eager and lazy evaluation, and hash filtering, should be applied and investigated. Cache size and input buffer allocation tuning is also an interesting issue.
All of our partitioning algorithms generate maximal partitions, of the main-memory size minus a few blocks, for the left-hand relation of the join and then apply that partitioning to the right-hand relation. In the join step, a full left-hand partition is brought into main memory and joined with successive blocks from the associated right-hand partition. Sitzmann and Stuckey term this a static buffer allocation strategy and instead advocate a dynamic buffer allocation strategy in which the left-hand and right-hand relations are partitioned in one step, so that two partitions, one from each relation, can simultaneously fit in the main memory buffer [49]. The advantage over the static strategy is that fewer seeks are required to read the right-hand side partition; the disadvantage is that this strategy results in smaller, and thus more numerous, partitions, which increases the number of seeks, and requires that the right-hand side also be sampled, which further increases the number of seeks. It might be useful to augment the timestamp partitioning to incorporate dynamic buffer allocation, though it is not clear at the outset that this would yield a performance benefit over our TP-H algorithm or over ETP-H.
Dynamic buffer allocation for conventional joins was first proposed by Harris and Ramamohanarao [22]. They built cost models for nested-loop and hash join algorithms with the size of the buffers as one of the parameters. Then, for each algorithm, they computed the optimal, or suboptimal but still good, buffer allocation that led to the minimum join cost. Finally, the optimal buffer allocation was used to perform the join. It would be interesting to see if this strategy can improve the performance of temporal joins. It would also be useful to develop cost models for the most promising temporal join algorithm(s), starting with ETP-H.
The next logical progression in future work is to extend this work to index-based temporal joins, again investigating the effectiveness of both explicit attribute indexing and timestamp indexing. While a large number of timestamp indexes have been proposed in the literature [44] and there has been some work on temporal joins that use temporal or spatial indexes [13,33,52,56], a comprehensive empirical comparison of these algorithms is needed.
Orthogonally, more sophisticated techniques for temporal database implementation should be considered. In particular, we expect specialized temporal database architectures to have a significant impact on query processing efficiency. It has been argued in previous work that incremental query evaluation is especially appropriate for temporal databases [24,34,41]. In this approach, a query result is materialized and stored back into the database if it is anticipated that the same query, or one similar to it, will be issued in the future. Updates to the contributing relations trigger corresponding updates to the stored result. The related topic of global query optimization, which attempts to exploit commonality between multiple queries when formulating a query execution plan, also has yet to be explored in a temporal setting.
Acknowledgements. This work was sponsored in part by National Science Foundation Grants IIS-0100436, CDA-9500991, EAE-0080123, IRI-9632569, and IIS-9817798, by the NSF Research Infrastructure Program Grants EIA-0080123 and EIA-9500991, by the Danish National Centre for IT-Research, and by grants from Amazon.com, the Boeing Corporation, and the Nykredit Corporation. We also thank Wei Li and Joseph Dunn for their help in implementing the temporal join algorithms.
References

3 Beckmann N, Kriegel HP, Schneider R, Seeger B (1990) The R*-tree: an efficient and robust access method for points and rectangles. In: Proceedings of the ACM SIGMOD conference, Atlantic City, NJ, 23–25 May 1990, pp 322–331
4 van den Bercken J, Seeger B (1996) Query processing techniques for multiversion access methods. In: Proceedings of the international conference on very large databases, Mumbai (Bombay), India, 3–6 September 1996, pp 168–179
5 Böhlen MH, Snodgrass RT, Soo MD (1997) Temporal coalescing. In: Proceedings of the international conference on very large databases, Athens, Greece, 25–29 August 1997, pp 180–191
6 Clifford J, Croker A (1987) The historical relational data model (HRDM) and algebra based on lifespans. In: Proceedings of the international conference on data engineering, Los Angeles, 3–5 February 1987, pp 528–537. IEEE Press, New York
7 Clifford J, Croker A (1993) The historical relational data model (HRDM) revisited. In: Tansel A, Clifford J, Gadia S, Jajodia S, Segev A, Snodgrass RT (eds) Temporal databases: theory, design, and implementation, ch 1. Benjamin/Cummings, Reading, MA, pp 6–27
8 Clifford J, Uz Tansel A (1985) On an algebra for historical relational databases: two views. In: Proceedings of the ACM SIGMOD international conference on management of data, Austin, TX, 28–31 May 1985, pp 1–8
9 DeWitt DJ, Katz RH, Olken F, Shapiro LD, Stonebraker MR, Wood D (1984) Implementation techniques for main memory database systems. In: Proceedings of the ACM SIGMOD international conference on management of data, Boston, 18–21 June 1984, pp 1–8
10 Dittrich JP, Seeger B, Taylor DS, Widmayer P (2002) Progressive merge join: a generic and non-blocking sort-based join algorithm. In: Proceedings of the conference on very large databases, Madison, WI, 3–6 June 2002, pp 299–310
11 Dunn J, Davey S, Descour A, Snodgrass RT (2002) Sequenced subset operators: definition and implementation. In: Proceedings of the IEEE international conference on data engineering, San Jose, 26 February–1 March 2002, pp 81–92
12 Dyreson CE, Snodgrass RT (1993) Timestamp semantics and representation. Inform Sys 18(3):143–166
13 Elmasri R, Wuu GTJ, Kim YJ (1990) The time index: an access structure for temporal data. In: Proceedings of the conference on very large databases, Brisbane, Queensland, Australia, 13–16 August 1990, pp 1–12
14 Etzion O, Jajodia S, Sripada S (1998) Temporal databases: research and practice. Lecture notes in computer science, vol 1399. Springer, Berlin Heidelberg New York
15 Gadia SK (1988) A homogeneous relational model and query languages for temporal databases. ACM Trans Database Sys 13(4):418–448
16 Gao D, Jensen CS, Snodgrass RT, Soo MD (2002) Join operations in temporal databases. TimeCenter TR-71. http://www.cs.auc.dk/TimeCenter/pub.htm
17 Gao D, Kline N, Soo MD, Dunn J (2002) TimeIT: the Time Integrated Testbed, v 2.0. Available via anonymous FTP at: ftp.cs.arizona.edu
18 Graefe G (1993) Query evaluation techniques for large databases. ACM Comput Surv 25(2):73–170
19 Graefe G, Linville A, Shapiro LD (1994) Sort vs hash revisited. IEEE Trans Knowl Data Eng 6(6):934–944
20 Gunadhi H, Segev A (1991) Query processing algorithms for temporal intersection joins. In: Proceedings of the IEEE conference on data engineering, Kobe, Japan, 8–12 April 1991, pp 336–344
21 Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the ACM SIGMOD conference, Boston, 18–21 June 1984, pp 47–57
22 Harris EP, Ramamohanarao K (1996) Join algorithm costs revisited. J Very Large Databases 5(1):64–84
23 Jensen CS (ed) (1998) The consensus glossary of temporal database concepts – February 1998 version. In [14], pp 367–405
24 Jensen CS, Mark L, Roussopoulos N (1991) Incremental implementation model for relational databases with transaction time. IEEE Trans Knowl Data Eng 3(4):461–473
25 Jensen CS, Snodgrass RT, Soo MD (1996) Extending existing dependency theory to temporal databases. IEEE Trans Knowl Data Eng 8(4):563–582
26 Jensen CS, Soo MD, Snodgrass RT (1994) Unifying temporal models via a conceptual model. Inform Sys 19(7):513–547
27 Leung TY, Muntz R (1990) Query processing for temporal databases. In: Proceedings of the IEEE conference on data engineering, Los Angeles, 6–10 February 1990, pp 200–208
28 Leung TYC, Muntz RR (1992) Temporal query processing and optimization in multiprocessor database machines. In: Proceedings of the conference on very large databases, Vancouver, BC, Canada, pp 383–394
29 Leung TYC, Muntz RR (1993) Stream processing: temporal query processing and optimization. In: Tansel A, Clifford J, Gadia S, Jajodia S, Segev A, Snodgrass RT (eds) Temporal databases: theory, design, and implementation, ch 14. Benjamin/Cummings, Reading, MA, pp 329–355
30 Li W, Gao D, Snodgrass RT (2002) Skew handling techniques in sort-merge join. In: Proceedings of the ACM SIGMOD conference on management of data, Madison, WI, 3–6 June 2002, pp 169–180
31 Lo ML, Ravishankar CV (1994) Spatial joins using seeded trees. In: Proceedings of the ACM SIGMOD conference, Minneapolis, MN, 24–27 May 1994, pp 209–220
32 Lo ML, Ravishankar CV (1996) Spatial hash-joins. In: Proceedings of the ACM SIGMOD conference, Montreal, 4–6 June 1996, pp 247–258
33 Lu H, Ooi BC, Tan KL (1994) On spatially partitioned temporal join. In: Proceedings of the conference on very large databases, Santiago de Chile, Chile, 12–15 September 1994, pp 546–557
34 McKenzie E (1988) An algebraic language for query and update of temporal databases. PhD dissertation, Department of Computer Science, University of North Carolina, Chapel Hill, NC
35 Mishra P, Eich M (1992) Join processing in relational databases. ACM Comput Surv 24(1):63–113
36 Navathe S, Ahmed R (1993) Temporal extensions to the relational model and SQL. In: Tansel A, Clifford J, Gadia S, Jajodia S, Segev A, Snodgrass RT (eds) Temporal databases: theory, design, and implementation. Benjamin/Cummings, Reading, MA, pp 92–109
37 Orenstein JA (1986) Spatial query processing in an object-oriented database system. In: Proceedings of the ACM SIGMOD conference, Washington, DC, 28–30 May 1986, pp 326–336
38 Orenstein JA, Manola FA (1988) PROBE spatial data modeling and query processing in an image database application. IEEE Trans Software Eng 14(5):611–629
39 Özsoyoğlu G, Snodgrass RT (1995) Temporal and real-time databases: a survey. IEEE Trans Knowl Data Eng 7(4):513–532
40 Patel JM, DeWitt DJ (1996) Partition based spatial-merge join. In: Proceedings of the ACM SIGMOD conference, Montreal, 4–6 June 1996, pp 259–270
41 Pfoser D, Jensen CS (1999) Incremental join of time-oriented data. In: Proceedings of the international conference on scientific and statistical database management, Cleveland, OH, 28–30 July 1999, pp 232–243
42 Ramakrishnan R, Gehrke J (2000) Database management systems. McGraw-Hill, New York
43 Rana S, Fotouhi F (1993) Efficient processing of time-joins in temporal data bases. In: Proceedings of the international symposium on DB systems for advanced applications, Daejeon, South Korea, 6–8 April 1993, pp 427–432
44 Salzberg B, Tsotras VJ (1999) Comparison of access methods for time-evolving data. ACM Comput Surv 31(2):158–221
45 Samet H (1990) The design and analysis of spatial data structures. Addison-Wesley, Reading, MA
46 Segev A (1993) Join processing and optimization in temporal relational databases. In: Tansel A, Clifford J, Gadia S, Jajodia S, Segev A, Snodgrass RT (eds) Temporal databases: theory, design, and implementation, ch 15. Benjamin/Cummings, Reading, MA, pp 356–387
47 Segev A, Gunadhi H (1989) Event-join optimization in temporal relational databases. In: Proceedings of the conference on very large databases, Amsterdam, 22–25 August 1989, pp 205–215
48 Sellis T, Roussopoulos N, Faloutsos C (1987) The R+-tree: a dynamic index for multidimensional objects. In: Proceedings of the conference on very large databases, Brighton, UK, 1–4 September 1987, pp 507–518
49 Sitzmann I, Stuckey PJ (2000) Improving temporal joins using histograms. In: Proceedings of the international conference on database and expert systems applications, London/Greenwich, UK, 4–8 September 2000, pp 488–498
50 Slivinskas G, Jensen CS, Snodgrass RT (2001) A foundation for conventional and temporal query optimization addressing duplicates and ordering. IEEE Trans Knowl Data Eng 13(1):21–49
51 Snodgrass RT, Ahn I (1986) Temporal databases. IEEE Comput 19(9):35–42
52 Son D, Elmasri R (1996) Efficient temporal join processing using time index. In: Proceedings of the conference on statistical and scientific database management, Stockholm, Sweden, 18–20 June 1996, pp 252–261
53 Soo MD, Jensen CS, Snodgrass RT (1995) An algebra for TSQL2. In: Snodgrass RT (ed) The TSQL2 temporal query language, ch 27. Kluwer, Amsterdam, pp 505–546
54 Soo MD, Snodgrass RT, Jensen CS (1994) Efficient evaluation of the valid-time natural join. In: Proceedings of the international conference on data engineering, Houston, TX, 14–18 February 1994, pp 282–292
55 Tsotras VJ, Kumar A (1996) Temporal database bibliography update. ACM SIGMOD Rec 25(1):41–51
56 Zhang D, Tsotras VJ, Seeger B (2002) Efficient temporal join processing using indices. In: Proceedings of the IEEE international conference on data engineering, San Jose, 26 February–1 March 2002, pp 103–113
57 Zurek T (1997) Optimisation of partitioned temporal joins. PhD dissertation, Department of Computer Science, Edinburgh University, Edinburgh, UK
Storing and querying XML data using denormalized relational databases
Andrey Balmin, Yannis Papakonstantinou
Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093
(e-mail:{abalmin,yannis}@cs.ucsd.edu)
Edited by A Halevy Received: December 21, 2001 / Accepted: July 1, 2003
Published online: June 23, 2004 – © Springer-Verlag 2004
Abstract XML database systems emerge as a result of the acceptance of the XML data model. Recent works have followed the promising approach of building XML database management systems on underlying RDBMSs. Achieving query processing performance reduces to two questions: (i) How should the XML data be decomposed into data that are stored in the RDBMS? (ii) How should the XML query be translated into an efficient plan that sends one or more SQL queries to the underlying RDBMS and combines the data into the XML result? We provide a formal framework for XML Schema-driven decompositions, which encompasses the decompositions proposed in prior work and extends them with decompositions that employ denormalized tables and binary-coded XML fragments. We provide corresponding query processing algorithms that translate the XML query conditions into conditions on the relational tables and assemble the decomposed data into the XML query result. Our key performance focus is the response time for delivering the first results of a query. The most effective of the described decompositions have been implemented in XCacheDB, an XML DBMS built on top of a commercial RDBMS, which serves as our experimental basis. We present experiments and analysis that point to a class of decompositions, called inlined decompositions, that improve query performance for full results and first results, without significant increase in the size of the database.
1 Introduction
The acceptance and expansion of the XML model creates a need for XML database systems [3,4,8,10,15,19,23,25,31,32,34,35,41]. One approach towards building XML DBMSs is based on leveraging an underlying RDBMS for storing and querying the XML data. This approach allows the XML database to take advantage of mature relational technology, which provides reliability, scalability, high-performance indices, concurrency control, and other advanced functionality.
Andrey Balmin has been supported by NSF IRI-9734548. The authors built the XCacheDB system while on leave at Enosys Software, Inc., during 2000.
[Fig 1 The XML database architecture. The XCacheDB loader (schema processor and data decomposer) reads an XML Schema and XML data, optionally guided by the user, and stores table definitions, schema information, and tuples in the RDBMS; the XCacheDB query processor translates XQuery against the exported XML view into SQL queries and assembles XML results from the returned tuple streams]
We provide a formal framework for XML Schema-driven decompositions of the XML data into relational data. The described framework encompasses the decompositions described in prior work on XML Schema-driven decompositions [3,34] and extends prior work with a wide range of decompositions that employ denormalized tables and binary-coded non-atomic XML fragments.

The most effective among the set of the described decompositions have been implemented in the presented XCacheDB, an XML DBMS built on top of a commercial RDBMS [5]. XCacheDB follows the typical architecture (see Fig 1) of an XML database built on top of an RDBMS [3,8,23,32,34]. First, XML data, accompanied by their XML Schema [38], are loaded into the database using the XCacheDB loader, which consists of two modules: the schema processor and the data decomposer. The schema processor inputs the XML Schema and creates in the underlying relational database the tables required to store any document conforming to the given XML schema. The conversion of the XML schema into a relational one may use optional user guidance. The mapping from the XML schema to the relational schema is called schema decomposition.¹ The data decomposer converts XML documents conforming to the XML schema into tuples that are inserted into the relational database.
XML data loaded into the relational database are queried by the XCacheDB query processor. The processor exports an XML view identical to the imported XML data. A client issues an XML query against the view. The processor translates the query into one or more SQL queries and combines the result tuples into the XML result. Notice that the underlying relational database is transparent to the query client.
The key challenges in XML databases built on relational systems are:
1. how to decompose the XML data into relational data;
2. how to translate the XML query into a plan that sends one or more SQL queries to the underlying RDBMS and constructs an XML result from the relational tuple streams.
A number of decomposition schemes have been proposed [3,8,11,34]. However, all prior works have adhered to decomposing into normalized relational schemas. Normalized decompositions convert an XML document into a typically large number of tuples of different relations. Performance is hurt when an XML query that asks for some parts of the original XML document results in an SQL query (or SQL queries) that has to perform a large number of joins to retrieve and reconstruct all the necessary information.
We provide a formal framework that describes a wide space of XML Schema-driven denormalized decompositions, and we explore this space to optimize query performance. Note that denormalized decompositions may involve a set of relational design anomalies, namely, non-atomic values, functional dependencies, and multivalued dependencies. Such anomalies introduce redundancy and impede the correct maintenance of the database [14]. However, given that the decomposition is transparent to the user, the introduced anomalies are irrelevant from a maintenance point of view. Moreover, XML databases today are mostly used in web-based query systems where datasets are updated relatively infrequently and query performance is crucial. Thus, in our analysis of the schema decompositions we focus primarily on their repercussions on query performance and secondarily on storage space and update speed.
The XCacheDB employs the most effective of the described decompositions. It employs two techniques that trade space for query performance by denormalizing the relational data:
• non-Normal Form (non-NF) tables eliminate many joins, along with the particularly expensive join start-up time;
• BLOBs are used to store pre-parsed XML fragments, hence facilitating the construction of XML results. BLOBs eliminate the joins and "order by" clauses that are needed for the efficient grouping of the flat relational data into nested XML structures, as was previously shown in [33].
Overall, both techniques have a positive impact on total query execution time in most cases. The results are most impressive when we measure the response time, i.e., the time it takes to output the first few fragments of the result. Response time is important for web-based query systems, where users tend to first issue under-constrained queries for purposes of information discovery. They want to quickly retrieve the first results and then issue a more precise query. At the same time, web interfaces do not need more than the first few results, since the limited monitor space does not allow the display of too much data. Hence it is most important to produce the first few results quickly.

¹ XCacheDB stores it in the relational database as well.
Our main contributions are:
• We provide a framework that organizes and formalizes a wide spectrum of decompositions of the XML data into relational databases.
• We classify the schema decompositions based on the dependencies in the produced relational schemas. We identify a class of mappings called inlined decompositions that allow us to considerably improve query performance by reducing the number of joins in a query, without a significant increase in the size of the database.
• We describe data decomposition, the conversion of an XML query into an SQL query to the underlying RDBMS, and the composition of the relational result into the XML result.
• We have built into the XCacheDB system the most effective of the possible decompositions.
• Our experiments demonstrate that under typical conditions certain denormalized decompositions provide significant improvements in query performance and especially in query response time. In some cases, we observed up to a 400% improvement in total time (Fig 23, Q1 with selectivity 0.1%) and a 2–100 times improvement in response time (Fig 23, Q1 with selectivity above 10%).
The rest of this paper is organized as follows. In Sect. 2 we discuss related work. In Sect. 3, we present definitions and the framework. Section 4 presents the decompositions of XML Schemas into sets of relations. In Sect. 5, we present algorithms for translating the XML queries into SQL and assembling the XML results. In Sect. 6, we discuss the architecture of XCacheDB along with interesting implementation aspects. In Sect. 7, we present the experimental results. We conclude and discuss directions for future work in Sect. 8.
2 Related work
The use of relational databases for storing and querying XML has been advocated before by [3,8,11,23,32,34]. Some of these works [8,11,23] did not assume knowledge of an XML schema. In particular, the Agora project employed a fixed relational schema, which stores a tuple per XML element. This approach is flexible, but it is less competitive than the other approaches because of the performance problems caused by the large number of joins in the resulting SQL queries. The STORED system [8] also employed a schema-less approach. However, STORED used data mining techniques to discover patterns in data and automatically generate XML-to-Relational mappings.
The works of [34] and [3] considered using DTDs and XML Schemas to guide the mapping of XML documents into relations. [34] considered a number of decompositions leading to normalized tables. The "hybrid" approach, which provides the best performance, is identical to our "minimal 4NF decomposition". The other approaches of [34] can also be modeled by our framework. In one respect our model is more restrictive, as we only consider DAG schemas while [34] also takes into account cyclic schemas. It is possible to extend our approach to arbitrary schema graphs by utilizing their techniques. [3] studies horizontal and vertical partitioning of the minimal 4NF schemas. Their results are directly applicable in our case. However, we chose not to experiment with those decompositions, since their effect, besides being already studied, tends to be less dramatic than the effect of producing denormalized relations. Note also that [3] uses a cost-based optimizer to find an optimal mapping for a given query mix. The query mix approach can benefit our work as well.
To the best of our knowledge, this is the first work to use denormalized decompositions to enhance query performance.
There are also other related works in the intersection of relational databases and XML. The construction of XML results from relational data was studied by [12,13,33]. [33] considered a variety of techniques for grouping and tagging results of the relational queries to produce the XML documents. It is interesting to note the comparison between the "sorted outer union" approach and BLOBs, which significantly improve query performance. SilkRoute [12,13] considered using multiple SQL queries to answer a single XML query and specified the optimal approach for various situations, which is applicable in our case as well.

Oracle 8i/9i, IBM DB2, and Microsoft SQL Server provide some basic XML support [4,19,31]. None of these products supported XQuery or any other full-featured XML query language as of May 2003.
Another approach towards storing and querying XML is based on native XML and OODB technologies [15,25,35]. The BLOBs resemble the common object-oriented technique of clustering together objects that are likely to be queried and retrieved jointly [2]. Also, the non-normal form relations that we use are similar to path indices, such as the "access support relations" proposed by Kemper and Moerkotte [20]. An important difference is that we store data together with an index, similarly to Oracle's "index organized tables" [4].
A number of commercial XML databases are available. Some of these systems [9,21,24] only support API data access and are effectively persistent implementations of the Document Object Model [36]. However, most of the systems [1,6,10,17,18,26,27,35,40–42,44] implement the XPath query language or its variations. Some vendors [10,26,35] have announced XQuery [39] support in upcoming versions; however, only the X-Hive 3.0 XQuery processor [41] and the Ipedo XML Database [18] were publicly available at the time of writing.

The majority of the above systems use native XML storage, but some [10,40,41] are implemented on top of object-oriented databases. Besides query processing, some of the commercial XML databases support full-text searches [18,41,44], transactional updates [6,10,18,26,40,42], and document versioning [18,40].
Even though XPath does not support heterogeneous joins, some systems [27,35] recognize their importance for data integration applications and provide facilities that enable this feature.
Our work concentrates on selection and join queries. Another important class of XML queries involves path expressions. A number of schemes [16,22] have been proposed recently that employ various node numbering techniques to facilitate the evaluation of path expressions. For instance, [22] proposes to use pairs of numbers (start position and subtree size) to identify nodes. The XSearch system [43] employs Dewey encoding of node IDs to quickly test for ancestor-descendant relationships. These techniques can be applied in the context of XCacheDB, since the only restriction that we place on node IDs is their uniqueness.
3 Framework
We use the conventional labeled tree notation to represent XML data. The nodes of the tree correspond to XML elements and are labeled with the elements' tags. Tags that start with the "@" symbol stand for attributes. Leaf nodes may also be labeled with values that correspond to the string content. Note that we treat XML as a database model that allows for rich structures that contain nesting, irregularities, and structural variance across the objects. We assume the presence of an XML Schema and expect the data to be accessed via an XML query language such as XQuery. We have excluded many document-oriented features of XML, such as mixed content, comments, and processing instructions.

Every node has a unique id invented by the system. The id's play an important role in the conversion of the tree to relational data, as well as in the reconstruction of the XML fragments from the relational query results.
Definition 1 (XML document) An XML document is a tree where:
1. Every node has a label l coming from the set of element tags L.
2. Every node has a unique id.
3. Every atomic node has an additional label v coming from the set of values V. Atomic nodes can only be leafs of the tree.² ♦
Figure 2 shows an example of an XML document tree. We will use this tree as our running example. We consider only unordered trees. We can extend our approach to ordered trees because the node id's are assigned by a depth-first traversal of the XML documents and can be used to order sibling nodes.
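To illustrate, the following Python sketch (names are ours) assigns ids in a depth-first traversal; earlier siblings always receive smaller ids, so document order among siblings can be recovered by comparing ids.

from itertools import count

# Minimal sketch: ids assigned by depth-first traversal, so that
# comparing sibling ids reproduces document order.
def assign_ids(element, counter=None):
    counter = counter or count(1)
    element["id"] = next(counter)          # the parent gets a smaller id...
    for child in element.get("children", []):
        assign_ids(child, counter)         # ...and so do earlier siblings

doc = {"tag": "Customers",
       "children": [{"tag": "Customer"}, {"tag": "Customer"}]}
assign_ids(doc)
# doc["children"][0]["id"] < doc["children"][1]["id"] holds by construction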
3.1 XML schema
We use schema graphs to abstract the syntax of XML Schema definitions [38]. The following example illustrates the connection between XML Schemas and schema graphs.

Example 1 Consider the XML Schema of Fig 3 and the corresponding schema graph of Fig 4. They both correspond to the TPC-H [7] data of Fig 2. The schema indicates that the XML data set has a root element named Customers, which contains one or more Customer elements. Each Customer contains (in some order) all of the atomic elements Name, Address, and MarketSegment, as well as zero or more complex elements Order and PreferredSupplier. These complex elements in turn contain other sets of elements.
² However, not every leaf has to be an atomic node. Leafs can also be empty elements.

[Fig 2 A sample TPCH-like XML data set. Id's and data values appear in brackets]
Notice that XML schemas and schema graphs are in some respects more powerful than DTDs [37]. For example, in the schema graph of Fig 4 both Customer and Supplier have Address subelements, but the customer's address is simply a string, while the supplier's address consists of Street and City elements. DTDs cannot contain elements with the same name but different content types.
Definition 2 (Schema graph) A schema is a directed graph where:
1. Every node has a label l that is one of "all" or "choice" or comes from the set of element tags L. Nodes labeled "all" and "choice" have at least two children.
2. Every leaf node has a label t coming from the set of types T.
3. Every edge is annotated with "minOccurs" and "maxOccurs" labels, which can be a non-negative integer or "unbounded".
4. A single node r is identified as the "root". Every node of the graph is reachable from r. ♦

Schema graph nodes labeled with element tags are called tag nodes; the rest of the nodes are called link nodes.
Since we use an unordered data model, we do not include "sequence" nodes in the schema graphs. Their treatment is identical to that of "all" nodes. We also modify the usual definition of a valid document to account for the unordered model. To do that, we first define the content type of a schema node, which defines the bags of sibling XML elements that are valid with respect to the schema node.
Definition 3 (Content type) Every node g of a schema graph has a content type T(g), which is a set of bags of schema nodes, defined by the following recursive rules.
• If g is a tag node, T(g) = {{g}}.
• If g is a "choice" node g = choice(g_1, ..., g_n), then T(g) is the union of the sets T^min_i,max_i(g_i), where T^min_i,max_i(g_i) is the union of all bags obtained by concatenating between min_i and max_i bags of T(g_i); if min_i = 0, T^min_i,max_i(g_i) also includes an empty bag.
• If g is an "all" node g = all(g_1, ..., g_n), then T(g) is the union of all bags obtained by the concatenation of n bags, one from each T^min_i,max_i(g_i). ♦
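The following Python sketch spells out these recursive rules, assuming every maxOccurs is a small integer rather than "unbounded"; the node representation and all names are ours.

from itertools import product

# A node is ("tag", label) or ("choice", children) or ("all", children),
# where children are (min_occurs, max_occurs, node) triples.
# Bags are represented as sorted tuples of labels.
def content_type(node):
    kind = node[0]
    if kind == "tag":
        return {(node[1],)}                 # T(g) = {{g}}
    bags_per_child = [repeated(c) for c in node[1]]
    if kind == "choice":
        result = set()
        for bags in bags_per_child:
            result |= bags                  # union over the alternatives
        return result
    # "all": concatenate one bag from each child, in every combination
    return {tuple(sorted(sum(combo, ()))) for combo in product(*bags_per_child)}

def repeated(child):
    # T^{min,max}(g): bags built from between min and max bags of T(g)
    lo, hi, node = child
    base = content_type(node)
    bags = {()} if lo == 0 else set()       # minOccurs = 0 admits the empty bag
    current = {()}
    for n in range(1, hi + 1):
        current = {tuple(sorted(a + b)) for a in current for b in base}
        if n >= lo:
            bags |= current
    return bags

# The "choice" of Fig 5: one Street or one PO Box.
choice = ("choice", [(1, 1, ("tag", "Street")), (1, 1, ("tag", "POBox"))])
print(content_type(choice))                 # {('Street',), ('POBox',)}, in some order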
Definition 4 (Document tree valid wrt schema graph) We say that a document tree T is valid with respect to a schema graph G if there is a mapping from the nodes of T to the tag nodes of G such that the root of T is mapped to the root of G and, for every node of T mapped to a tag node g, the bag of tag nodes to which its children are mapped belongs to T^min,max(g_c), where g_c is the child of g, and min and max are the annotations of the edge from g to g_c. ♦

[Fig 3 The XML Schema (XSD listing) for the TPC-H-like data of Fig 2, defining the elements customers, customer, number, name, address, market, orders, status, price, date, lineitem, part, supplier, quantity, discount, preferred_supplier, nation, and balance]
Figure 5 illustrates how the content types are assigned and used in document validation. The Address element on the right is valid with respect to the schema graph on the left. Each schema node is annotated with its content type.

[Fig 4 The schema graph corresponding to the XML Schema of Fig 3]

[Fig 5 Content types and document tree validation]
For example, the type of the "choice" node is {{Street}, {PO Box}}. The validating mapping maps the document tree nodes to the tag nodes of the schema graph (mappings are shown by the dashed lines) in such a way that the bag of tag nodes corresponding to the children of every XML node is a member of the content type of the child of the corresponding schema node. For example, the children of the Address element belong to the content type of the "all" node.
Normalized schema graphs. To simplify the presentation, we only consider normalized schema graphs, where all incoming edges of link nodes have maxOccurs = 1. Any schema graph can be converted into a, possibly less restrictive, normalized schema graph by a top-down breadth-first traversal of the schema graph that applies the following rules. For every link node N that has an incoming edge with minOccurs = inMin and maxOccurs = inMax, the maxOccurs of every outgoing edge of N is multiplied by inMax; the result of the product is "unbounded" if at least one parameter is "unbounded". Similarly, if inMin > 1, the minOccurs of the incoming edge is set to 1 and the minOccurs of every outgoing edge is multiplied by inMin. Also, if N is a "choice", it gets replaced with an "all" node with the same set of children, and for every outgoing edge the minOccurs is set to 0. For example, the schema graph of Fig 6a will be normalized into the graph of Fig 6c. Notice that the topmost "choice" node is replaced by "all", since a customer may contain any combination of the choice's children.

[Fig 6 A schema graph (a), an equivalent graph obtained by splitting the Address node (b), and the normalized graph (c)]
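A hedged sketch of these normalization rules follows; the graph representation (edges carrying minOccurs/maxOccurs, with "unbounded" encoded as infinity) and all names are ours.

from math import inf   # maxOccurs "unbounded"

# A node is ("tag", label, edges) or ("all", edges) or ("choice", edges);
# an edge is (min_occurs, max_occurs, node).
def normalize_edge(edge):
    lo, hi, node = edge
    if node[0] == "tag":
        return (lo, hi, ("tag", node[1], [normalize_edge(e) for e in node[2]]))
    out = []
    for c_lo, c_hi, child in node[1]:
        c_hi = c_hi * hi                 # multiply outgoing maxOccurs by inMax
        if lo > 1:
            c_lo = c_lo * lo             # ...and outgoing minOccurs by inMin
        if node[0] == "choice":
            c_lo = 0                     # "choice" becomes "all", edges optional
        out.append(normalize_edge((c_lo, c_hi, child)))
    # the incoming edge of the link node now carries no multiplicity
    return (min(lo, 1), 1, ("all", out))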
Without loss of generality for the decomposition algorithms described next, we only consider schemas where minOccurs is 0 or 1 and maxOccurs is either 1 or "unbounded". We use the symbols "1", "*", "?", and "+" to encode the "minOccurs"/"maxOccurs" pairs. For brevity, we omit "1" annotations in the figures. We also omit "all" nodes if their incoming edges are labeled "1", whenever this doesn't cause ambiguity.

We only consider acyclic schema graphs. Schema graph nodes that are pointed to by a "*" or a "+" will be called repeatable.
4 XML decompositions
We describe next the steps of decomposing an XML document into a relational database. First, we produce a schema decomposition, i.e., we use the schema graph to create a relational schema. Second, we decompose the XML data and load it into the corresponding tables. We use the schema decomposition to guide the data load.

The generation of an equivalent relational schema proceeds in two steps. First, we decompose the schema graph into fragments. Second, we generate a relational table definition for each fragment.
Definition 5 (Schema decomposition) A schema decomposition of a schema graph G is a set of fragments, where each fragment is a subset of the nodes of G that forms a connected DAG. Every tag node of G has to be a member of at least one fragment. ♦

Due to the acyclicity of the schema graphs, each fragment has at least one fragment root node, i.e., a node that does not have incoming edges from any other node of the fragment. Similarly, fragment leaf nodes are the nodes that do not have outgoing edges that lead to other nodes of the fragment. Note that a schema decomposition is not necessarily a partition of the schema graph – a node may be included in multiple fragments (Fig 7).

Some fragments may contain only "choice" and "all" nodes. We call these fragments trivial, since they correspond to empty data fragments. We only consider decompositions that contain connected, non-trivial fragments, where all fragment leafs are tag nodes.

[Fig 7 An XML schema decomposition]
DAG schemas offer an extra degree of freedom, since an equivalent schema can be obtained by "splitting" some of the nodes that have more than one ancestor. For example, the schema of Fig 6b can be obtained from the schema of Fig 6a by splitting at element Address. Such a split corresponds to a derived horizontal partitioning of a relational schema [28]. Similarly, element nodes may also be eliminated by "combining" nodes. For example, an all(a*, b, a*) may be reduced to all(a*, b) if the types of both a's are equal.³ Since we consider an unordered data model, the queries cannot distinguish between the "first" and "second" a's in the data. Thus, we do not need to differentiate between them. A similar DTD reduction process was used in [34]. However, unlike [34], our decompositions do not require reduction and offer the flexibility needed to support document order. Similar functionality is included in LegoDB [3].
Definition 6 (Path set, equivalent schema graphs) A path set of a schema graph G is the set of all possible paths in G that originate at the root of G. Two schema graphs are equivalent if they have the same path set. ♦

We define the set of generalized schema decompositions of a graph G to be the set of schema decompositions of all graphs G' that are equivalent to G (including the schema decompositions of G itself). Whenever it is obvious from the context, we will say "set of schema decompositions", implying the set of generalized schema decompositions.
Definition 7 (Root fragments, parent fragments) A root fragment is a fragment that contains the root of the schema graph. For each non-root fragment F we define its parent fragments in the following way: Let R be a root node of F, and let P be a parent of R in the schema graph. Any fragment that contains P is a parent fragment of F.⁴ ♦

³ We say that types A and B are equal if every element that is valid wrt A is also valid wrt B, and vice versa.
⁴ Note that a decomposition can have multiple root fragments, and a fragment can have multiple parent fragments.

[Fig 8 Loading data into fragment tables]
Definition 8 (Fragment table) A fragment table T_F of a fragment F contains an ID attribute A_N^ID for every tag node N of F. If N is an atomic node, the table T_F also has a value attribute A_N. In addition, T_F contains a parent reference column for each distinct path that leads to a root of F from a repeatable ancestor A and does not include any intermediate repeatable ancestors. The parent reference columns store the value of the ID attribute of the corresponding ancestor A. ♦

For example, consider the Address fragment table of Fig 8. Regardless of other fragments present in the decomposition, the Address table will have two parent reference columns. One column will refer to the Customer element and another to the Supplier. Since we consider only tree data, every tuple of the Address table will have exactly one non-null parent reference.
A fragment table is named after the left-most root of the corresponding fragment. Since multiple schema nodes can have the same name, name collisions are resolved by appending a unique integer.

We use null values in ID columns to represent missing optional elements. For example, the null value in the POBox_id of the first tuple of the Address table indicates that the Address element with id=2 does not have a POBox subelement. An empty XML element N is denoted by a non-null value in A_N^ID and a null in A_N.
[Fig 9a,b Alternative fragmentations of the data of Fig 8]

Data load. We use the following inductive definition of the fragment tables' content. First, we define the data content of a fragment consisting of a single tag node N. The fragment table T_N, called a node table, contains an ID attribute A_N^ID, a value attribute A_N, and one or more parent attributes. Let us
consider a typed document tree D, where each node of D is mapped to a node of the schema graph. A tuple is stored in T_N for each node d ∈ D such that (d → N) ∈ D. Assume that d is a child of the node p ∈ D such that (p → P) ∈ D. The table T_N will be populated with the following tuple: ⟨A_P^ID = p_id, A_N^ID = d_id, A_N = d⟩. If T_N contains parent attributes other than A_P^ID, they are set to null.

A table T corresponding to an internal node N is populated depending on the type of the node:
• If N is an "all" node, then T is the result of a join of all children tables on the parent reference attributes.
• If N is a "choice" node, then T is the result of an outer union of all children tables.
• If N is a tag node, which by definition has exactly one child node with a corresponding table T_C, then T = T_N ⋈ T_C.

The following example illustrates the above definition. Notice that the XCacheDB Loader does not use the brute-force implementation suggested in the example; we employ optimizations that eliminate the majority of the joins.
Example 2 Consider the schema graph fragment and the corresponding data fragment of Fig 8. The Address fragment table is built from the node tables Zip, Street, and POBox, according to the algorithm described above. A table corresponding to the "choice" node in the schema graph is built by taking an outer union⁶ of Street and POBox. The result is joined with Zip to obtain the table corresponding to the "all" node. The result of the join is, in turn, joined with the Address node table (not shown), which contains the three attributes "customer_ref", "supplier_ref", and "address_id".
Alternatively, the "Address" fragment of Fig 8 can be split in two as shown in Fig 9a and b. The dashed line in Fig 9b indicates that a horizontal partitioning of the fragment should occur along the "choice" node. This line indicates that the fragment table should be split into two. Each table projects out the attributes corresponding to one side of the "choice". The tuples of the original table are partitioned into the two tables based on the null values of the projected attributes. This operation is similar to the "union distribution" discussed in [3]. Horizontal partitioning improves the performance of queries that access either side of the union (e.g., either Street or POBox elements). However, performance may degrade for queries that access only Zip elements. Since we assume no knowledge of the query workload, we do not perform horizontal partitioning automatically but leave it as an option to the system administrator.

⁶ The outer union of two tables P and Q is a table T with the set of attributes attr(T) = attr(P) ∪ attr(Q). The table T contains all tuples of P and Q, extended with nulls in all the attributes that were not present in the original table.
The following example illustrates decomposing the TPCH-like XML schema of Fig 4 and loading it with the data of Fig 2.
Example 3 Consider the schema decomposition of Fig 10. The decomposition consists of three fragments rooted at the elements Customers, Order, and Address. Hence the corresponding relational schema has tables Customers, Order, and Address. The bottom part of Fig 10 illustrates the contents of each table for the dataset of Fig 2. Notice that the tables Customers and Order are not in BCNF. For example, the table Order has the non-key functional dependency "order_id → number_id", which introduces redundancy.

We use "(FK)" labels in Fig 10 to indicate parent references. Technically, these references are not foreign keys, since they do not necessarily refer to a primary key.
Alternatively, one could have decomposed the example schema as shown in Fig 7. In this case there is a non-FD multivalued dependency (MVD) in the Customers table, i.e., an MVD that is not implied by a functional dependency. Orders and preferred suppliers of every customer are independent of each other:

Customers(customers_id, customer_id, c_name_id, c_address_id, c_marketSegment_id, c_name, c_address, p_name_id, p_number_id, p_nation_id, p_name, p_number, p_nation, p_address_id, a_street_id, a_city_id, a_street, a_city)

The decompositions that contain non-FD MVDs are called MVD decompositions.
Vertical partitioning. In the schema of Fig 10 the Address element is not repeatable, which means that there is at most one address per supplier. Using a separate Address table is an example of vertical partitioning, because there is a one-to-one relationship between the Address table and its parent table Customers. The vertical partitioning of XML data was studied in [3], which suggests that partitioning can improve performance if the query workload is known in advance. Knowing the groups of attributes that get accessed together, vertical partitioning can be used to reduce table width without incurring a big penalty from the extra joins. We do not consider vertical partitioning in this paper, but the results of [3] carry over to our approach. We use the term minimal to refer to decompositions without vertical partitioning.
Definition 9 (Minimal decompositions) A decomposition is minimal if all edges connecting nodes of different fragments are repeatable, i.e., labeled "*" or "+". ♦

Figures 7 and 11 show two different minimal decompositions of the same schema. We call the decomposition of Fig 11 a 4NF decomposition because all its fragments are 4NF fragments (i.e., the fragment tables are in 4NF). Note that a fragment is 4NF if and only if it does not include any "*" or "+" labeled edges, i.e., no two nodes of the fragment are connected by a "*" or "+" labeled edge. We assume that the only dependencies present are those derived by the decomposition. Every XML Schema tree has exactly one minimal 4NF decomposition, which minimizes the space requirements. From here on, we only consider minimal decompositions.
con-Prior work [3,34] considers only 4NF decompositions.However we employ denormalized decompositions to im-prove query execution time as well as response time Par-ticularly important for performance purposes is the class ofinlined decompositions described below The inlined decom-positions improve query performance by reducing the number
of joins, and (unlike MVD decompositions) the space head that they introduce depends only on the schema and not
over-on the dataset
Definition 10 (Non-MVD decompositions and inlined decompositions) A non-MVD fragment is one where all "*" and "+" labeled edges appear in a single path. A non-MVD decomposition is one that has only non-MVD fragments. An inlined fragment is a non-MVD fragment that is not a 4NF fragment. An inlined decomposition is a non-MVD decomposition that contains at least one inlined fragment. ♦
The non-MVD fragment tables may have functional dependencies (FDs) that violate the BCNF condition (and also the 3NF condition [14]), but they have no non-FD MVDs. For example, the Customers table of Fig 10 contains a non-key functional dependency with determinant customer_id (e.g., customer_id → c_name) that breaks the BCNF condition, since the key is "c_preferredSupplier_id". However, the table has no non-FD MVDs.

From the point of view of the relational data, an inlined fragment table is the join of the fragment tables that correspond to a line of two or more 4NF fragments. For example, the fragment table Customers of Fig 10 is the join of the fragment tables that correspond to the 4NF fragments Customers and PreferredSupplier of Fig 11. The tables that correspond to inlined fragments are very useful because they reduce the number of joins while they keep the number of tuples in the fragment tables low.
[Fig 10 An inlined decomposition of the TPCH-like schema and the contents of its fragment tables Customers, Order, and Address for the data of Fig 2]

[Fig 11 Minimal 4NF XML schema decomposition]

[Fig 12 Classification of schema decompositions]

Lemma 1 (Space overhead as a function of schema size). Let F be an inlined fragment that consists of two 4NF fragments F1 and F2 connected by a schema tree edge.⁷ For any data set, the number of tuples of F is less than the total number of tuples of F1 and F2.

Proof. Let's consider the following three cases. First, if the schema tree edge that connects F1 and F2 is labeled with "1" or "?", the tuples of F2 will be inlined with F1. Thus F will have the same number of tuples as F1. Second, if the edge is labeled with "+", F will have the same number of tuples as F2, since F will be the result of the join of F1 and F2, and the schema implies that for every tuple in F2 there is exactly one matching tuple, but no more, in F1. Third, if the edge is labeled with "*", F will have fewer tuples than the total of F1 and F2, since F will be the result of the left outer join of F1 and F2.

⁷ A fragment consisting of two non-MVD fragments connected together is not guaranteed to be non-MVD.
We found that the inlined decompositions can provide significant query performance improvement. Noticeably, the storage space overhead of such decompositions is limited, even if the decomposition includes all possible non-MVD fragments.

Definition 11 (Complete non-MVD decompositions) A complete non-MVD decomposition, complete for short, is one that contains all possible non-MVD fragments. ♦

The complete non-MVD decompositions are only intended for illustrative purposes, and we are not advocating their practical use.
Note that a complete non-MVD decomposition includes all fragments of the 4NF decomposition. The other fragments of the complete decomposition consist of fragments of the 4NF decomposition connected together. In fact, a 4NF decomposition can be viewed as a tree of 4NF fragments, called the 4NF fragment tree. The fragments of a complete minimal non-MVD decomposition correspond to the set of paths in this tree. The space overhead of a complete decomposition is a function of the size of the 4NF fragment tree.
as a function of schema) Consider a schema graph G, its
of tuples of the complete decomposition is
|D C (G)| =k
i=1
where h is the height of the 4NF fragment tree of G, and n is
Proof Consider a record tree R constructed from an XML
document tree T in the following fashion A node of the record
tree is created for every tuple of the 4NF data decomposition
D 4NF (T ) Edges of the record tree denote child-parent
rela-tionships between tuples There is a one to one mapping frompaths in the record tree to paths in its 4NF fragment tree, and
the height of the record tree h equals to the height of the 4NF fragment tree Since any fragment of D C (G) maps to a path
in the 4NF fragment tree, every tuple of D C (T ) maps to a
path in the record tree The number of path’s in the record tree
P (R) can be computed by the following recursive expression:
P (R) = N(R) + P (R1) + + P (R n ), where N(R) is the
number of nodes in the record tree and stands for all the paths
that start at the root R i’s denote subtrees rooted at the children
of the root The maximum depth of the recursion is h At each
level of the recursion, after the first one, the total number of
added paths is less than N Thus P (R) < hN.
Multiple tuples of D C (T ) may map to the same path in the record tree, because each tuple of D C (T ) is a result of some outerjoin of tuples of D 4NF (T ), and the same tuple may be
a result of multiple outer joins (e.g A 1 B = A 1 B
1 C, if C is empty.) However the same tuple cannot be a
result of more than n distinct left outerjoins Thus |D C (G)| ≤
P (R) ∗ n By definition |D 4NF (G)| = N; hence |D C (G)| <
|D 4NF (G)| ∗ h ∗ n 4.1 BLOBs
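The path-counting recursion used in the proof can be stated compactly in Python; the (label, children) representation is ours.

# P(R) = N(R) + P(R_1) + ... + P(R_k) from the proof of Lemma 2.
def num_nodes(tree):
    label, children = tree
    return 1 + sum(num_nodes(c) for c in children)

def num_paths(tree):
    # all paths that start at the root, plus the paths inside each subtree
    label, children = tree
    return num_nodes(tree) + sum(num_paths(c) for c in children)

r = ("root", [("a", [("b", [])]), ("c", [])])
print(num_paths(r))  # 8: four paths start at the root, two at a, one each at b and c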
To speed up construction of the XML results from the relational result sets, XCacheDB stores a binary image of pre-parsed XML subtrees as Binary Large OBjects (BLOBs). The binary format is optimized for efficient navigation and printing of the XML fragments. The fragments are stored in special BLOBs tables that use node IDs as foreign keys to associate the XML fragments with the appropriate data elements.

By default, every subtree of the document except the trivial ones (the entire document and separate leaf elements) is stored in the BLOBs table. This approach may have unnecessarily high space overhead, because the data gets replicated up to H − 2 times, where H is the depth of the schema tree. We reduce the overhead by providing a graphical utility, the XCacheDB Loader, which allows the user to control which schema nodes get “BLOB-ed” by annotating the XML Schema. The user should BLOB only those elements that are likely to be returned by the queries.
Fig. 13. XML query notation
For example, in the decomposition of Fig. 10, only Order and PreferredSupplier elements were chosen to be BLOB-ed, as indicated by the boxes. Customer elements may be too large and too infrequently requested by a query, while LineItem is small and can be constructed quickly and efficiently without BLOBs.
We chose not to store BLOBs in the same tables as the data to avoid an unnecessary increase in table size, since BLOB structures can be fairly large. In fact, a BLOB has similar size to the XML subtree that it encodes. The size of an XML document (without the header and whitespace) can be computed as
XML_Size = E_N ∗ (2 ∗ E_Size + 5) + T_N ∗ T_Size,

where E_N is the number of elements, E_Size is the average size of the element tag, T_N is how many elements contain text (i.e., leaves), and T_Size is the average text size. The size of a BLOB is:

BLOB_Size = E_N ∗ (E_Size + 10) + T_N ∗ (T_Size + 3)
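Restated as code (the XML document estimate follows the formula reconstructed above, so its constants are an assumption; the BLOB estimate is as given in the text):

def xml_size(e_n, e_size, t_n, t_size):
    # per element: "<tag>" and "</tag>" = 2 * E_Size + 5 characters
    return e_n * (2 * e_size + 5) + t_n * t_size

def blob_size(e_n, e_size, t_n, t_size):
    # per the formula above: fixed 10-byte element overhead and
    # 3-byte text overhead
    return e_n * (e_size + 10) + t_n * (t_size + 3)

# 1000 elements with 8-character tags, 600 of them text leaves holding
# 20 characters each: the two sizes are indeed similar.
print(xml_size(1000, 8, 600, 20))   # 33000
print(blob_size(1000, 8, 600, 20))  # 31800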
The separate BLOBs table also gives us the option of using a separate SQL query to retrieve BLOBs, which improves the query response time.
5 XML query processing
We represent XML queries with a tree notation similar to loto-ql [29]. The query notation facilitates explanation of query processing and corresponds to FOR-WHERE-RETURN queries of the XQuery standard [39].
Definition 12 (Query) A query is a tuple ⟨C, E, R⟩, where C is called the condition tree, E is called the condition expression, and R is called the result tree. The condition tree C contains:

• Element nodes that are labeled with labels from L. Each element node n may also be labeled with a variable.

• Union nodes. The same set of variables must occur in all children subtrees of a Union node. Two nodes cannot be labeled with the same variable, unless their lowest common ancestor is a Union node.

The condition expression E is a Boolean expression built from logical connectives, constants, and variables that occur in C. In the result tree R, leaf nodes are labeled either with variables that occur in C or with constants; non-leaf nodes may be labeled with “group-by” labels consisting of one or more variables that occur in C, and a variable that labels a leaf l must occur in the group-by label of l or the group-by label of an ancestor of l. ♦
The query semantics are based on first matching the condition tree with the XML data to obtain bindings and then using the result tree to structure the bindings into the XML result.

The semantics of the condition tree are defined in two steps. First, we remove Union nodes and produce a forest of conjunctive condition trees by traversing the condition tree bottom-up and replacing each Union node nondeterministically by one of its children. This process is similar to producing a disjunctive normal form of a logical expression. The set of bindings produced by the condition tree is defined as the union of the sets of bindings produced by each of the conjunctive condition trees.
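This expansion can be sketched as follows (a simplified illustration with an invented node encoding, not the actual implementation):

# A condition node is ("union", children) or ("elem", label, var,
# children), with var = None for nodes that carry no variable.
def conjunctive_trees(node):
    """Return the forest of conjunctive condition trees of `node`,
    replacing every Union node by each of its children in turn."""
    if node[0] == "union":
        return [t for child in node[1] for t in conjunctive_trees(child)]
    _, label, var, children = node
    # combine the alternatives of the children, as in a DNF expansion
    forests = [[]]
    for c in children:
        forests = [f + [t] for f in forests for t in conjunctive_trees(c)]
    return [("elem", label, var, f) for f in forests]

# A condition tree whose Union node offers two alternatives for $X:
c = ("elem", "customer", None,
     [("union", [("elem", "name", "X", []), ("elem", "alias", "X", [])])])
print(len(conjunctive_trees(c)))  # 2 conjunctive condition trees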
Formally, let C be a condition tree of a query and t be the XML document tree. Let Var(C) be the set of variables in C. Let C_1, ..., C_l be the set of all conjunctive condition trees of C. Note that Var(C) = Var(C_i), ∀i ∈ [1, l]. A variable binding ˆβ maps each variable of Var(C) to a node of t. The set of variable bindings is computed based on the set of condition tree bindings. A condition tree binding β maps each node n of some conjunctive condition tree C_i to a node of t. The condition tree binding is valid if β(root(C_i)) = root(t) and, recursively, traversing C_i depth-first left-to-right, for each child c_j of a node c ∈ C_i, assuming c is mapped to x ∈ t, there exists a child x_j of x such that β(c_j) = x_j and label(c_j) = label(x_j). The set of variable bindings consists of all bindings ˆβ = [V_1 → x_1, ..., V_n → x_n] for which there exists a valid condition tree binding β = [c_1 → x_1, ..., c_n → x_n, ...] such that V_1 = Var(c_1), ..., V_n = Var(c_n).
The condition expression E is evaluated using the binding values, and if it evaluates to true, the variable binding is qualified. Notice that the variables bind to XML elements and not to their content values. In order to evaluate the condition expression, all variables are coerced to the content values of the elements to which they bind. For example, in Fig. 13 the variable P binds to an XML element “price”. However, when evaluating the condition expression we use the integer value of “price”.
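The following sketch illustrates these semantics for a conjunctive condition tree (a simplified illustration; the tuple encodings and the sample data are invented):

from itertools import product

# Document node: (label, content, children); condition node:
# (label, var, children), with var = None if the node has no variable.
def tree_bindings(cnode, xnode):
    """Yield variable bindings for matches of cnode at xnode."""
    clabel, var, cchildren = cnode
    xlabel, content, xchildren = xnode
    if clabel != xlabel:
        return
    base = {var: xnode} if var is not None else {}
    # every condition child must match some child of the document node
    options = [[b for x in xchildren for b in tree_bindings(c, x)]
               for c in cchildren]
    for combo in product(*options):
        merged = dict(base)
        for b in combo:
            merged.update(b)
        yield merged

doc = ("lineitem", None, [("price", 45000, []), ("price", 20000, [])])
cond = ("lineitem", None, [("price", "P", [])])

# Variables bind to elements; the condition expression coerces them to
# their content values (here: P > 30000).
qualified = [b for b in tree_bindings(cond, doc) if b["P"][1] > 30000]
print(qualified)  # one binding, for the 45000 price element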
Once a set of qualified bindings is identified, the resulting XML document tree is constructed by structural recursion on the result tree R as follows. The recursion starts at the root of R with the full set of qualified bindings B. Traversing R top-down, for each sub-tree R(n) rooted at node n, given a partial set of bindings B′ (we explain how B′ gets constructed next), we construct a forest F(n, B′) following one of the cases below; a small code sketch of this construction is given after the case list.

Label: If n consists of a tag label L without a group-by label, the result is an XML tree with root labeled L. The list of children of the root is the concatenation F(n_1, B′′)# ... #F(n_m, B′′), where n_1, ..., n_m are the children of n. For the children, the partial set of bindings is B′′ = B′.
Fig. 14. The XQuery equivalent to the query of Fig. 13
Group-By: If n is of the form L{V_1, ..., V_k}, where V_1, ..., V_k are variables, the result is a forest with one XML tree T_{v_1,...,v_k} for each distinct set v_1, ..., v_k of values of V_1, ..., V_k in B′. Each T_{v_1,...,v_k} has its root labeled L. The list of children of the root is the concatenation F(n_1, B′_1)# ... #F(n_m, B′_m), where B′_i is the set of bindings of B′ in which V_1, ..., V_k bind to v_1, ..., v_k, projected on Var(n_i), the set of variables that occur in the tree rooted at n_i.
Leaf Group-By: If n is a leaf node of the form V{V_1, ..., V_k}, the result is a list with one value of V for each distinct set v_1, ..., v_k of values of V_1, ..., V_k in B′.

Leaf Variable: If n is a single variable V, and V binds to an element E in B′, the result is E. If the query plan is valid, B′ will contain only a single tuple.
The result of the query is the forest F(r, B), where r is the root of the result tree and B is the set of bindings delivered by the condition tree and condition expression. However, since in our work we want to enforce that the result is a single XML tree, we require that r does not have a “group-by” label.
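A minimal sketch of this construction, covering the Label, Group-By, and Leaf Variable cases (the node and binding encodings are invented; Leaf Group-By is analogous):

# Result-tree node: (kind, name, groupby, children) with kind "label"
# or "var"; an output element is (tag, list of children or values).
def construct(node, bindings):
    kind, name, groupby, children = node
    if groupby:  # Group-By: one output tree per distinct value tuple
        forest = []
        for key in sorted({tuple(b[v] for v in groupby) for b in bindings}):
            part = [b for b in bindings
                    if tuple(b[v] for v in groupby) == key]
            forest.extend(construct((kind, name, [], children), part))
        return forest
    if kind == "var":               # Leaf Variable: a valid plan leaves
        return [bindings[0][name]]  # exactly one binding here
    # Label: one element; children are built from the same bindings
    return [(name, [e for c in children for e in construct(c, bindings)])]

# The Result{N,O} node of the query of Fig. 13, on two bindings:
bindings = [{"N": "Customer1", "O": "Order3"},
            {"N": "Customer1", "O": "Order13"}]
plan = ("label", "Result", ["N", "O"],
        [("var", "N", [], []), ("var", "O", [], [])])
print(construct(plan, bindings))
# [('Result', ['Customer1', 'Order13']), ('Result', ['Customer1', 'Order3'])]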
Example 4 The condition tree and expression of the query of Fig. 13 retrieve tuples ⟨N, O⟩, where N is the Name element of a Customer element with an Order O that has at least one LineItem that has Price greater than 30000. For each such pair, a Result element is created that contains the N and the O. This is essentially query number 18 of the TPC-H benchmark suite [7], modified not to aggregate across lineitems of the order. It is equivalent to the XQuery of Fig. 14.

For example, if the query is executed on the data of Fig. 2, the following set of bindings is produced, assuming that the Order elements are BLOB-ed.
Numbers in subscript indicate node IDs of the elements; square brackets denote values of atomic elements and subelements of complex elements. First, a single root element is created. Then, the group-by on the Result node partitions the bindings into two groups (for Order_3 and Order_13) and creates a Result element for each group. The second group-by creates two Order elements from the following two sets of bindings:

Result_102[Name_29[“Customer1”], Order_13[...]]]
We can extend our query semantics to an ordered XML model. To support order-preserving XML semantics, group-by operators will produce lists, given sorted lists of source bindings. In particular, the group-by operator will order the output elements according to the node IDs of the bindings of the group-by variables. For example, the group-by in the query of Fig. 13 will produce lists of pairs of names and orders, sorted by name ID and order ID.
5.1 Query processing
Figure 15 illustrates the typical query processing steps followed by XML databases built on relational databases; the architecture of XCacheDB is indeed based on the one of Fig. 15. The plan generator receives an XML query and a schema decomposition. It produces a plan, which consists of the condition tree, the condition expression, the plan decomposition,
Fig. 15. Query processing architecture