Join operations in temporal databases
Dengfeng Gao1, Christian S. Jensen2, Richard T. Snodgrass1, Michael D. Soo3

1 Computer Science Department, P.O. Box 210077, University of Arizona, Tucson, AZ 85721-0077, USA
e-mail: {dgao,rts}@cs.arizona.edu
2 Department of Computer Science, Aalborg University, Fredrik Bajers Vej 7E, 9220 Aalborg Ø, Denmark
e-mail: csj@cs.auc.dk
3 Amazon.com, Seattle; e-mail: soo@amazon.com

Edited by T. Sellis. Received: October 17, 2002 / Accepted: July 26, 2003
Published online: October 28, 2003 – © Springer-Verlag 2003
Abstract. Joins are arguably the most important relational operators. Poor implementations are tantamount to computing the Cartesian product of the input relations. In a temporal database, the problem is more acute for two reasons. First, conventional techniques are designed for the evaluation of joins with equality predicates rather than the inequality predicates prevalent in valid-time queries. Second, the presence of temporally varying data dramatically increases the size of a database. These factors indicate that specialized techniques are needed to efficiently evaluate temporal joins.
We address this need for efficient join evaluation in temporal databases. Our purpose is twofold. We first survey all previously proposed temporal join operators. While many temporal join operators have been defined in previous work, this work has been done largely in isolation from competing proposals, with little, if any, comparison of the various operators. We then address evaluation algorithms, comparing the applicability of various algorithms to the temporal join operators and describing a performance study involving algorithms for one important operator, the temporal equijoin. Our focus, with respect to implementation, is on non-index-based join algorithms. Such algorithms do not rely on auxiliary access paths but may exploit sort orderings to achieve efficiency.
Keywords: Attribute skew – Interval join – Partition join –
Sort-merge join – Temporal Cartesian product – Temporal join
– Timestamp skew
1 Introduction
Time is an attribute of all real-world phenomena. Consequently, efforts to incorporate the temporal domain into database management systems (DBMSs) have been ongoing for more than a decade [39,55]. The potential benefits of this research include enhanced data modeling capabilities and more conveniently expressed and efficiently processed queries over time.
Whereas most work in temporal databases has concentrated on conceptual issues such as data modeling and query languages, recent attention has been on related implementation issues, most notably indexing and query processing strategies. In this paper, we consider an important subproblem of temporal query processing, the evaluation of ad hoc temporal join operations, i.e., join operations for which indexing or secondary access paths are not available or appropriate. Temporal indexing, which has been a prolific research area in its own right [44], and query evaluation algorithms that exploit such temporal indexes are beyond the scope of this paper.
Joins are arguably the most important relational operators. This is so because efficient join processing is essential for the overall efficiency of a query processor. Joins occur frequently due to database normalization and are potentially expensive to compute [35]. Poor implementations are tantamount to computing the Cartesian product of the input relations. In a temporal database, the problem is more acute. Conventional techniques are aimed at the optimization of joins with equality predicates, rather than the inequality predicates prevalent in temporal queries [27]. Moreover, the introduction of a time dimension may significantly increase the size of the database. These factors indicate that new techniques are required to efficiently evaluate joins over temporal relations.
This paper aims to present a comprehensive and systematic study of join operations in temporal databases, including both semantics and implementation. Many temporal join operators have been proposed in previous research, but little comparison has been performed with respect to the semantics of these operators. Similarly, many evaluation algorithms supporting these operators have been proposed, but little analysis has appeared with respect to their relative performance, especially in terms of empirical study.
The main contributions of this paper are the following:
• To provide a systematic classification of temporal join operators as natural extensions of conventional join operators
• To provide a systematic classification of temporal join evaluation algorithms as extensions of common relational query evaluation paradigms
• To empirically quantify the performance of the temporal join algorithms for one important, frequently occurring, and potentially expensive temporal operator
Our intention is for DBMS vendors to use the contributions of this paper as part of a migration path toward incorporating temporal support into their products. Specifically, we show that nearly all temporal query evaluation work to date has extended well-accepted conventional operators and evaluation algorithms. In many cases, these operators and techniques can be implemented with small changes to an existing code base and with acceptable, though perhaps not optimal, performance.
Research has identified two orthogonal dimensions of time in databases – valid time, modeling changes in the real world, and transaction time, modeling the update activity of the database [23,51]. A database may support none, one, or both of these time dimensions. In this paper, we consider only single-dimension temporal databases, so-called valid-time and transaction-time databases. Databases supporting both time dimensions, so-called bitemporal databases, are beyond the scope of this paper, though many of the described techniques extend readily to bitemporal databases. We will use the terms snapshot, relational, or conventional database to refer to databases that provide no integrated support for time.
The remainder of the paper is organized as follows. We propose a taxonomy of temporal join operators in Sect. 2. This taxonomy extends well-established relational operators to the temporal context and classifies all previously defined temporal operators. In Sect. 3, we develop a corresponding taxonomy of temporal join evaluation algorithms, all of which are non-index-based algorithms. The next section focuses on engineering the algorithms. It turns out that getting the details right is essential for good performance. In Sect. 5, we empirically investigate the performance of the evaluation algorithms with respect to one particular, and important, valid-time join operator. The algorithms are tested under a variety of resource constraints and database parameters. Finally, conclusions and directions for future work are offered in Sect. 6.
2 Temporal join operators
In the past, temporal join operators were defined in different temporal data models; at times, essentially the same operators were even given different names when defined in different models. Further, the existing join algorithms have also been constructed within the contexts of different data models. This section enables the comparison of join definitions and implementations across data models. We thus proceed to propose a taxonomy of temporal joins and then use this taxonomy to classify all previously defined temporal joins.
We take as our point of departure the core set of conventional relational joins that have long been accepted as "standard" [35]: Cartesian product (whose "join predicate" is the constant expression TRUE), theta join, equijoin, natural join, left and right outerjoin, and full outerjoin. For each of these, we define a temporal counterpart that is a natural, temporal generalization of it. This generalization hinges on the notion of snapshot equivalence [26], which states that two temporal relations are equivalent if they consist of the same sequence of time-indexed snapshots. We note that some other join operators do exist, including semijoin, antisemijoin, and difference. Their temporal counterparts have been explored elsewhere [11] and are not considered here.
Having defined this set of temporal joins, we show how all previously defined operators are related to this taxonomy of temporal joins. The previous operators considered include Cartesian product, Θ-JOIN, EQUIJOIN, NATURAL JOIN, TIME JOIN [6,7], TE JOIN, TE OUTERJOIN, and EVENT JOIN [20,46,47,52], as well as those based on Allen's [1] interval relations [27,28,36]. We show that many of these operators incorporate less restrictive predicates or use specialized attribute semantics and thus are variants of one of the taxonomic joins.
2.1 Temporal join definitions
To be specific, we base the definitions on a single data model. We choose the model that is used most widely in temporal data management implementations, namely, the one that timestamps each tuple with an interval. We assume that the timeline is partitioned into minimal-duration intervals, termed chronons [12], and we denote intervals by inclusive starting and ending chronons.

We define two temporal relational schemas, R and S, as follows.

R = (A1, ..., An, Ts, Te)
S = (B1, ..., Bm, Ts, Te)

The Ai, 1 ≤ i ≤ n, and Bi, 1 ≤ i ≤ m, are the explicit attributes, and Ts and Te are the timestamp start and end attributes, recording when the information recorded by the explicit attributes holds (or held or will hold) true. We will use T as shorthand for the interval [Ts, Te], A as shorthand for {A1, ..., An}, and B as shorthand for {B1, ..., Bm}. We will use r and s to denote instances of R and S, respectively.
Example 1. Consider the following two temporal relations. The relations show the canonical example of employees, the departments they work for, and the managers who supervise those departments.

[Employee relation with schema (EmpName, Dept, T); Manager relation with schema (Dept, MgrName, T)]
2.2 Cartesian product
The temporal Cartesian product is a conventional Cartesian product with a predicate on the timestamp attributes. To define it, we need two auxiliary definitions.

First, intersect(U, V), where U and V are intervals, returns TRUE if there exists a chronon t such that t ∈ U ∧ t ∈ V. Second, overlap(U, V) returns the maximum interval contained in its two argument intervals; if no nonempty interval exists, the function returns ∅. To state this more precisely, let first and last return the smallest and largest, respectively, of two argument chronons. Also, let Us and Ue denote, respectively, the starting and ending chronons of U, and similarly for V.
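These two functions are straightforward to render in code. The following Python sketch is illustrative only; it assumes closed intervals represented as pairs of integer chronons, which is our representation, not the paper's.

from typing import Optional, Tuple

Interval = Tuple[int, int]  # inclusive [start, end], in chronons

def intersect(u: Interval, v: Interval) -> bool:
    # TRUE iff some chronon t lies in both U and V
    return u[0] <= v[1] and v[0] <= u[1]

def overlap(u: Interval, v: Interval) -> Optional[Interval]:
    # The maximum interval contained in both arguments; None plays the
    # role of the empty interval. "Last of the starts, first of the ends"
    # mirrors the first/last functions in the text.
    s, e = max(u[0], v[0]), min(u[1], v[1])
    return (s, e) if s <= e else None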
Definition 1. The temporal Cartesian product, r ×ᵀ s, of two temporal relations r and s is defined as follows.

r ×ᵀ s = {z^(n+m+2) | ∃x ∈ r ∃y ∈ s (
    z[A] = x[A] ∧ z[B] = y[B] ∧
    z[T] = overlap(x[T], y[T]) ∧ z[T] ≠ ∅)}

The second line of the definition sets the explicit attribute values of the result tuple z to the concatenation of the explicit attribute values of x and y. The third line computes the timestamp of z and ensures that it is nonempty.
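A direct, executable reading of Definition 1 may also help. The Python sketch below (reusing the overlap helper above) enumerates all pairs of tuples, each represented as a pair of an explicit-attribute tuple and an interval; it illustrates the semantics only and is not one of the evaluation algorithms of Sect. 3.

def temporal_cartesian_product(r, s):
    # r and s: lists of (explicit_attrs, interval) pairs
    result = []
    for a, t1 in r:
        for b, t2 in s:
            t = overlap(t1, t2)   # timestamp of the result tuple z
            if t is not None:     # keep z only if z[T] is nonempty
                result.append((a + b, t))
    return result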
Example 2. Consider the query "Show the names of employees and managers where the employee worked for the company while the manager managed some department in the company." This can be satisfied using the temporal Cartesian product.

Employee ×ᵀ Manager
The overlap function is necessary and sufficient to ensure snapshot reducibility, as will be discussed in detail in Sect. 2.7. Basically, we want the temporal Cartesian product to act as though it is a conventional Cartesian product applied independently at each point in time. When operating on interval-stamped data, this semantics corresponds to an intersection: the result will be valid during those times when contributing tuples from both input relations are valid.
The temporal Cartesian product was first defined by Segev and Gunadhi [20,47]. This operator was termed the time join, and the abbreviation T-join was used. Clifford and Croker [7] defined a Cartesian product operator that is a combination of the temporal Cartesian product and the temporal outerjoin, to be defined shortly. The interval join is a building block of the (spatial) rectangle join [2]. The interval join is a one-dimensional spatial join that can thus be used to implement the temporal Cartesian product.
2.3 Theta join
Like the conventional theta join, the temporal theta join supports an unrestricted predicate P on the explicit attributes of its input arguments. Let σ denote the standard selection operator.

Definition 2. The temporal theta join, r ⋈ᵀ_P s, of two temporal relations r and s selects those tuples from r ×ᵀ s that satisfy predicate P(r[A], s[B]). It is defined as follows.

r ⋈ᵀ_P s = σ_P(r[A],s[B])(r ×ᵀ s)

A form of this operator, the Θ-JOIN, was proposed by Clifford and Croker [6]. This operator was later extended to allow computations more general than overlap on the timestamps of result tuples [53].
2.4 Equijoin
Like the snapshot equijoin, the temporal equijoin operator enforces equality matching among specified subsets of the explicit attributes of the input relations.

Definition 3. The temporal equijoin on two temporal relations r and s on attributes A′ ⊆ A and B′ ⊆ B is defined as the temporal theta join with predicate P ≡ r[A′] = s[B′].

Like the temporal theta join, the temporal equijoin was first defined by Clifford and Croker [6]. A specialized operator, the TE-join, was developed independently by Segev and Gunadhi [47]. The TE-join requires the explicit join attribute to be a surrogate attribute of both input relations. Essentially, a surrogate attribute would be a key attribute of a corresponding nontemporal schema. In a temporal context, a surrogate attribute value represents a time-invariant object identifier. If we augment schemas R and S with surrogate attributes ID, then the TE-join can be expressed using the temporal equijoin as follows.

r ⋈ᵀ_{r[ID]=s[ID]} s

The temporal equijoin was also generalized by Zhang et al. to yield the generalized TE-join, termed the GTE-join, which specifies that the joined tuples must have their keys in a specified range while their intervals should intersect a specified interval [56]. The objective was to focus on tuples within interesting rectangles in the key-time space.

2.5 Natural join
The temporal natural join and the temporal equijoin bear the same relationship to one another as their snapshot counterparts. That is, the temporal natural join is simply a temporal equijoin on identically named explicit attributes, followed by a subsequent projection operation.

To define this join, we augment our relation schemas with explicit join attributes, Ci, 1 ≤ i ≤ k, which we abbreviate C.

Definition 4. The temporal natural join of r and s, r ⋈ᵀ s, is defined as follows.

r ⋈ᵀ s = {z^(n+m+k+2) | ∃x ∈ r ∃y ∈ s (x[C] = y[C] ∧
    z[A] = x[A] ∧ z[B] = y[B] ∧ z[C] = y[C] ∧
    z[T] = overlap(x[T], y[T]) ∧ z[T] ≠ ∅)}

The first two lines ensure that tuples x and y agree on the values of the join attributes C and set the explicit attributes of the result tuple z to the concatenation of the nonjoin attributes A and B and a single copy of the join attributes, C. The third line computes the timestamp of z as the overlap of the timestamps of x and y and ensures that x[T] and y[T] actually overlap.
This operator was first defined by Clifford and Croker [6], who named it the natural time join. We showed in earlier work that the temporal natural join plays the same important role in reconstructing normalized temporal relations as the snapshot natural join for normalized snapshot relations [25]. Most previous work in temporal join evaluation has addressed, either implicitly or explicitly, the implementation of the temporal natural join or the closely related temporal equijoin.
2.6 Outerjoins and outer Cartesian products
Like the snapshot outerjoin, temporal outerjoins and Cartesian products retain dangling tuples, i.e., tuples that do not participate in the join. However, in a temporal database, a tuple may dangle over a portion of its time interval and be covered over others; this situation must be accounted for in a temporal outerjoin or Cartesian product.

We may define the temporal outerjoin as the union of two subjoins, like the snapshot outerjoin. The two subjoins are the temporal left outerjoin and the temporal right outerjoin. As the left and right outerjoins are symmetric, we define only the left outerjoin.
We need two auxiliary functions. The coalesce function collapses value-equivalent tuples – tuples with mutually equal nontimestamp attribute values [23] – in a temporal relation into a single tuple with the same nontimestamp attribute values and a timestamp that is the finite union of intervals that precisely contains the chronons in the timestamps of the value-equivalent tuples. (A finite union of time intervals is termed a temporal element [15], which we represent in this paper as a set of chronons.) The definition of coalesce uses the function chronons that returns the set of chronons contained in the argument interval.

coalesce(r) = {z^(n+1) |
    ∃x ∈ r (z[A] = x[A] ∧
    ∀x′ ∈ r (x′[A] = x[A] ⇒ chronons(x′[T]) ⊆ z[T]) ∧
    ∀t ∈ z[T] ∃x″ ∈ r (x″[A] = x[A] ∧ t ∈ chronons(x″[T])))}

The second and third lines of the definition coalesce all value-equivalent tuples in relation r. The last line ensures that no spurious chronons are generated.
We now define a function expand that returns the set of maximal intervals contained in an argument temporal element T.

expand(T) = {[ts, te] |
    ts ∈ T ∧ te ∈ T ∧ ∀t (ts ≤ t ≤ te ⇒ t ∈ T) ∧
    ts − 1 ∉ T ∧
    te + 1 ∉ T}

The second line ensures that a member of the result is an interval contained in T. The last two lines ensure that the interval is indeed maximal.
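Both auxiliary functions can be sketched in Python under the assumption that a temporal element is represented as a set of integer chronons and a tuple as an (explicit attributes, interval) pair; the grouping strategy is ours, chosen for brevity.

from collections import defaultdict

def chronons(interval):
    # the set of chronons contained in an inclusive interval
    return set(range(interval[0], interval[1] + 1))

def coalesce(r):
    # one tuple per distinct explicit value, timestamped with the union
    # of the chronons of all value-equivalent tuples (a temporal element)
    groups = defaultdict(set)
    for a, t in r:
        groups[a] |= chronons(t)
    return list(groups.items())

def expand(elem):
    # the maximal intervals contained in a temporal element
    out, ts = [], sorted(elem)
    i = 0
    while i < len(ts):
        j = i
        while j + 1 < len(ts) and ts[j + 1] == ts[j] + 1:
            j += 1
        out.append((ts[i], ts[j]))  # ts[i]-1 and ts[j]+1 are not in elem
        i = j + 1
    return out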
We are now ready to define the temporal left outerjoin. Let R and S be defined as for the temporal equijoin, and let A′ ⊆ A and B′ ⊆ B be the explicit join attributes.

Definition 5. The temporal left outerjoin, r ⟕ᵀ_{r[A′]=s[B′]} s, of two temporal relations r and s is defined as follows.

r ⟕ᵀ_{r[A′]=s[B′]} s = {z^(n+m+2) |
    ∃x ∈ coalesce(r) ∃y ∈ coalesce(s) (
        x[A′] = y[B′] ∧ z[A] = x[A] ∧ z[T] ≠ ∅ ∧
        ((z[B] = y[B] ∧ z[T] ∈ expand(x[T] ∩ y[T])) ∨
        (z[B] = null ∧ z[T] ∈ expand(x[T] − y[T])))) ∨
    ∃x ∈ coalesce(r) ∀y ∈ coalesce(s) (
        x[A′] ≠ y[B′] ∧ z[A] = x[A] ∧ z[B] = null ∧
        z[T] ∈ expand(x[T]) ∧ z[T] ≠ ∅)}
The first five lines of the definition handle the case where, for a tuple x deriving from the left argument, a tuple y with matching explicit join attribute values is found. For those time intervals of x that are not shared with y, we generate tuples with null values in the attributes of y. The final three lines of the definition handle the case where no matching tuple y is found; tuples with null values in the attributes of y are generated.
The temporal outerjoin may be defined as simply the union of the temporal left and the temporal right outerjoins (the union operator eliminates the duplicate equijoin tuples). Similarly, a temporal outer Cartesian product is a temporal outerjoin without the equijoin condition (A′ = B′ = ∅).

Gunadhi and Segev were the first researchers to investigate outerjoins over time. They defined a specialized version of the temporal outerjoin called the EVENT JOIN [47]. This operator, of which the temporal left and right outerjoins were components, used a surrogate attribute as its explicit join attribute. This definition was later extended to allow any attributes to serve as the explicit join attributes [53]. A specialized version of the left and right outerjoins, called the TE-outerjoin, was also defined. The TE-outerjoin incorporated the TE-join, i.e., the temporal equijoin, as a component.

Clifford and Croker [7] defined a temporal outer Cartesian product, which they termed simply Cartesian product.
2.7 Reducibility
We proceed to show how the temporal operators reduce to snapshot operators. Reducibility guarantees that the semantics of the snapshot operator is preserved in its more complex temporal counterpart.

For example, the semantics of the temporal natural join reduces to the semantics of the snapshot natural join in that the result of first joining two temporal relations and then transforming the result to a snapshot relation yields a result that is the same as that obtained by first transforming the arguments to snapshot relations and then joining the snapshot relations. This commutativity diagram is shown in Fig. 1 and stated formally in the first equality of the following theorem.

Fig. 1. Reducibility of temporal natural join to snapshot natural join
The timeslice operation τᵀ_t takes a temporal relation r as argument and a chronon t as parameter. It returns the corresponding snapshot relation, i.e., with the schema of r but without the timestamp attributes, that contains (the nontimestamp portion of) all tuples x from r for which t belongs to x[T]. It follows from the theorem below that the temporal joins defined here reduce to their snapshot counterparts.
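To make the reduction concrete, the following Python sketch implements the timeslice operator over the tuple representation of the earlier sketches and checks, for one chronon, the reducibility of the temporal Cartesian product; the sorting is merely a way to compare the two sides as multisets.

def timeslice(rel, t):
    # tau_t: the snapshot at chronon t, with timestamps dropped
    return sorted(attrs for attrs, iv in rel if iv[0] <= t <= iv[1])

def snapshot_product(r_snap, s_snap):
    return sorted(a + b for a in r_snap for b in s_snap)

def reduces_at(r, s, t):
    # timeslice(r x^T s, t) should equal timeslice(r, t) x timeslice(s, t)
    lhs = timeslice(temporal_cartesian_product(r, s), t)
    rhs = snapshot_product(timeslice(r, t), timeslice(s, t))
    return lhs == rhs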
Theorem 1. Let t denote a chronon and let r and s be relation instances of the proper types for the operators they are applied to. Then the following holds for all t; analogous equivalences hold for the other temporal joins defined above.

τᵀ_t(r ⋈ᵀ s) = τᵀ_t(r) ⋈ τᵀ_t(s)

Proof: An equivalence is shown by proving its two inclusions separately. The nontimestamp attributes of r and s are AC and BC, respectively, where A, B, and C are sets of attributes and C denotes the join attribute(s) (cf. the definition of the temporal natural join). We prove one inclusion of the first equivalence, that is, τᵀ_t(r ⋈ᵀ s) ⊆ τᵀ_t(r) ⋈ τᵀ_t(s). Let x ∈ τᵀ_t(r ⋈ᵀ s) (the left-hand side of the equivalence to be proved). Then there is a tuple x′ ∈ r ⋈ᵀ s such that x′[ABC] = x and t ∈ x′[T]. By the definition of ⋈ᵀ, there exist tuples x1 ∈ r and x2 ∈ s such that x1[C] = x2[C], x′[A] = x1[A], x′[B] = x2[B], x′[C] = x1[C], and x′[T] = overlap(x1[T], x2[T]). Because t ∈ x′[T], we have t ∈ x1[T] and t ∈ x2[T], so x1[AC] ∈ τᵀ_t(r) and x2[BC] ∈ τᵀ_t(s). These two tuples agree on C and thus join, yielding x ∈ τᵀ_t(r) ⋈ τᵀ_t(s). The remaining inclusions are shown similarly.

2.8 Summary
We have defined a taxonomy for temporal join operators. The taxonomy was constructed as a natural extension of corresponding snapshot database operators. We also briefly described how previously defined temporal operators are accommodated in the taxonomy.

Table 1 summarizes how previous work is represented in the taxonomy. For each operator defined in previous work, the table lists the defining publication, researchers, the corresponding taxonomy operator, and any restrictions assumed by the original operators. In early work, Clifford [8] indicated that an INTERSECTION JOIN should be defined that represents the categorized nonouter joins and Cartesian products, and he proposed that a UNION JOIN be defined for the outer variants.
3 Evaluation algorithms
In the previous section, we described the semantics of all previously proposed temporal join operators. We now turn our attention to implementation algorithms for these operators. As before, our purpose is to enumerate the space of algorithms applicable to the temporal join operators, thereby providing a consistent framework within which existing temporal join evaluation algorithms can be placed.

Our approach is to extend well-understood paradigms from conventional query evaluation to temporal databases. Algorithms for temporal join evaluation are necessarily more complex than their snapshot counterparts. Whereas snapshot evaluation algorithms match input tuples based on their explicit join attributes, temporal join evaluation algorithms typically must additionally ensure that temporal restrictions are met. Furthermore, this problem is exacerbated in two ways. Timestamps are typically complex data types, e.g., intervals requiring inequality predicates, which conventional query processors are not optimized to handle. Also, a temporal database is usually larger than a corresponding snapshot database due to the versioning of tuples.
We consider non-index-based algorithms. Index-based algorithms use an auxiliary access path, i.e., a data structure that identifies tuples or their locations using a join attribute value. Non-index-based algorithms do not employ auxiliary access paths. While some attention has been focused on index-based temporal join algorithms, the large number of temporal indexes that have been proposed in the literature [44] precludes a thorough investigation in this paper.

We first provide a taxonomy of temporal join algorithms. This taxonomy, like the operator taxonomy of Table 1, is based on well-established relational concepts. Sections 3.2 and 3.3 describe the algorithms in the taxonomy and place existing work within the given framework. Finally, conclusions are offered in Sect. 3.4.
3.1 Evaluation taxonomy
All binary relational query evaluation algorithms, including those computing conventional joins, are derived from four basic paradigms: nested-loop, partitioning, sort-merge, and index-based [18].

Table 1. Temporal join operators

Operator | Initial citation | Taxonomy operator | Restrictions
Cartesian product | [7] | Outer Cartesian product | None
⋮

Restrictions:
1 = restricts also the valid time of the result tuples
2 = matching only on surrogate attributes
3 = includes also intersection predicates with an argument surrogate range and a time range
Partition-based join evaluation divides the input tuples into buckets using the join attributes of the input relations as key values. Corresponding buckets of the input relations contain all tuples that could possibly match with one another, and the buckets are constructed to best utilize the available main memory buffer space. The result is produced by performing an in-memory join of each pair of corresponding buckets from the input relations.

Sort-merge join evaluation also divides the input relation but uses physical memory loads as the units of division. The memory loads are sorted, producing sorted runs, and written to disk. The result is produced by merging the sorted runs, where qualifying tuples are matched and output tuples generated.

Index-based join evaluation utilizes indexes defined on the join attributes of the input relations to locate joining tuples efficiently. The index could be preexisting or built on the fly.
Elmasri et al. presented a temporal join algorithm that utilizes a two-level time index, which used a B+-tree to index the explicit attribute in the upper level, with the leaves referencing other B+-trees indexing time points [13]. Son and Elmasri revised the time index to require less space and used this modified index to determine the partitioning intervals in a partition-based timestamp algorithm [52]. Bercken and Seeger proposed several temporal join algorithms based on a multiversion B+-tree (MVBT) [4]. Later, Zhang et al. described several algorithms based on B+-trees, R∗-trees [3], and the MVBT for the related GTE-join. This operation requires that joined tuples have key values that belong to a specified range and have time intervals that intersect a specified interval [56]. The MVBT assumes that updates arrive in increasing time order, which is not the case for valid-time data. We focus on non-index-based join algorithms that apply to both valid-time and transaction-time relations, and we do not discuss these index-based joins further.
We adapt the basic non-index-based algorithms (nested-loop, partitioning, and sort-merge) to support temporal joins. To enumerate the space of temporal join algorithms, we exploit the duality of partitioning and sort-merge [19]. In particular, the division step of partitioning, where tuples are separated based on key values, is analogous to the merging step of sort-merge, where tuples are matched based on key values. In the following, we consider the characteristics of sort-merge algorithms and apply duality to derive corresponding characteristics of partition-based algorithms.

For a conventional relation, sort-based join algorithms order the input relation on the input relations' explicit join attributes. For a temporal relation, which includes timestamp attributes in addition to explicit attributes, there are four possibilities for ordering the relation. First, the relation can be sorted by the explicit attributes exclusively. Second, the relation can be ordered by time, using either the starting or ending timestamp [29,46]. The choice of starting or ending timestamp dictates an ascending or descending sort order, respectively. Third, the relation can be ordered primarily by the explicit attributes and secondarily by time [36]. Finally, the relation can be ordered primarily by time and secondarily by the explicit attributes.

By duality, the division step of partition-based algorithms can partition using any of these options [29,46]. Hence four choices exist for the dual steps of merging in sort-merge or partitioning in partition-based methods.
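For concreteness, the four orderings can be written as sort keys; the following Python fragment assumes tuples of the form (explicit attributes, (start, end)) and is ours, not taken from any of the cited algorithms.

key_explicit      = lambda x: x[0]             # 1: explicit attributes only
key_time          = lambda x: x[1][0]          # 2: ascending start time
key_explicit_time = lambda x: (x[0], x[1][0])  # 3: explicit, then time
key_time_explicit = lambda x: (x[1][0], x[0])  # 4: time, then explicit

# e.g., sorted(r, key=key_explicit_time) produces the ordering assumed
# by the explicit/timestamp sort algorithms of Table 2.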
We use this distinction to categorize the different approaches to temporal join evaluation. The first approach above, using the explicit attributes as the primary matching attributes, we term explicit algorithms. Similarly, we term the second approach timestamp algorithms. We retain the generic term temporal algorithm to mean any algorithm to evaluate a temporal operator.

Finally, it has been recognized that the choice of buffer allocation strategy, GRACE or hybrid [9], is independent of whether a sort-based or partition-based approach is used [18]. Hybrid policies retain most of the last run of the outer relation in main memory and so minimize the flushing of intermediate buffers to disk, thereby potentially decreasing the I/O cost.

Figure 2 lists the choices of sort-merge vs. partitioning, the possible sorting/partitioning attributes, and the possible buffer allocation strategies.

Fig. 2. Space of possible evaluation algorithms

Combining all possibilities yields 16 possible evaluation algorithms. Including the basic nested-loop algorithm and the GRACE and hybrid variants of the sort-based interval join mentioned in Sect. 2.2 results in a total of 19 possible algorithms. The 19 algorithms are named and described in Table 2.
We noted previously that time intervals lack a natural order. From this point of view, spatial join is similar because there is no natural order preserving spatial closeness. Previous work on spatial join may be categorized into three approaches. Early work [37,38] used a transformation approach based on space-filling curves, performing a sort-merge join along the curve to solve the join problem. Most of the work falls into index-based approaches, utilizing spatial index structures such as the R-tree [21], R+-tree [48], R∗-tree [3], Quad-tree [45], or seeded tree [31]. While some algorithms use preexisting indexes, others build the indexes on the fly.
In recent years, some work has focused on non-index-based spatial join approaches. Two partition-based spatial join algorithms have been proposed. One of them [32] partitions the input relations into overlapping buckets and uses an indexed nested-loop join to perform the join within each bucket. The other [40] partitions the input relations into disjoint partitions and uses a computational-geometry-based plane-sweep algorithm that can be thought of as the spatial equivalent of the sort-merge algorithm. Arge et al. [2] introduced a highly optimized implementation of the sweeping-based algorithm that first sorts the data along the vertical axis and then partitions the input into a number of vertical strips. Data in each strip can then be joined by an internal plane-sweep algorithm. All the above non-index-based spatial join algorithms use a sort- or partition-based approach or combine these two approaches in one algorithm, which is the approach we adopt in some of our temporal join algorithms (Sect. 4.3.2).
In the next two sections, we examine the space of explicit algorithms and timestamp algorithms, respectively, and classify existing approaches using the taxonomy developed in this section. We will see that most previous work in temporal join evaluation has centered on timestamp algorithms. However, for expository purposes, we first examine those algorithms based on manipulation of the nontimestamp columns, which we term "explicit" algorithms.
3.2 Explicit algorithms
Previous work has largely ignored the fact that conventional query evaluation algorithms can be easily modified to evaluate temporal joins. In this section, we show how the three paradigms of query evaluation can support temporal join evaluation. To make the discussion concrete, we develop an algorithm to evaluate the valid-time natural join, defined in Sect. 2, for each of the three paradigms. We begin with the simplest paradigm, nested-loop evaluation.

3.2.1 Nested-loop-based algorithms

for each block b_r ∈ r
    for each block b_s ∈ s
        for each tuple x ∈ b_r
            for each tuple y ∈ b_s
                if x[C] = y[C] ∧ intersect(x[T], y[T]) then
                    add the join of x and y, timestamped with overlap(x[T], y[T]), to the result

Fig. 3. Temporal nested-loop join
The algorithm operates as follows. One relation is designated the outer relation, the other the inner relation [35,18]. The outer relation is scanned once. For each block of the outer relation, the inner relation is scanned. When a block of the inner relation is read into memory, the tuples in that "inner block" are joined with the tuples in the "outer block."

The temporal nested-loop join is easily constructed from this basic algorithm. All that is required is that the timestamp predicate be evaluated at the same time as the predicate on the explicit attributes. Figure 3 shows the temporal algorithm. (In the figure, r is the outer relation and s is the inner relation. We assume their schemas are as defined in Sect. 2.)

While conceptually simple, nested-loop-based evaluation is often not competitive due to its quadratic cost. We now describe temporal variants of the sort-merge and partition-based algorithms, which usually exhibit better performance.
3.2.2 Sort-merge-based algorithms
Sort-merge join algorithms consist of two phases. In the first phase, the input relations r and s are sorted by their join attributes. In the second phase, the result is produced by simultaneously scanning r and s, merging tuples with identical values for their join attributes.

Complications arise if the join attributes are not key attributes of the input relations. In this case, multiple tuples in r and in s may have identical join attribute values. Hence a given r tuple may join with many s tuples, and vice versa. (This is termed skew [30].)

As before, we designate one relation as the outer relation and the other as the inner relation.
Table 2. Algorithm taxonomy

Algorithm | Acronym | Description
Nested loop | NL | Block-oriented nested loop
Explicit sort | ES | GRACE sort-merge by explicit attributes
Hybrid explicit sort | ES-H | Hybrid sort-merge by explicit attributes
Timestamp sort | TS | GRACE sort-merge by timestamps
Hybrid timestamp sort | TS-H | Hybrid sort-merge by timestamps
Explicit/timestamp sort | ETS | GRACE sort-merge by explicit attributes/time
Hybrid explicit/timestamp sort | ETS-H | Hybrid sort-merge by explicit attributes/time
Timestamp/explicit sort | TES | GRACE sort-merge by time/explicit attributes
Hybrid timestamp/explicit sort | TES-H | Hybrid sort-merge by time/explicit attributes
Interval join | TSI | GRACE sort-merge by timestamps
Hybrid interval join | TSI-H | Hybrid sort-merge by timestamps
Explicit partitioning | EP | GRACE partitioning by explicit attributes
Hybrid explicit partitioning | EP-H | Hybrid partitioning by explicit attributes
Timestamp partitioning | TP | Range partitioning by time
Hybrid timestamp partitioning | TP-H | Hybrid range partitioning by time
Explicit/timestamp partitioning | ETP | GRACE partitioning by explicit attributes/time
Hybrid explicit/timestamp partitioning | ETP-H | Hybrid partitioning by explicit attributes/time
Timestamp/explicit partitioning | TEP | GRACE partitioning by time/explicit attributes
Hybrid timestamp/explicit partitioning | TEP-H | Hybrid partitioning by time/explicit attributes
structure state
integer current block;
integer current tuple;
integer first block;
integer first tuple;
block tuples;
Fig. 4. State structure for merge scanning
When consecutive tuples in the outer relation have identical values for their explicit join attributes, i.e., their nontimestamp join attributes, the scan of the inner relation is "backed up" to ensure that all possible matches are found. Prior to showing the explicitSortMerge algorithm, we define a suite of algorithms that manage the scans of the input relations. For each scan, we maintain the state structure shown in Fig. 4. The fields current block and current tuple together indicate the current tuple in the scan by recording the number of the current block and the index of the current tuple within that block. The fields first block and first tuple are used to record the state at the beginning of a scan of the inner relation in order to back up the scan later if needed. Finally, tuples stores the block of the relation currently in memory. For convenience, we treat the block as an array of tuples.
The initState algorithm shown in Fig. 5 initializes the state of a scan. Essentially, counters are set to guarantee that the first block read and the first tuple scanned are the first block and the first tuple within that block in the input relation. We assume that a seek operation is available that repositions the file pointer associated with a relation to a given block number.

The advance algorithm advances the scan of the argument relation and state to the next tuple in the sorted relation. If the current block has been exhausted, then the next block of the relation is read. Otherwise, the state is updated to mark the next tuple in the current block as the next tuple in the scan.
initState(relation, state):
state.current block ← 1;
state.current tuple ← 0;
state.first block ←⊥;
state.first tuple ←⊥;
seek(relation, state.current block);
state.tuples ← read block(relation);
advance(relation, state):
if (state.current tuple = MAX TUPLES)
state.tuples ← read block(relation);
state.current block ← state.current block + 1;
state.current tuple ← 1;
else
state.current tuple ← state.current tuple + 1;
currentTuple(state):
return state.tuples[state.current tuple]
backUp(relation, state):
if (state.current block ≠ state.first block)
state.current block ← state.first block;
seek(relation, state.current block);
state.tuples ← read block(relation);
state.current tuple ← state.first tuple;
markScanStart(state):
state.first block ← state.current block;
state.first tuple ← state.current tuple;
Fig. 5. Merge algorithms
Fig. 6. The explicitSortMerge algorithm
The currentTuple algorithm merely returns the next tuple in the scan, as indicated by the scan state. Finally, the backUp and markScanStart algorithms manage the backing up of the inner relation scan. The backUp algorithm reverts the current block and tuple counters to their last values. These values are stored in the state at the beginning of a scan by the markScanStart algorithm.
We are now ready to exhibit the explicitSortMerge algorithm, shown in Fig. 6. The algorithm accepts three parameters, the input relations r and s and the join attributes C. We assume that the schemas of r and s are as given in Sect. 2. Tuples from the outer relation are scanned in order. For each outer tuple, if the tuple matches the previous outer tuple, the scan of the inner relation is backed up to the first matching inner tuple. The starting location of the scan is recorded in case backing up is needed by the next outer tuple, and the scan proceeds forward as normal. The complexity of the algorithm, as well as its performance degradation as compared with conventional sort-merge, is due largely to the bookkeeping required to back up the inner relation scan. We consider this performance hit in more detail in Sect. 4.2.2.
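The essential logic of explicitSortMerge can be conveyed by the following Python sketch, under simplifying assumptions: both inputs are in-memory lists of (C, A, T) triples already sorted on the explicit join attribute C, so index arithmetic replaces the block-oriented scan state of Figs. 4 and 5.

def explicit_sort_merge(r, s):
    # r: list of (c, a, t) triples sorted on c; s: list of (c, b, t) triples
    result = []
    j = mark = 0
    for i, (c_r, a, t_r) in enumerate(r):
        if i > 0 and c_r == r[i - 1][0]:
            j = mark                    # backUp: revisit the value packet
        else:
            while j < len(s) and s[j][0] < c_r:
                j += 1                  # advance the inner scan
            mark = j                    # markScanStart
        k = j
        while k < len(s) and s[k][0] == c_r:
            _, b, t_s = s[k]
            t = overlap(t_r, t_s)       # temporal predicate and timestamp
            if t is not None:
                result.append((c_r, a, b, t))
            k += 1
    return result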
Segev and Gunadhi developed three algorithms based on explicit sorting, differing primarily by the code in the inner loop and by whether backup is necessary. Two of the algorithms, TEJ-1 and TEJ-2, support the temporal equijoin [46]; the remaining algorithm, EJ-1, evaluates the temporal outerjoin [46].

TEJ-1 is applicable if the equijoin condition is on the surrogate attributes of the input relations. The surrogate attributes are essentially key attributes of a corresponding snapshot schema. TEJ-1 assumes that the input relations are sorted primarily by their surrogate attributes and secondarily by their starting timestamps. The surrogate matching, sort ordering, and 1TNF assumption described in Sect. 3.3.1 allow the result to be produced with a single scan of both input relations, with no backup.
The second equijoin algorithm, TEJ-2, is applicable when the equijoin condition involves any explicit attributes, surrogate or not. TEJ-2 assumes that the input relations are sorted primarily by their explicit join attribute(s) and secondarily by their starting timestamps. Note that since the join attribute can be a nonsurrogate attribute, tuples sharing the same join attribute value may overlap in valid time. Consequently, TEJ-2 requires the scan of the inner relation to be backed up in order to find all tuples with matching explicit attributes.

For the EVENT JOIN, Segev and Gunadhi described the sort-merge-based algorithm EJ-1. EJ-1 assumes that the input relations are sorted primarily by their surrogate attributes and secondarily by their starting timestamps. Like TEJ-1, the result is produced by a single scan of both input relations.
3.2.3 Partition-based algorithms
As in sort-merge-based algorithms, partition-based algorithms have two distinct phases. In the first phase, the input relations are partitioned based on their join attribute values. The partitioning is performed so that a given bucket produced from one input relation contains tuples that can only match with tuples contained in the corresponding bucket of the other input relation. Each produced bucket is also intended to fill the allotted main memory. Typically, a hash function is used as the partitioning agent. Both relations are filtered through the same hash function, producing two parallel sets of buckets. In the second phase, the join is computed by comparing tuples in corresponding buckets of the input relations. Partition-based algorithms have been shown to have superior performance when the relative sizes of the input relations differ [18].

A partitioning algorithm for the temporal natural join is shown in Fig. 7. The algorithm accepts as input two relations r and s and the names of the explicit join attributes C. We assume that the schemas of r and s are as given in Sect. 2.

As can be seen, the explicit partition-based join algorithm is conceptually very simple. One relation is designated the outer relation, the other the inner relation. After partitioning, each bucket of the outer relation is read in turn. For a given "outer bucket," each page of the corresponding "inner bucket" is read, and tuples in the buffers are joined.

The partitioning step in Fig. 7 is performed by the partition algorithm. This algorithm takes as its first argument an input relation. The resulting n partitions are returned in the remaining parameters. Algorithm partition assumes that a hash function hash is available that accepts the join attribute values x[C] as input and returns an integer, the index of the target bucket, as its result.
for i ← 1 to n
    outer bucket ← read partition(r_i);
    for each page p ∈ s_i
        p ← read page(s_i);
        for each tuple x ∈ outer bucket
            for each tuple y ∈ p
                if x[C] = y[C] ∧ intersect(x[T], y[T]) then
                    add the join of x and y, timestamped with overlap(x[T], y[T]), to the result

Fig. 7. Partition-based evaluation of the temporal natural join
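An in-memory Python rendering of this scheme follows; bucket spooling to disk and memory-sized buckets are elided, and the (C, A, T) tuple format matches the earlier sketches.

def partition(rel, n):
    # divide tuples into n buckets by hashing the join attribute value
    buckets = [[] for _ in range(n)]
    for tup in rel:
        buckets[hash(tup[0]) % n].append(tup)
    return buckets

def explicit_partition_join(r, s, n=16):
    result = []
    # join only corresponding buckets: tuples in bucket i of r can
    # match only tuples in bucket i of s
    for r_i, s_i in zip(partition(r, n), partition(s, n)):
        for c_r, a, t_r in r_i:
            for c_s, b, t_s in s_i:
                if c_r == c_s:
                    t = overlap(t_r, t_s)
                    if t is not None:
                        result.append((c_r, a, b, t))
    return result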
3.3 Timestamp algorithms

In contrast to the algorithms of the previous section, timestamp algorithms perform their primary matching on the timestamps associated with tuples.

In this section, we enumerate, to the best of our knowledge, all existing timestamp-based evaluation algorithms for the temporal join operators described in Sect. 2. Many of these algorithms assume sort ordering of the input by either their starting or ending timestamps. While such assumptions are valid for many applications, they are not valid in the general case, as valid-time semantics allows correction and deletion of previously stored data. (Of course, in such cases one could re-sort within the join.) As before, all of the algorithms described here are derived from nested-loop, sort-merge, or partitioning; we do not consider index-based temporal joins.
3.3.1 Nested-loop-based timestamp algorithms
One timestamp nested-loop-based algorithm has been proposed for temporal join evaluation. Like the EJ-1 algorithm described in the previous section, Segev and Gunadhi developed their algorithm, EJ-2, for the EVENT JOIN [47,20] (Table 1).

EJ-2 does not assume any ordering of the input relations. It does assume that the explicit join attribute is a distinguished surrogate attribute and that the input relations are in Temporal First Normal Form (1TNF). Essentially, 1TNF ensures that tuples within a single relation that have the same surrogate value may not overlap in time.

EJ-2 simultaneously produces the natural join and left outerjoin in an initial phase and then computes the right outerjoin in a subsequent phase.

For the first phase, the inner relation is scanned once from front to back for each outer relation tuple. For a given outer relation tuple, the scan of the inner relation is terminated when the inner relation is exhausted or the outer tuple's timestamp has been completely overlapped by matching inner tuples. The outer tuple's natural join is produced as the scan progresses. The outer tuple's left outerjoin is produced by tracking the subintervals of the outer tuple's timestamp that are not overlapped by any inner tuples. An output tuple is produced for each subinterval remaining at the end of the scan. Note that main memory buffer space must be allocated to contain the nonoverlapped subintervals of the outer tuple.
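The subinterval bookkeeping can be sketched as follows in Python; the function name and representation are ours, not from the original description of EJ-2. Each matching inner interval is subtracted from the pieces of the outer tuple's timestamp that remain uncovered, and whatever pieces survive the scan yield the null-padded output tuples.

def subtract(pieces, iv):
    # remove the closed interval iv from every uncovered piece
    out = []
    for ps, pe in pieces:
        if iv[1] < ps or iv[0] > pe:     # disjoint: piece survives intact
            out.append((ps, pe))
            continue
        if iv[0] > ps:
            out.append((ps, iv[0] - 1))  # fragment to the left of iv
        if iv[1] < pe:
            out.append((iv[1] + 1, pe))  # fragment to the right of iv
    return out

# e.g., subtract([(1, 10)], (4, 6)) == [(1, 3), (7, 10)]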
In the second phase, the roles of the inner and outer relations are reversed. Now, since the natural join was produced during the first phase, only the right outerjoin needs to be computed. The right outerjoin tuples are produced in the same manner as above, with one small optimization: if it is known that a tuple of the (current) outer relation did not join with any tuples during the first phase, then no scanning of the inner relation is required, and the corresponding outerjoin tuple is produced immediately.

Incidentally, Zurek proposed several algorithms for evaluating the temporal Cartesian product on multiprocessors based on nested loops [57].
3.3.2 Sort-merge-based timestamp algorithms
To date, four sets of researchers – Segev and Gunadhi, Leung and Muntz, Pfoser and Jensen, and Rana and Fotouhi – have developed timestamp sort-merge algorithms. Additionally, a one-dimensional spatial join algorithm proposed by Arge et al. can be used to implement a temporal Cartesian product.

Segev and Gunadhi modified the traditional merge-join algorithm to support the T-join and the temporal equijoin [47,20]. We describe the algorithms for each of these operators in turn.

For the T-join, the relations are sorted in ascending order of starting timestamp. The result is produced by a single scan of the input relations.

For the temporal equijoin, two timestamp sorting algorithms, named TEJ-3 and TEJ-4, are presented. Both TEJ-3 and TEJ-4 assume that their input relations are sorted by starting timestamp only. TEJ-4 is applicable only if the equijoin condition is on the surrogate attribute. In addition to assuming that the input relations are sorted by their starting timestamps, TEJ-4 assumes that all tuples with the same surrogate value are linked, thereby allowing all tuples with the same surrogate to be retrieved when the first is found. The result is produced with a linear scan of both relations, with random access needed to traverse the surrogate chains.
Like TEJ-2, TEJ-3 is applicable for temporal equijoins on both surrogate and explicit attribute values. TEJ-3 assumes that the input relations are sorted in ascending order of their starting timestamps, but no sort order is assumed on the explicit join attributes. Hence TEJ-3 requires that the inner relation scan be backed up should consecutive tuples in the outer relation have overlapping interval timestamps.
Leung and Muntz developed a series of algorithms based on the sort-merge algorithm to support temporal join predicates such as "contains" and "intersect" [1]. Although their algorithms do not explicitly support predicates on nontemporal attribute values, their techniques are easily modified to support more complex join operators such as the temporal equijoin. Like Segev and Gunadhi, this work describes evaluation algorithms appropriate for different sorting assumptions and access paths.

Leung and Muntz use a stream-processing approach. Abstractly, the input relations are considered as sequences of time-sorted tuples where only the tuples at the front of the streams may be read. The ordering of the tuples is a tradeoff with the amount of main memory needed to compute the join. For example, Leung and Muntz show how a contain join [1] can be computed if the input streams are sorted in ascending order of their starting timestamp. They summarize, for various sort orders on the starting and ending timestamps, what tuples must be retained in main memory during the join computation. A family of algorithms is developed assuming different orderings (ascending/descending) of the starting and ending timestamps.
Leung and Muntz also show how checkpoints, essentially the set of tuples valid during some chronon, can be used to evaluate temporal joins where the join predicate implies some overlap between the participating tuples. Here, the checkpoints actually contain tuple identifiers (TIDs) for the tuples valid during the specified chronon and the TIDs of the next tuples in the input streams. Suppose a checkpoint exists at time t. Using this checkpoint, the set of tuples participating in a join over a time interval containing t can be computed by using the cached TIDs and "rolling forward" using the TIDs of the next tuples in the streams.
Rana and Fotouhi proposed several techniques to improve the performance of time-join algorithms in which they claimed to use a nested-loop approach [43]. Since they assumed the input relations were sorted by start time and/or end time, these algorithms are more like the second phase of sort-merge-based timestamp algorithms. The algorithms are very similar to the sort-merge-based algorithms developed by Segev and Gunadhi.
Arge et al. described the interval join, a one-dimensional spatial join algorithm, which is a building block of a two-dimensional rectangle join [2]. Each interval is defined by a lower boundary and an upper boundary. The problem is to report all intersections between an interval in the outer relation and an interval in the inner relation. If the interval is a time interval instead of a spatial interval, this problem is equivalent to the temporal Cartesian product. They assumed the two input relations were first sorted by the algorithm into one list by their lower boundaries. The algorithm maintains two initially empty lists of tuples with "active" intervals, one for each input relation. When the sorted list is scanned, the current tuple is put into the active list of the relation it belongs to and joins only with the tuples in the active list of the other relation. Tuples becoming inactive during scanning are removed from the active list.
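The following Python sketch renders this active-list scheme under the assumptions of the earlier sketches (each tuple carries an inclusive (start, end) interval); eagerly pruning expired intervals stands in for the original's list maintenance.

def interval_join(r, s):
    # merge both inputs into one event list ordered by lower boundary
    events = [(x[1][0], 0, x) for x in r] + [(y[1][0], 1, y) for y in s]
    events.sort(key=lambda e: e[0])
    active = ([], [])            # active tuples from r (side 0) and s (side 1)
    result = []
    for start, side, tup in events:
        other = 1 - side
        # intervals that ended before `start` can no longer intersect anything
        active[other][:] = [z for z in active[other] if z[1][1] >= start]
        active[side].append(tup)
        for z in active[other]:  # every surviving active tuple intersects tup
            result.append((tup, z) if side == 0 else (z, tup))
    return result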
Most recently, Pfoser and Jensen [41] applied the sort-merge approach to the temporal theta join in a setting where each argument relation consists of a noncurrent and a current partition. Tuples in the former all have intervals that end before the current time, while all tuples of the latter have intervals that end at the current time. They assume that updates arrive in time order, so that tuples in noncurrent partitions are ordered by their interval end times and tuples in current partitions are ordered by their interval start times. A join then consists of three different kinds of subjoins. They develop two join algorithms for this setting and subsequently use these algorithms for incremental join computation.

As can be seen from the above discussion, a large number of timestamp-based sort-merge algorithms have been proposed, some for specific join operators. However, each of these proposals has been developed largely in isolation from other work, with little or no cross comparison. Furthermore, published performance figures have been derived mainly from analytical models rather than from empirical observations. An empirical comparison, as provided in Sect. 5, is needed to truly evaluate the different proposals.

3.3.3 Partition-based timestamp algorithms
Partitioning a relation over explicit attributes is relatively straightforward if the partitioning attributes have discrete values. Partitioning over time is more difficult since our timestamps are intervals, i.e., range data, rather than discrete values. Previous timestamp partitioning algorithms therefore developed various means of range partitioning the time intervals associated with tuples.

In previous work, we described a valid-time join algorithm using partitioning [54]. This algorithm was presented in the context of evaluating the valid-time natural join, though it is easily adapted to other temporal joins. The range partitioning used by this algorithm mapped tuples to singular buckets and dynamically migrated the tuples to other buckets as needed during the join computation. This approach avoided data redundancy, and the associated I/O overhead, at the expense of more complex buffer management.
Sitzmann and Stuckey extended this algorithm by using histograms to decide the partition boundaries [49]. Their algorithm takes the number of long-lived tuples into consideration, which renders its performance insensitive to the number of long-lived tuples. However, it relies on a preexisting temporal histogram.
Lu et al. described another range-partitioning algorithm for computing temporal joins [33]. This algorithm is applicable to theta joins, where a result tuple is produced for each pair of input tuples with overlapping valid-time intervals. Their approach is to map intervals to a two-dimensional plane, which is then partitioned into regions. The join result is produced by computing the subjoins of pairs of partitions corresponding to adjacent regions in the plane. This method applies to a restricted temporal model where future time is not allowed. They utilize a spatial index to speed up the joining phase.
Table 3. Existing algorithms and taxonomy counterparts

Algorithm | Researchers | Taxonomy counterpart | Restrictions
TEJ-1 | Segev and Gunadhi | Explicit/timestamp sort | Surrogate attribute and 1TNF
TEJ-2 | Segev and Gunadhi | Explicit/timestamp sort | None
EJ-2 | Segev and Gunadhi | Nested-loop | Surrogate attribute and 1TNF
EJ-1 | Segev and Gunadhi | Explicit/timestamp sort | Surrogate attribute and 1TNF
Time-join | Segev and Gunadhi | Timestamp sort | None
TEJ-3 | Segev and Gunadhi | Timestamp sort | None
TEJ-4 | Segev and Gunadhi | Timestamp sort | Surrogate attribute/access chain
Several | Leung and Muntz | Timestamp sort | None
Two | Pfoser and Jensen | Timestamp sort | Partitioned relation; time-ordered updates
– | Sitzmann and Stuckey | Timestamp partition | Requires preexisting temporal histogram
– | Lu et al. | Timestamp partition | Disallows future time; uses spatial index
3.4 Summary
We have surveyed temporal join algorithms and proposed a taxonomy of such algorithms. The taxonomy was developed by adapting well-established relational query evaluation paradigms to the temporal operations.

Table 3 summarizes how each temporal join algorithm proposed in previous work is classified in the taxonomy. We believe that the framework is complete since, disregarding data-model-specific considerations, all previous work naturally fits into one of the proposed categories.
One important property of an algorithm is whether it delivers a partial answer before the entire input is read. Among the algorithms listed in Table 3, only the nested-loop algorithm has this property. Partition-based algorithms have to scan the whole input relation to produce the partitions. Similarly, sort-based algorithms have to read the entire input to sort the relation. We note, however, that it is possible to modify the temporal sort-based algorithms to be nonblocking, using the approach of progressive merge join [10].
4 Engineering the algorithms
As noted in the previous section, an adequate empirical investigation of the performance of temporal join algorithms has not been performed. We concentrate on the temporal equijoin, defined in Sect. 2.4. This join and the related temporal natural join are needed to reconstruct normalized temporal relations [25]. To perform a study of implementations of this join, we must first provide state-of-the-art implementations of the 19 different types of algorithms outlined for this join. In this section, we discuss our implementation choices.
4.1 Nested-loop algorithm
We implemented a simple block-oriented nested-loop algorithm. Each block of the outer relation is read in turn into memory. The outer block is sorted by the explicit joining attribute (actually, pointers are sorted to avoid copying of tuples). Each block of the inner relation is then brought into memory. For a given inner block, each tuple in that block is joined by binary searching the sorted tuples.

This algorithm is simpler than the nested-loop algorithm, EJ-2, described in Sect. 3.3.1 [20,47]. In particular, our algorithm computes only the valid-time equijoin, while EJ-2 computes the valid-time outerjoin, which includes the equijoin in the form of the valid-time natural join. However, our algorithm supports a more general equijoin condition than EJ-2 in that we support matching on any explicit attribute rather than solely on a designated surrogate attribute.
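A minimal in-memory Python sketch of this variant follows, with blocks modeled as lists and the binary search provided by the standard bisect module; buffering and I/O are elided, and the tuple format matches the earlier sketches.

from bisect import bisect_left

def block_nested_loop_join(r_blocks, s_blocks):
    # r_blocks, s_blocks: lists of blocks, each a list of (c, attrs, interval)
    result = []
    for br in r_blocks:
        outer = sorted(br, key=lambda x: x[0])  # "sort the pointers"
        keys = [x[0] for x in outer]
        for bs in s_blocks:
            for c, b, t_s in bs:
                i = bisect_left(keys, c)        # binary search for c
                while i < len(outer) and keys[i] == c:
                    _, a, t_r = outer[i]
                    t = overlap(t_r, t_s)
                    if t is not None:
                        result.append((c, a, b, t))
                    i += 1
    return result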
4.2 Sort-merge algorithms

4.2.1 Basic algorithm

The sort phase proceeds by generating many small, fully sorted runs and then repeatedly merges these into increasingly longer runs until a single run is obtained (this is done for the left-hand side and the right-hand side independently). Each step of the sort phase reads and writes the entire relation. The merge phase then scans the fully sorted left-hand and right-hand relations to produce the output relation. A common optimization is to stop the sorting phase one step early, when there is a small number of fully sorted runs; the final merge step is done in parallel with the merge phase of the join, thereby avoiding one read and one write scan. Our sort-merge algorithms implemented for the performance analysis are based on this optimization. We generated initial runs using an in-memory quicksort on the explicit attributes (ES and ES-H), the timestamp attributes (TS and TS-H), or both (ETS and ETS-H) and then merged the two relations on multiple runs.
4.2.2 Efficient skew handling
As noted in Sect. 3.2.2, sort-merge join algorithms become complicated when the join attributes are not key attributes. Our previous work on conventional joins [30] shows that intrinsic skew is generally present in this situation. Even a small amount of intrinsic skew can result in a significant performance hit because the naive approach to handling skew is to reread the previous tuples in the same value packet (the tuples containing identical values for the equijoin attribute); this rereading involves additional I/O operations. We previously proposed several techniques to handle skew efficiently [30]. Among them, SC-n (spooled cache on multiple runs) was recommended due to its strikingly better performance in the presence of skew for both conventional and band joins; it also exhibits virtually identical performance to a traditional sort-merge join in the absence of skew. SC-n uses a small cache to hold the skewed tuples from the right-hand relation that satisfy the join condition. At the cache's overflow point, the cache data are spooled to disk.
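The essence of the spooled cache is easy to state in code. The sketch below is a minimal Python rendition under assumptions of our own (a tuple-count capacity and pickle-based spooling); the actual SC-n algorithm additionally coordinates the cache with the multiple sorted runs.

    import pickle
    import tempfile

    class SpooledCache:
        """A small in-memory cache that spools to disk on overflow (sketch)."""

        def __init__(self, capacity):
            self.capacity = capacity          # maximum number of cached tuples
            self.tuples = []
            self.spool = None                 # overflow file, created lazily

        def add(self, t):
            if len(self.tuples) >= self.capacity:
                if self.spool is None:
                    self.spool = tempfile.TemporaryFile()
                pickle.dump(self.tuples, self.spool)   # spool the full batch to disk
                self.tuples = []
            self.tuples.append(t)

        def scan(self):
            """Yield spooled tuples first, then the in-memory ones."""
            if self.spool is not None:
                self.spool.seek(0)
                while True:
                    try:
                        batch = pickle.load(self.spool)
                    except EOFError:
                        break
                    yield from batch
                self.spool.seek(0, 2)          # restore the append position
            yield from self.tuples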
Skew is prevalent in temporal joins. SC-n can be adapted for temporal joins by adding a supplemental predicate (requiring that the tuples overlap) and calculating the resulting timestamps by intersection. We adopt this spooled cache in ES instead of rereading the previous tuples. The advantage of using a spooled cache is shown in Fig. 8; ES Reread is the multirun version of the explicitSortMerge algorithm exhibited in Sect. 3.2.2, which backs up the right-hand relation when a duplicate value is found in the left-hand relation.
The two algorithms were executed in the TimeIT system. The parameters are the same as those that will be used in Sect. 5.1. In this experiment, the memory size was fixed at 8 MB and the cache size at 32 KB. The relations were generated with different percentages of smooth skew on the explicit attribute. A relation has 1% smooth skew when 1% of the tuples in the relation have one duplicate value on the join attribute and the remaining 98% of the tuples have no duplicates. Since the cache can hold the skewed tuples in memory, no additional I/O is caused by backing up the relation. The performance improvement of using a cache is approximately 25% when the data have 50% smooth skew. We thus use a spooled cache to handle skew. Spooling will generally not occur but is available in case a large value packet is present.
4.2.3 Time-varying value packets and optimized prediction rule
ES utilizes a prediction rule to judge if skew is present. (Recall that skew occurs if two tuples have the same join attribute value.) The prediction rule works as follows: when the last tuple in the right-hand relation (RHR) buffer is visited, the last tuple in the left-hand relation (LHR) buffer is checked to determine if skew is present and the current RHR value packet needs to be put into the cache.
We also implemented an algorithm (TS) that sorts the input relations by start time rather than by the explicit join attribute. Here the RHR value packet associated with a specific LHR tuple is not composed of those RHR tuples with the same start time but rather of those RHR tuples that overlap with the interval of the LHR tuple. Hence value packets are not disjoint, and they grow and shrink as one scans the LHR. In particular, TS puts into the cache only those tuples that could overlap in the future: the tuples that do not stop too early, that is, before subsequent LHR tuples start. For an individual LHR tuple, the RHR value packet starts with the first tuple that stops sometime during the LHR tuple's interval and goes through the first RHR tuple that starts after the LHR tuple stops. Value packets are also not totally ordered when sorting by start time.

These considerations suggest that we change the prediction rule in TS. When the RHR reaches a block boundary, the maximum stop time in the current value packet is compared with the start time of the last tuple in the LHR buffer. If the maximum stop time of the RHR value packet is less than the last start time of the LHR, none of the tuples in the value packet will overlap with the subsequent LHR tuples, so there is no need to put them in the cache. Otherwise, the value packet is scanned and only those tuples with a stop time greater than the last start time of the LHR are put into the cache, thereby minimizing the utilization of the cache and thus the possibility of having to spool a value packet to disk.
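A sketch of this optimized prediction rule follows, assuming tuples are represented as (start, stop, payload) triples and a cache object like the one sketched in Sect. 4.2.2; all names are illustrative.

    def cache_surviving_tuples(value_packet, lhr_buffer, cache):
        """At an RHR block boundary, cache only the value-packet tuples that
        can still overlap later LHR tuples (sketch)."""
        last_lhr_start = lhr_buffer[-1][0]       # start time of last LHR tuple in buffer
        max_stop = max(stop for _, stop, _ in value_packet)
        if max_stop < last_lhr_start:
            return                               # nothing can overlap subsequent LHR tuples
        for t in value_packet:
            if t[1] > last_lhr_start:            # stop time reaches upcoming LHR tuples
                cache.add(t)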
To make our work complete, we also implemented TES, which sorts the input relations primarily by start time and secondarily by the explicit attribute. The logic of TES is exactly the same as that of TS for the joining phase. We expect that the extra sorting by the explicit attribute will not help to optimize the algorithm but rather will simply increase the CPU time.
4.2.4 Specialized cache purging

Since the cache size is small, it can fill up if a value packet is very large or if several value packets accumulate in the cache. For the former, nothing but spooling the cache can be done. However, purging the cache periodically can avoid unnecessary cache spooling for the latter and may result in fewer I/O operations.

Purging the cache costs more in TS, since the RHR value packets are not disjoint, whereas in ES they are disjoint both in each run and in the cache. The cache purging process in ES scans the cache from the beginning and stops as soon as the first tuple that belongs to the current value packet is met. But in TS, this purging stage cannot stop until the whole cache has been scanned, because the tuples belonging to the current value packet are spread across the cache. An inner long-lived tuple can be kept in the cache for a long time because its time interval may intersect with many LHR tuples.
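The asymmetry between the two purge procedures can be sketched as follows, again under illustrative assumptions about the tuple layout (ES caches are ordered by the explicit attribute; TS caches hold (start, stop, payload) triples).

    def purge_cache_es(cache, current_key, key):
        """ES purge (sketch): the cache is ordered by the explicit attribute,
        so the scan stops at the first tuple of the current value packet."""
        keep = []
        for i, t in enumerate(cache):
            if key(t) == current_key:
                keep = cache[i:]                 # this and later tuples are still needed
                break
        cache[:] = keep

    def purge_cache_ts(cache, last_lhr_start):
        """TS purge (sketch): current value-packet tuples are spread across
        the cache, so the whole cache must be scanned."""
        cache[:] = [t for t in cache if t[1] > last_lhr_start]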
Fig. 9 Performance improvement of using a heap in ES

4.2.5 Using a heap
As stated in Sect. 4.2.1, the final step of sorting is done in parallel with the merging stage. Assuming the two relations are sorted in ascending order, in the merging stage the algorithm first has to find the smallest value from the multiple sorted runs of each relation and then compare the two values to see if they can be joined. The simplest way to find the smallest value is to scan the current value of each run; if the relation is divided into m runs, the cost of selecting the smallest value is O(m). A more efficient way is to use a heap to select the smallest value, at a cost of O(log2 m).

At the beginning of the merging step, the heap is built from the value of the first tuple in each run. Whenever advance is called, the run currently on top of the heap advances its reading pointer to the next tuple. Since the key value of this tuple is no less than that of the current tuple, it is propagated down to maintain the heap structure. When a run is backed up, its reading pointer is restored to point to a previously visited tuple, which has a smaller key value and thus is propagated up the heap.
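A minimal Python sketch of the heap-based selection follows, using the standard heapq module; real runs are disk-resident and support backing up, which the plain iterators used here do not.

    import heapq

    class RunMerger:
        """Select the smallest current tuple among m sorted runs in O(log2 m)
        time (sketch); each run is modeled as an iterator."""

        def __init__(self, runs, key):
            self.key = key
            self.heap = []
            for rid, run in enumerate(runs):     # build the heap from each run's first tuple
                it = iter(run)
                first = next(it, None)
                if first is not None:
                    heapq.heappush(self.heap, (key(first), rid, first, it))

        def advance(self):
            """Pop the overall smallest tuple and refill from the same run."""
            if not self.heap:
                return None
            _, rid, t, it = heapq.heappop(self.heap)
            nxt = next(it, None)
            if nxt is not None:                  # the successor sifts into place
                heapq.heappush(self.heap, (self.key(nxt), rid, nxt, it))
            return t

With m runs, each advance costs O(log2 m) comparisons instead of the O(m) scan of the simple approach.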
When the memory size is relatively small, which implies that the size of each run is small and therefore that a relation has to be divided into more runs (the number of runs m is large), the performance with a heap will be much better than that without a heap. However, using a heap incurs some pointer swaps when sifting a tuple down or propagating it up, which are not needed in the simple algorithm. When the memory size is sufficiently large, the performance with a heap will be close to, or even worse than, that of the simple algorithm.
Figure 9 shows the total CPU time of ES when using and not using a heap. The data used in Fig. 9 are two 64-MB relations, joined with different sizes of memory; note that the CPU time includes both the sorting step and the merging step. As expected, the performance with a heap is better than that without a heap when the memory is small: the improvement is roughly 40% when the memory size is 2 MB. The performance difference decreases as the memory increases, and when the memory size is greater than 32 MB, one half of the relation size, using a heap has no benefit. Since using a heap significantly improves performance when memory is relatively small and barely degrades it when memory is large, we use a heap in all sort-based algorithms.

4.2.6 GRACE and hybrid variants
We implemented both GRACE and hybrid versions of each sort-based algorithm. In the GRACE variants, all the sorted runs of a relation are written to disk before the merging stage. The hybrid variants keep most of the last run of the outer relation in memory, which guarantees that one (multiblock) disk read and one disk write of the memory-resident part are saved. When the available memory is only slightly smaller than the dataset, the hybrid algorithms thus require relatively fewer I/O operations.
4.2.7 Adapting the interval join
We consider the interval join a variant of the timestamp sort-merge algorithm (TS); in this paper, we call it TSI and its hybrid variant TSI-H. To be fair, we do not assume the input relations are sorted into one list. Instead, TSI begins with sorting as its first step and then combines the last step of the sort with the merge step. The two active lists are essentially two spooled caches, one for each relation; each cache has the same size as that in TS. This differs from the strategy of keeping a single block of each list in the original paper. A small cache can save more memory for the input buffer, thus reducing the random reads; however, it causes more cache spools when skew is present. Since timestamp algorithms tend to encounter skew, we chose a cache size the same as that in TS rather than one block.
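The merge step of the interval join can be sketched as follows; left and right are iterators over (start, stop, payload) tuples sorted by start time, the two Python lists stand in for the spooled caches, and the equijoin's explicit-attribute test is folded into the matches predicate. This is a sketch under our own assumptions, not the paper's TSI code.

    def interval_join_merge(left, right, matches, intersect):
        """Merge step of the interval join (TSI) on start-time-sorted streams
        (sketch).  matches(a, b) is the join predicate; intersect(a, b) is the
        intersection of the two valid-time intervals."""
        out, lcache, rcache = [], [], []
        li, ri = next(left, None), next(right, None)
        while li is not None or ri is not None:
            # Consume the stream whose next tuple starts earlier.
            if ri is None or (li is not None and li[0] <= ri[0]):
                t, li = li, next(left, None)
                rcache[:] = [c for c in rcache if c[1] >= t[0]]   # purge expired tuples
                out.extend((t, c, intersect(t, c)) for c in rcache if matches(t, c))
                lcache.append(t)
            else:
                t, ri = ri, next(right, None)
                lcache[:] = [c for c in lcache if c[1] >= t[0]]
                out.extend((c, t, intersect(c, t)) for c in lcache if matches(c, t))
                rcache.append(t)
        return out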
4.3 Partition-based algorithms
Several engineering considerations also arise when implementing the partition-based algorithms.

4.3.1 Partitioning details

The details of algorithm TP are described elsewhere [54]. We changed TP to use a slightly larger input buffer (32 KB) and a cache for the inner relation (also 32 KB) instead of using a one-page buffer and cache. The rest of the available main memory is used for the outer relation. There is a tradeoff between a large outer input buffer and a large inner input buffer and cache. A large outer input buffer implies a large partition size, which results in fewer seeks for both relations, but the cache is more likely to spool. On the other hand, allocating a large cache and a large inner input buffer results in a smaller outer input buffer and thus a smaller partition size, which increases random I/O. We chose 32 KB instead of 1 KB (the page size) as a compromise. The identification of the best cache size is given in Sect. 6 as a direction of future research.
The algorithms ETP and TEP partition the input relations in two steps. ETP partitions the relations by explicit attribute first. For each pair of buckets to be joined, if neither fits in memory, a further partition by timestamp attribute is applied to these buckets to increase the possibility that the resulting buckets do not overflow the available buffer space. TEP is similar to ETP except that it partitions the relations in the reverse order, first by timestamp and then, if necessary, by explicit attribute.
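A sketch of the two-step ETP idea for a single relation follows; the bucket counts, capacities, and the start-time-based second step are illustrative assumptions (in the full algorithm the same scheme is applied to both inputs, and a bucket pair is repartitioned only when neither bucket fits in memory).

    def etp_partition(relation, key, start_of, n_buckets,
                      mem_capacity, n_slices, lifespan):
        """Two-step ETP partitioning (sketch): hash on the explicit attribute,
        then repartition any overflowing bucket by timestamp."""
        buckets = [[] for _ in range(n_buckets)]
        for t in relation:                        # step 1: explicit-attribute hash
            buckets[hash(key(t)) % n_buckets].append(t)

        final = []
        width = max(1, lifespan // n_slices)
        for b in buckets:
            if len(b) <= mem_capacity:
                final.append(b)
            else:                                 # step 2: repartition by start time
                slices = [[] for _ in range(n_slices)]
                for t in b:
                    slices[min(start_of(t) // width, n_slices - 1)].append(t)
                final.extend(slices)
        return final

TEP would simply apply the same two steps in the opposite order.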
4.3.2 Joining the partitions
The partition-based algorithms perform their second phase, the joining of corresponding partitions of the outer and inner relations, as follows. The outer partition is fetched into memory, assuming that it will not overflow the available buffer space, and pointers to the outer tuples are sorted using an in-memory quicksort. The inner partition is then scanned, using all memory pages not occupied by the outer partition. For each inner tuple, matching outer tuples are found by binary search. If the outer partitions overflow the available buffer space, the algorithms default to an explicit-attribute sort-merge join of the corresponding partitions.
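The joining phase then reduces to the following pattern; this is a sketch, and the parameter names and the fallback hook are illustrative.

    import bisect

    def join_partition_pair(outer_part, inner_part, key, overlaps, intersect,
                            mem_capacity, sort_merge_fallback):
        """Join one pair of corresponding partitions (sketch)."""
        if len(outer_part) > mem_capacity:        # overflow: default to sort-merge
            return sort_merge_fallback(outer_part, inner_part)
        idx = sorted(range(len(outer_part)), key=lambda i: key(outer_part[i]))
        keys = [key(outer_part[i]) for i in idx]  # sorted pointers to outer tuples
        out = []
        for s in inner_part:                      # linear scan of the inner partition
            j = bisect.bisect_left(keys, key(s))  # binary search for matches
            while j < len(keys) and keys[j] == key(s):
                r = outer_part[idx[j]]
                if overlaps(r, s):
                    out.append((r, s, intersect(r, s)))
                j += 1
        return out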
4.3.3 GRACE and hybrid variants
In addition to the conventional GRACE algorithms, we implemented hybrid buffer management for each partition-based algorithm. In the hybrid algorithms, one outer bucket is designated as memory-resident, and its buffer space is increased accordingly to hold the whole bucket in memory. When the inner relation is partitioned, the inner tuples that map to the corresponding bucket are joined immediately with the tuples in the memory-resident bucket. This eliminates the I/O operations needed to write and read one bucket of tuples for both the inner and the outer relation. As with the hybrid sort-based algorithms, the hybrid partition-based algorithms are expected to perform better when the input relation is only slightly larger than the available memory.
4.4 Supporting the iterator interface
Most commercial systems implement the relational operators as iterators [18]. In this model, each operator is realized by three procedures called open, next, and close. The algorithms we investigate in this paper can be redesigned to support the iterator interface.
The nested-loop algorithm and the explicit partitioning algorithms are essentially the corresponding snapshot join algorithms, except that a supplemental predicate (requiring that the tuples overlap) and the calculation of the resulting timestamps are added in the next procedure.
The timestamp partitioning algorithms determine the periods for the partitions by sampling the outer relation and partition the input relations in the open procedure. The next procedure calls the next procedure of nested-loop join for each pair of partitions; an additional predicate is added in the next procedure to determine if a tuple should be put into the cache.

The sort-based algorithms generate the initial sorted runs for the input relations and merge runs until only the final merge step is left in the open procedure. In the next procedure, the inner runs, the cache, and the outer runs are scanned to find a match; at the same time, the inner tuple is examined to decide whether to put it in the cache. The close procedure destroys the input runs and deallocates the cache. The open and close procedures of the interval join algorithms are the same as those of the other sort-based algorithms; the next procedure gets the next tuple from the sorted runs, scans the cache to find the matching tuples, and purges the cache at the same time.
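In outline, any of the joins above can be wrapped in the iterator model as follows; the class and its hooks are a sketch of ours, not the actual TimeIT interface.

    class TemporalJoinIterator:
        """open/next/close wrapper for a temporal join operator (sketch)."""

        def __init__(self, setup, produce, teardown):
            self.setup, self.produce, self.teardown = setup, produce, teardown
            self.stream = None

        def open(self):
            # e.g., sort-based variants generate and pre-merge their runs here;
            # timestamp partitioning samples the outer relation and partitions.
            state = self.setup()
            self.stream = self.produce(state)     # a generator over result tuples

        def next(self):
            # Deliver one result tuple at a time; None signals exhaustion.
            return next(self.stream, None)

        def close(self):
            # Destroy runs and deallocate caches in a real implementation.
            self.teardown()
            self.stream = None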
5 Performance
We implemented all 19 algorithms enumerated in Table 2 and tested their performance under a variety of conditions, including skewed explicit and timestamp distributions, varying timestamp durations, memory allocations, and database sizes. We ensured that all algorithms generated exactly the same output tuples in all of the experiments (though the ordering of the tuples may differ).
The remainder of this section is organized as follows. We first give details on the join algorithms used in the experiments and then describe the parameters used in the experiments. Sections 5.2 to 5.9 contain the actual results of the experiments, and Sect. 5.10 summarizes the results.
5.1 Experimental setup
The experiments were developed and executed using the TimeIT [17] system, a software package supporting the prototyping of temporal database components. Using TimeIT, we fixed several parameters describing all test relations used in the experiments; these parameters and their values are shown in Table 4. In all experiments, tuples were 16 bytes long and consisted of two explicit attributes, both integers occupying 4 bytes each, and two integer timestamps, each also requiring 4 bytes. Only one of the explicit attributes was used as the joining attribute. This yields result tuples that are 24 bytes long, consisting of the explicit attributes of both input tuples (16 bytes) and 8 bytes for the timestamps.
We fixed the relation size at 64 MB, giving four million tuples per relation. We were less interested in the absolute relation size than in the ratio of input size to available main memory. Similarly, the ratio of the page size to the main memory size and the relation size is more relevant than the absolute page size; a scaling of these factors would provide similar results.
In all cases, the generated relations were randomly ordered with respect to both their explicit and timestamp attributes. The cost metrics used for all experiments are listed in Table 5.

Table 4 System characteristics

Tuples per relation           4 million
Timestamp size ([s,e])        8 bytes
Explicit attribute size       8 bytes
Relation lifespan             1,000,000 chronons
Output buffer size            32 KB
Cache size in sort-merge      64 KB
Cache size in partitioning    32 KB

Table 5 Cost metrics

Sequential I/O cost       1 ms
Random I/O cost           10 ms
Join attribute compare    20 ns
Timestamp compare         20 ns
Pointer compare           20 ns
Pointer swap              60 ns

In a modern computer system, a random disk access takes about 10 ms, whereas accessing a main memory location typically takes less than 60 ns [42]; it is reasonable to assume that a sequential I/O takes about one tenth the time of a random I/O. Modern computer systems usually have a hardware data cache, which reduces the CPU time on a cache hit. We therefore chose the join attribute compare time to be 20 ns, slightly less than, but of the same order of magnitude as, the memory access time; the cost metric we used is the average memory access time given a high cache hit ratio (>90%). The CPU cache may have a lower hit ratio when running some of the algorithms, but the order of magnitude of the memory access time will not change. We assumed that the sizes of both a timestamp and a pointer are the same as the size of an integer; thus their compare times equal that of the join attribute. A pointer swap takes three times as long as a pointer compare because it needs to access three pointers, and a tuple move takes four times as long as an integer compare since the size of a tuple is four times that of an integer.
We measured both main memory operations and disk I/O operations. To eliminate any undesired system effects from the results, all operations were counted using facilities provided by TimeIT. For disk operations, random and sequential accesses were measured separately. We included the cost of writing the output relation in the experiments since the sort-based and partition-based algorithms exhibit dual random and sequential I/O patterns when sorting/coalescing and partitioning/merging. The total time was then computed by weighting each operation count by the time values listed in Table 5.
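Concretely, the weighting amounts to a dot product of the operation counts with the Table 5 constants; a minimal sketch follows (the operation names and the counts mapping are illustrative, not TimeIT's actual interface).

    # Cost weights from Table 5, expressed in seconds.
    COSTS = {
        "seq_io":     1e-3,    # sequential I/O: 1 ms
        "rand_io":    10e-3,   # random I/O: 10 ms
        "compare":    20e-9,   # attribute/timestamp/pointer compare: 20 ns
        "ptr_swap":   60e-9,   # pointer swap: three pointer accesses
        "tuple_move": 80e-9,   # tuple move: four integer compares
    }

    def total_time(counts):
        """counts maps an operation name to how often it occurred."""
        return sum(COSTS[op] * n for op, n in counts.items())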
Table 6 summarizes the values of the system parameters that varied among the different experiments; each row of the table identifies the figures that illustrate the results of the experiments run with those parameters. The reader may have the impression that the intervals are so small that they behave almost like standard equijoin attributes: do tuples overlap each other at all? In many cases, we performed a self-join, which guaranteed for each tuple in one relation that there is at least one matching tuple in the other relation. Long-duration timestamps (100 chronons) were used in two experiments, guaranteeing that on average four tuples were valid in each chronon. Two other experiments examine the case where one relation has short-duration timestamps and the other has long-duration timestamps. Our experiments thus examined several different degrees of overlap.

5.2 Simple experiments
In this section, we perform three "base case" experiments, where the join selectivity is low, i.e., for an equijoin of valid-time relations r and s, a given tuple x ∈ r joins with one, or few, tuples y ∈ s. The experiments incorporate random data distributions in the explicit join attributes and short and long time intervals in the timestamp attributes.
5.2.1 Low explicit selectivity with short timestamps
In this experiment, we generated a relation with little explicit matching and little overlap and joined the relation with itself. This mimics a foreign key-primary key natural join in that the cardinality of the result is the same as that of one of the input relations. The relation size was fixed at 64 MB, corresponding to four million tuples. The explicit joining attribute values were integers drawn from a space of 2^31 − 1 values; for the given cardinality, a particular explicit attribute value appeared on average in only one tuple in the relation. The starting timestamp attribute values were randomly distributed over the relation lifespan, and the duration of the interval associated with each tuple was set to one chronon. We ran each of the 19 algorithms using the generated relation, increasing the main memory allocation from 2 MB, a 1:32 memory-to-input-size ratio, to 64 MB, a 1:1 ratio.
The results of the experiment are shown in Fig. 10. In each panel, the ordering of the legend corresponds to the order of either the rightmost or the leftmost points of each curve. The actual values of each curve in all the figures may be found in the appendix of the associated technical report [16]. Note that both the x-axis and the y-axis are log-scaled. As suspected, nested loop is clearly not competitive. The general nested-loop algorithm performs very poorly in all cases but the highest memory allocation; at the smallest memory allocation, the least expensive algorithm, EP, enjoys an 88% performance advantage. Only at the highest memory allocation, that is, when the entire left-hand side relation fits in main memory, does the nested-loop algorithm have performance comparable to the other algorithms. Given the disparity in performance, and given that various characteristics, such as skew or the presence of long-duration tuples, do not impact the performance of the nested-loop algorithm, we will not consider this algorithm in the remainder of this section.
To get a better picture of the performance of the remaining algorithms, we plot them separately in Fig. 11. From this figure on, we eliminate the noncompetitive nested loop. We group the algorithms that have similar performance and retain only a representative curve for each group in the figures. In this figure, TES-H and TSI-H have performance very similar to that of TS-H; ETS-H has performance very similar to that of ES-H; ETP-H, TP-H, and TEP-H have performance similar to that of EP-H; the remaining algorithms all have performance similar to that of EP.
Table 6 Experiment parameters

Fig. 10 Low explicit selectivity, low timestamp selectivity

In this graph, only the x-axis is log-scaled. The sort-based and partition-based algorithms exhibit largely the same performance, and the hybrid algorithms outperform their GRACE counterparts at high memory allocations, in this case when the ratio of main memory to input size reaches approximately 1:8 (2 MB of main memory) or 1:4 (4 MB of main memory). The poor performance of the hybrid algorithms at lower allocations stems from reserving buffer space to hold the resident run/partition, which takes buffer space away from the remaining runs/partitions, causing the algorithms to incur more random I/O. At small memory allocations, the problem is acute. Therefore, the hybrid group starts from a higher position and ends in a lower position, while the GRACE group behaves in the opposite way.
The performance differences between the sort-based algorithms and their partitioning counterparts are small, and there is no absolute winner. TES, the sort-merge algorithm that sorts the input relation primarily by start time and secondarily by explicit attribute, performs slightly worse than TS, which sorts the input relation by start time only. Since the order of the start times is not the order of the time intervals, the extra sorting by explicit attribute does not help in the merging step; the program logic is the same as for TS, except for the extra sorting. We expect TES to always perform a little worse than TS. Therefore, neither TES nor TES-H will be considered in the remainder of this section.
5.2.2 Long-duration timestamps
In the experiment described in the previous section, the join selectivity was low since explicit attribute values were shared among few tuples and tuples were timestamped with intervals of short duration. We repeated the experiment using long-duration timestamps. The duration of each tuple timestamp was fixed at 100 chronons, and the starting timestamps were randomly distributed throughout the relation lifespan. As before, the explicit join attribute values were randomly distributed integers; thus the size of the result was just slightly larger due to the long-duration timestamps.

Fig. 11 Low explicit selectivity, low timestamp selectivity (without NL)

Fig. 12 Low explicit selectivity (long-duration timestamps)
The results are shown in Fig. 12, where the x-axis is log-scaled. In this figure, the group of ES-H and ETS-H is represented by ES-H; the group of ETP-H, EP-H, TEP-H, and TP-H by TP-H; the group of TP, TEP, ES, ETS, EP, and ETP by ES; and the rest are retained. The timestamp sorting algorithms, TS and TS-H, suffer badly. Here, the long duration of the tuple lifespans did not cause overflow of the tuple cache used in these algorithms. To see this, recall that our input relation cardinality was four million tuples. For a 1,000,000-chronon relation lifespan, this implies that 4,000,000/1,000,000 = 4 tuples arrive per chronon. Since tuple lifespans were fixed at 100 chronons, it follows that 4 × 100 = 400 tuples must be scanned before any purging of the tuple cache can occur. However, a 64-KB tuple cache, capable of holding 4000 tuples, does not tend to overflow; detailed examination verified that the cache never overflowed in these experiments. The poor performance of TS and TS-H is instead caused by the repeated in-memory processing of the long-lived tuples.
TSI and TSI-H also suffer in the case of long durations but fare better than TS and TS-H when the main memory size is small. TSI improves on the performance of TS by 32% at the smallest memory allocation, while TSI-H improves on TS-H by 13%. Our detailed results show that TS had slightly less I/O time than TSI; TS also saved some time in tuple moving since it did not move every tuple into the cache. However, it spent much more time in timestamp comparing and pointer moving. In TSI, each tuple joined only with the tuples in the cache of the other relation, and the caches in TSI were purged during the join process; thus the number of timestamp comparisons needed for the next tuple was reduced. In TS, an outer tuple joined with both cache tuples and tuples in the input buffer of the inner relation, and the input buffer was never purged; therefore, TS had to compare more timestamps. Pointer moving is needed in the heap maintenance, which is used to sort the current tuples in each run. TS frequently backed up the inner runs inside the inner buffer and scanned the tuples in the value packets multiple times; in each scan, the heap for the inner runs had to sort the current inner tuples again. In TSI, the tuples are sorted once and kept in order in the caches, so the heap overhead is small. When the main memory size is small, the number of runs is large, as are the heap size and the heap overhead.
The timestamp partitioning algorithms, TP and TP-H, have performance very similar to that described in Sect. 5.2.1. There are two main causes of the good performance of TP and TP-H. The first is that TP does not replicate long-lived tuples that overlap with multiple partition intervals; otherwise, TP would need more I/O for the replicated tuples. The second is that TP sorts each partition by the explicit attribute, so the long durations do not have any effect on the performance of the in-memory joining. All the other algorithms sort or partition the relations by explicit attributes; therefore, their performance is not affected by the long durations.
We may conclude from this experiment that the timestamp sort-based algorithms are quite sensitive to the durations of the input tuple intervals. When tuple durations are long, the in-memory join in TS and TS-H performs poorly due to the need to repeatedly back up the tuple pointers.
5.2.3 Short- and long-duration timestamps
In the experiments described in the previous two sections, the timestamps are either short or long for both relations. We examined the case where the durations for the two input relations differ. The duration of each tuple timestamp in the outer relation was fixed at 1 chronon, while that in the inner relation was fixed at 100 chronons. We carefully generated the two relations so that the outer relation and the inner relation had a one-to-one relationship: for each tuple in the outer relation, there is one tuple in the inner relation that has the same value of the explicit attributes and the same start time as the outer tuple, but with a long duration instead of a short one. This guaranteed that the selectivity was between that of the two previous experiments. As before, the explicit join attribute values and the start times were randomly distributed.
Fig. 13 Low explicit selectivity (short-duration timestamps join long-duration timestamps)

The results are shown in Fig. 13, where the x-axis is log-scaled. The groups of the curves are the same as in Fig. 12, and the relative positions of the curves are similar to those in the long-duration experiment. The performance of the timestamp sorting algorithms was again worse than that of the others, but better than in the experiment where long-duration tuples were in both input relations. Long-duration tuples reduce the size of the value packets for each tuple on only one side and therefore result in fewer timestamp comparisons in all four timestamp sorting algorithms and fewer backups in TS and TS-H.

We also exchanged the outer and inner relations for this experiment and observed results identical to those in Fig. 13. This indicates that whether the long-duration tuples exist in the outer relation or the inner relation has little impact on the performance of any algorithm.
5.3 Varying relation sizes
It has been shown for snapshot join algorithms that the relative sizes of the input relations can greatly affect which sort- or partition-based strategy is best [18]. We investigated this phenomenon in the context of valid-time databases.
We generated a series of relations, increasing in size from 4 MB to 64 MB, and joined them with a 64-MB relation. The memory allocation used in all trials was 16 MB, the size at which all algorithms performed most closely in Fig. 11. As in the previous experiments, the explicit join attribute values in all relations were randomly distributed integers. Short-duration timestamps were used to mitigate the in-memory effects on TS and TS-H seen in Fig. 12; as before, starting timestamps were randomly distributed over the relation lifespan. Since the nested-loop algorithm is expected to be a competitor when one of the relations fits in memory, we incorporated this algorithm into this experiment. The results of the experiment are shown in Fig. 14. In this figure, ES represents all the GRACE sorting algorithms, ES-H all the hybrid sorting algorithms, EP all the GRACE partitioning algorithms, TP-H the hybrid timestamp partitioning algorithms, and EP-H the hybrid explicit partitioning algorithms; NL is retained.
Fig. 14 Different relation sizes (short-duration timestamps)

The impact of a differential in relation sizes on the partition-based algorithms is clear. When an input relation is small relative to the available main memory, the partition-based algorithms use this relation as the outer relation and build an in-memory partition table from it. The inner relation is then linearly scanned, and for each inner tuple the in-memory partition table is probed for matching outer tuples. The benefit of this approach is that each relation is read only once, i.e., no intermediate writing and reading of generated partitions occurs. Indeed, the inner relation is not partitioned at all, further reducing main memory costs in addition to the I/O savings.

The nested-loop algorithm has the same I/O costs as the partition-based algorithms when one of the input relations fits in the main memory. When the size of the smaller input relation is twice as large as the memory size, the performance of the nested-loop algorithm is worse than that of any other algorithm; this is consistent with the results shown in Fig. 10.

An important point to note is that this strategy is beneficial regardless of the distribution of either the explicit join attributes and/or the timestamp attributes, i.e., it is unaffected by either explicit or timestamp skew. Furthermore, no similar optimization is available for sort-based algorithms: since each input relation must be sorted, both relations must be read and written once to generate sorted runs and subsequently read once to scan and match joining tuples.
To further investigate the effectiveness of this strategy, we repeated the experiment of Fig. 14 with long-duration timestamps, i.e., tuples were timestamped with intervals 100 chronons in duration. We did not include the nested-loop algorithm because we did not expect the long-duration tuples to have any impact on it. The results are shown in Fig. 15. The grouping of the curves in this figure is slightly different from that in Fig. 14 in that the timestamp sorting algorithms are separated instead of grouped together.

As expected, long-duration timestamps adversely affect the performance of all the timestamp sorting algorithms, for the reasons stated in Sect. 5.2.2. The performance of TSI and TSI-H is slightly better than that of TS and TS-H, respectively; this is consistent with the results at the 16-MB memory size in Fig. 12. Replotting the remaining algorithms in Fig. 16 shows that the long-duration timestamps do not significantly impact the efficiency of the other algorithms.
In both the short-duration and the long-duration cases, the hybrid partitioning algorithms show the best performance. They save about half of the I/O operations of their GRACE counterparts when the size of the outer relation is 16 MB. This is due to the hybrid strategy.
Fig. 15 Different relation sizes (long-duration timestamps)

Fig. 16 Different relation sizes (long-duration timestamps, without TS/TS-H)
We further changed the input relations so that the tuples in the outer relation have a fixed short duration of 1 chronon and those in the inner relation have a fixed long duration of 100 chronons; the other features of the input relations remain the same. The results, shown in Fig. 17, are very similar to those of the long-duration case, though the performance of the timestamp sorting algorithms is slightly better than in Fig. 15. Again, we regenerated the relations such that the tuples in the outer relation have the long duration fixed at 100 chronons and those in the inner relation have the short duration fixed at 1 chronon; the results are almost identical to those shown in Fig. 17.
The graph shows that partition-based algorithms should be chosen whenever the size of one or both of the input relations is small relative to the available buffer space. We conjecture that the choice between explicit partitioning and timestamp partitioning is largely dependent on the presence or absence of skew in the explicit and/or timestamp attributes. Explicit and timestamp skew may or may not increase I/O cost; however, they will increase main memory searching costs for the corresponding algorithms, as we now investigate.
Fig. 17 Different relation sizes (short- and long-duration timestamps)
5.4 Explicit attribute skew
As in the experiments described in Sect. 5.3, we fixed the main memory allocation at 16 MB to place all algorithms on a nearly even footing. The inner and outer relation sizes were fixed at 64 MB each. We generated a series of outer relations with increasing explicit attribute skew, from 0% to 100% in 20% increments. Here we generated data with chunky skew: an explicit attribute has, e.g., 20% chunky skew when 20% of the tuples in the relation have the same explicit attribute value. Explicit skew was ensured by generating tuples with the same explicit join attribute value. Short-duration timestamps, randomly distributed over the relation lifespan, were used to mitigate the long-duration timestamp effect on the timestamp sorting algorithms. The results are shown in Fig. 18. In this figure, TSI, TS, TEP, and TP are represented by TS and their hybrid counterparts by TS-H; the other algorithms are retained.
There are three points to emphasize in this graph. First, the explicit partitioning algorithms, i.e., EP, EP-H, ETP, and ETP-H, show increasing costs as the explicit skew increases. The performance of EP and EP-H degrades dramatically with increasing explicit skew. This is due to the overflowing of main memory partitions, causing subsequent buffer thrashing. The effect, while pronounced, is relatively small since only one of the input relations is skewed; encountering skew in both relations would exaggerate the effect. Although the performance of ETP and ETP-H also degrades, the changes are much less pronounced because these algorithms employ time partitioning to reduce the effect of explicit attribute skew.
As expected, the group of algorithms that perform sorting or partitioning on timestamps, i.e., TS, TS-H, TP, TP-H, TEP, and TEP-H, have relatively flat performance. By ordering or partitioning on time, these algorithms avoid effects due to the explicit attribute distribution.

The explicit sorting algorithms, ES, ES-H, ETS, and ETS-H, perform very well. In fact, the performance of ES and ES-H improves as the skew increases: as the skew increases, the relations become increasingly sorted by default, so ES and ES-H expend less effort during run generation.

Fig. 18 (x-axis: explicit skew in outer relation, percentage)

We conclude from this experiment that if high explicit skew is present in one input relation, then explicit sorting, timestamp partitioning, and timestamp sorting appear to be the better alternatives. The choice among these is then dependent on the distribution and the length of the tuple timestamps, which can increase the amount of timestamp skew present in the input, as we will see in the next experiment.
5.5 Timestamp skew
Like the explicit attribute distribution, the distribution of timestamp attribute values can greatly impact the efficiency of the different algorithms. We now describe a study of this effect.
As in the experiments described in Sect. 5.3, we fixed the main memory allocation at 16 MB and the sizes of all input relations at 64 MB. We fixed one relation with randomly distributed explicit attributes and randomly distributed tuple timestamps, and we generated a series of relations with increasing timestamp attribute chunky skew, from 0% to 100% in 20% increments. A timestamp attribute has, e.g., 20% chunky skew when 20% of the tuples in the relation fall in one value packet. The skew was created by generating tuples with the same interval timestamp. Short-duration timestamps were used in all relations to mitigate the long-duration timestamp effect on the timestamp sorting algorithms, and explicit join attribute values were distributed randomly. The results of the experiment are shown in Fig. 19. In this figure, all the GRACE explicit algorithms are represented by EP, the hybrid explicit sorting algorithms by ES-H, and the hybrid explicit partitioning algorithms by EP-H; the remaining algorithms are retained.
Four interesting observations may be made. First, as expected, the timestamp partitioning algorithms, i.e., TP, TEP, TP-H, and TEP-H, suffered increasingly poorer performance as the amount of timestamp skew increased; this skew causes overflowing partitions. The performance of all four of these algorithms is good when the skew is 100% because TP and TP-H then become explicit sort-merge joins and TEP and TEP-H become explicit partition joins. Second, TSI and TSI-H also exhibited poor performance as the timestamp skew increased because 20% skew in the outer relation caused the outer cache to overflow. Third, TS and TS-H show improved performance at the highest skew percentage; this is due to the sortedness of the input, analogous to the behavior of ES and ES-H in the previous experiment. Finally, as expected, the remaining algorithms have flat performance across all trials.
When timestamp skew is present, timestamp partitioning is a poor choice. We expected this result, as it is analogous to the behavior of partition-based algorithms in conventional databases, and similar results have been reported for temporal coalescing. The interval join algorithms are also bad choices when the amount of timestamp skew is large; a small amount of timestamp skew can be handled efficiently by increasing the cache size in the interval join algorithms. We will discuss this issue again in Sect. 5.8. Therefore, the two main dangers to good performance are explicit attribute skew and/or timestamp attribute skew. We investigate the effects of simultaneous skew next.
5.6 Combined explicit/timestamp attribute skew
Again, we fixed the main memory allocation at 16 MB and set the input relation sizes at 64 MB. Timestamp durations were set to 1 chronon to mitigate the long-duration timestamp effect on the timestamp sorting algorithms. We then generated a series of relations with increasing explicit and timestamp chunky skew, from 0% to 100% in 20% increments. Skew was created by generating tuples with the same explicit joining attribute value and tuple timestamp; the explicit skew and the timestamp skew are orthogonal.

Fig. 19 (x-axis: timestamp skew in outer relation, percentage)

Fig. 20 (x-axis: explicit and timestamp skew in outer relation, percentage)

The results are shown in Fig. 20. In this figure, ETS, ES, and TS are represented by ES; ETS-H, ES-H, and TS-H by ES-H; the other algorithms are retained.
The algorithms are divided into three groups in terms of performance. As expected, most of the partition-based algorithms and the interval join algorithms, i.e., TEP, TEP-H, TP, TP-H, EP, EP-H, TSI, and TSI-H, show increasingly poorer performance as the explicit and timestamp skew increases. The remaining explicit/timestamp sorting algorithms show relatively flat performance across all trials, and the explicit sorting and timestamp sorting algorithms exhibit improving performance as the skew increases, analogous to their performance in the experiments described in Sects. 5.4 and 5.5. While the elapsed time of ETP and ETP-H increases slowly with increasing skew, these two algorithms perform very well; this is analogous to their performance in the experiments described in Sect. 5.4.
5.7 Explicit attribute skew in both relations
In previous work [30], we studied the effect of data skew on the performance of sort-merge joins. There are three types of skew: outer relation skew, inner relation skew, and dual skew. Outer skew occurs when value packets in the outer relation cross buffer boundaries. Similarly, inner skew occurs when value packets in the inner relation cross buffer boundaries. Dual skew indicates that outer skew occurs in conjunction with inner skew. While outer skew does not cause any problems for TS and TS-H, it degrades the performance of TSI and TSI-H; dual skew degrades the performance of the TS and TS-H joins. In this section, we compare the performance of the join algorithms in the presence of dual skew in the explicit attribute.

Fig. 21 (x-axis: explicit skew in both relations, percentage)

Fig. 22 Explicit attribute skew in both relations
The main memory allocation was fixed at 16 MB and the size of all input relations at 64 MB. We generated a series of relations with increasing explicit attribute chunky skew, from 0% to 4% in 1% increments. To ensure dual skew, we performed a self-join on these relations. Short-duration timestamps, randomly distributed over the relation lifespan, were used to mitigate the long-duration timestamp effect on the timestamp sorting algorithms. The results are shown in Fig. 21. In this figure, all the explicit partitioning algorithms are represented by EP, all the timestamp partitioning algorithms by TP, and all the sort-merge algorithms except ES and ES-H by TS; ES and ES-H are retained.
There are three points to discuss regarding this graph. First, the explicit algorithms, i.e., ES, ES-H, EP, EP-H, ETP, and ETP-H, suffer when the skew increases. Although the numbers of I/O operations of these algorithms increase along with the increasing skew, the I/O-incurred difference between the highest and the lowest skew is only 2 s, and the difference in the output relation size between the highest and the lowest skew is only 460 KB, which leads to about a 4.6-s performance difference. What, then, is the real reason for the performance hit of these algorithms? Detailed examination revealed that it is the in-memory operations that cause the poor performance: when data skew is present, these algorithms have to do substantial in-memory work to perform the join. This is illustrated in Fig. 22, which shows the CPU time used by each algorithm. To present the difference clearly, we do not use a log-scale y-axis. Note that six algorithms, i.e., ETS, TSI, TS, ETS-H, TSI-H, and TS-H, have very low CPU cost (less than 30 s) in all cases, so their performance does not degrade when the degree of skew increases.

Second, the performance of the timestamp partitioning algorithms, i.e., TP, TP-H, TEP, and TEP-H, degrades with increasing skew, but not as badly as that of the explicit algorithms. Although the timestamp partitioning algorithms sort each partition by the explicit attribute, the explicit attribute inside each partition is not highly skewed. For example, if n tuples have the same value of the explicit attribute, they will be put into one partition after being hashed in EP; in the join phase, there will be an n × n loop within the join. In TP, this value packet will be distributed evenly across partitions. Assuming there are m partitions, each partition will have n/m of these tuples, which leads to an (n/m)² loop within the join per partition. Summed over the m partitions, the total number of join operations in TP is n²/m, which is 1/m of that of EP. This factor can be seen in Fig. 22.
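In symbols, with a value packet of n tuples spread evenly over m partitions, the counting argument reads:

    \underbrace{n \cdot n}_{\text{EP: one partition}} = n^2,
    \qquad
    \underbrace{m \cdot \left(\frac{n}{m}\right)^{2}}_{\text{TP: } m\ \text{partitions}}
      = \frac{n^2}{m} = \frac{1}{m} \cdot n^2 .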
Finally, the timestamp sorting algorithms, i.e., TS, TS-H, TSI, TSI-H, ETS, and ETS-H, perform very well under explicit skew. TS and TS-H use the timestamp only to determine if a backup is needed; TSI and TSI-H use the timestamp only to determine if the cached tuples should be removed. We see the benefit of the secondary sorting on the timestamp in the algorithms ETS and ETS-H: since these two algorithms define the value packet by both the explicit attribute and the timestamp, the big loop in the join phase is avoided.

From this experiment, we conclude that when explicit dual skew is present, all the explicit algorithms are poor choices except for ETS and ETS-H. The effects of timestamp dual skew are examined next.
5.8 Timestamp dual skew
Like explicit dual skew, timestamp dual skew can affect the performance of the timestamp sort-merge join algorithms. We now examine this effect.

We fixed main memory at 16 MB and the input relations at 64 MB. We generated a series of relations with increasing timestamp chunky skew, from 0% to 4% in 1% increments. To ensure dual skew, we performed a self-join on these relations. Short-duration timestamps, randomly distributed over the relation lifespan, were used to mitigate the long-duration timestamp effect on the timestamp sorting algorithms. The explicit attribute values were also distributed randomly. The results are shown in Fig. 23. In this figure, the GRACE explicit sort-merge algorithms are represented by ES; all the hybrid partitioning algorithms by EP-H; TSI and TSI-H by TSI; and TEP, TP, ETP, EP, ETS-H, and ES-H by EP; the remaining algorithms are retained.

Fig. 23 (x-axis: timestamp skew in both relations, percentage)

Fig. 24 Timestamp attribute skew in both relations
The algorithms fall into three groups. All the timestamp sort-merge algorithms exhibit poor performance; however, the performance of TS and TS-H is much better than that of TSI and TSI-H. At the highest skew, the performance of TS is 174 times better than that of TSI. This is due to cache overflow in TSI: one percent of 64 MB is 640 KB, which is ten times the cache size. The interval join algorithm scans and purges the cache once for every tuple to be joined, and cache thrashing occurs when the cache overflows. As before, there is no cache overflow in TS and TS-H. The performance gap between these two algorithms and the group with flat curves is caused by in-memory join operations. The CPU time used by each algorithm is plotted separately in Fig. 24. In this figure, all the explicit sort-merge algorithms are represented by ES, all the explicit partitioning algorithms by EP, all the timestamp partitioning algorithms by TP, and TSI and TSI-H by TSI; the remaining algorithms are retained. Since all but the timestamp sort-merge algorithms perform the in-memory join by sorting the relations or the partitions on the explicit attribute, their performance is not at all affected by dual skew.
Fig. 25 Explicit/timestamp attribute skew in both relations
It is interesting that the CPU time spent by TSI is less than that spent by TS. The poor overall performance of TSI due to cache overflow can be improved by increasing the cache size of TSI. TSI actually performs the join operation in the two caches rather than in the input buffers; therefore, a large cache size can be chosen when dual skew is present to avoid cache thrashing. In this case, a 1-MB cache size for TSI would result in performance similar to that of TS.
5.9 Explicit/timestamp dual skew
In this section, we investigate the simultaneous effect of dual skew in both the explicit attribute and the timestamp. This is a challenging situation for any temporal join algorithm.

The memory size is 16 MB, and we generated a series of 64-MB relations with increasing explicit and timestamp chunky skew, from 0% to 4% in 1% increments. Dual skew was guaranteed by performing a self-join on these relations. The results are shown in Fig. 25. In this figure, TSI and TSI-H are represented by TSI; TS and ES by ES; TS-H and ES-H by TS-H; all the explicit partitioning algorithms by EP; and the remaining algorithms by TP.
The interesting point is that all the algorithms are affected by the simultaneous dual skew in both the explicit and timestamp attributes, but they fall into two groups. The algorithms that are sensitive to dual skew in either the explicit attribute or the timestamp attribute perform as badly as they do in the experiments described in Sects. 5.7 and 5.8. The performance of the algorithms not affected by dual skew in either the explicit attribute or the timestamp attribute degrades with increasing skew; however, their performance is better than that of the algorithms in the first group. This is due to the orthogonality of the explicit skew and the timestamp skew.

5.10 Summary
The performance study described in this section is the first comprehensive empirical analysis of temporal join algorithms. We investigated the performance of 19 non-index-based join algorithms for the temporal equijoin, namely, nested loop (NL), explicit partitioning (EP and EP-H), explicit sorting (ES and ES-H), timestamp sorting (TS and TS-H), interval join (TSI and TSI-H), timestamp partitioning (TP and TP-H), combined explicit/timestamp sorting (ETS and ETS-H) and timestamp/explicit sorting (TES and TES-H), and combined explicit/timestamp partitioning (ETP and ETP-H) and timestamp/explicit partitioning (TEP and TEP-H). We varied the following main aspects in the experiments: the presence of long-duration timestamps, the relative sizes of the input relations, and the explicit-join and timestamp attribute distributions.

The findings of this empirical analysis can be summarized as follows.
• The algorithms need to be engineered well to avoid performance hits. Care needs to be taken in sorting, in purging the cache, in selecting the next tuple in the merge step, in allocating memory, and in handling intrinsic skew.
• Nested loop is not competitive.
• The timestamp sorting algorithms, TS, TS-H, TES, TES-H, TSI, and TSI-H, were also not competitive. They were quite sensitive to the duration of the input tuple timestamps, and TSI and TSI-H had very poor performance in the presence of large amounts of skew due to cache overflow.
• The GRACE variants were competitive only when there was low selectivity and a large memory size relative to the size of the input relations. In all other cases, the hybrid variants performed better.
• In the absence of explicit and timestamp skew, our results parallel those from conventional query evaluation. In particular, when attribute distributions are random, all sorting and partitioning algorithms (other than those already eliminated as noncompetitive) have nearly equivalent performance, irrespective of the particular attribute type used for sorting or partitioning.
• In contrast with previous results on temporal coalescing [5], the binary nature of the valid-time equijoin allows an important optimization for partition-based algorithms: when one input relation is small relative to the available main memory buffer space, the partitioning algorithms have uniformly better performance than their sort-based counterparts.
• The choice of timestamp or explicit partitioning depends on the presence or absence of skew in either attribute dimension. Interestingly, the performance differences are dominated by main memory effects; the timestamp partitioning algorithms were less affected by increasing skew.
• ES and ES-H were sensitive to explicit dual skew.
• The performance of the partition-based algorithms EP and EP-H was affected by both outer and dual explicit attribute skew.
• The performance of TP and TP-H degraded when outer skew was present. Except for this one situation, these partition-based algorithms are generally more efficient than their sort-based counterparts since sorting, and the associated main memory operations, are avoided.
• It is interesting that the combined explicit/timestamp-based algorithms can mitigate the effect of either explicit attribute skew or timestamp skew. However, when dual skew was present in the explicit attribute and the timestamp simultaneously, the performance of all the algorithms degraded, though again less so for timestamp partitioning.

6 Conclusions and research directions
As a prelude to investigating non-index-based temporal join evaluation, this paper initially surveyed previous work, first describing the different temporal join operations proposed in the past and then describing the join algorithms proposed in previous work. The paper then developed evaluation strategies for the valid-time equijoin and compared these strategies in a sequence of empirical performance studies. The specific contributions are as follows.
• We defined a taxonomy of all temporal join operators proposed in previous research. The taxonomy is a natural one in the sense that it classifies the temporal join operators as extensions of conventional operators, irrespective of special joining attributes or other model-specific restrictions. The taxonomy is thus model independent and assigns a name to each temporal operator consistent with its extension of a conventional operator.
• We extended the three main paradigms of query evaluation algorithms to temporal databases, thereby defining the space of possible temporal evaluation algorithms.
• Using the taxonomy of temporal join algorithms, we defined 19 temporal equijoin algorithms, representing the space of all such possible algorithms, and placed all existing work into this framework.
• We defined the space of database parameters that affect the performance of the various join algorithms. This space is characterized by the distribution of the explicit and timestamp attributes in the input relations, the duration of timestamps in the input relations, the amount of main memory available to the join algorithm, the relative sizes of the input relations, and the amount of dual attribute and/or timestamp skew for each of the relations.
• We empirically compared the performance of the algorithms over this parameter space.
Our empirical study showed that some algorithms can be eliminated from further consideration: NL, TS, TS-H, TES, TES-H, ES, ES-H, EP, and EP-H. Hybrid variants generally dominated GRACE variants, eliminating ETP, TEP, and TP. When the relation sizes were different, explicit sorting (ETS, ETS-H, ES, ES-H) performed poorly.
This leaves three algorithms, all partitioning ones: ETP-H, TEP-H, and TP-H. Each dominates the other two in certain circumstances, but TP-H performs poorly in the presence of timestamp and attribute skew and is significantly more complicated to implement. Of the other two, ETP-H came out ahead more often than TEP-H. Thus we recommend ETP-H, a hybrid variant of explicit partitioning that partitions primarily by the explicit attribute. If this attribute is skewed so that some buckets do not fit in memory, a further partition on the timestamp attribute increases the possibility that the resulting buckets will fit in the available buffer space.
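To make the recommendation concrete, the following Python sketch illustrates the two-level bucket-splitting idea behind ETP-H. It is a simplified illustration rather than the implementation evaluated in this paper; the names (partition_etp_h, bucket_capacity) are ours, and a full implementation would also have to replicate tuples whose validity intervals cross timestamp-range boundaries.

# A minimal sketch of the bucket splitting behind ETP-H (hypothetical
# names). Tuples are (explicit_attr, start, end, payload).
def partition_etp_h(relation, num_buckets, bucket_capacity):
    # Level 1: hash-partition on the explicit join attribute.
    buckets = [[] for _ in range(num_buckets)]
    for tup in relation:
        buckets[hash(tup[0]) % num_buckets].append(tup)
    partitions = []
    for bucket in buckets:
        if len(bucket) <= bucket_capacity:
            partitions.append(bucket)        # fits in the buffer: keep as is
        else:
            # Level 2: explicit-attribute skew; split the oversized bucket
            # further by ranges of the start timestamp.
            bucket.sort(key=lambda t: t[1])
            partitions.extend(bucket[i:i + bucket_capacity]
                              for i in range(0, len(bucket), bucket_capacity))
    return partitions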
The salient point of this study is that simple modifications to an existing conventional evaluation algorithm (EP) can be used to effect temporal joins with acceptable performance and at relatively small development cost. While novel algorithms (such as TP-H) may have better performance in certain circumstances, well-understood technology can be easily adapted and will perform acceptably in many situations. Hence database vendors wishing to implement temporal join may do so at a relatively low development cost and still achieve acceptable performance.
The above conclusion focuses on independent join operations rather than a query consisting of several algebraic operations. Given the correlation between various operations, the latter is more complex. For example, one advantage of sort-merge algorithms is that the output is also sorted, which can be exploited in subsequent operations. This interesting order is used in traditional query optimization to reduce the cost of the whole query. We believe temporal query optimization can also take advantage of this [50]. Among the sort-merge algorithms we have examined, the output of the explicit algorithms (ES, ES-H, ETS, ETS-H) is sorted by the explicit join attribute; the interval join algorithms produce output sorted by the start timestamp. Of these six algorithms, we recommend ETS-H due to its higher efficiency.
Several directions for future work exist. Important problems remain to be addressed in temporal query processing, in particular with respect to temporal query optimization. While several researchers have investigated algebraic query optimization, little research has appeared with respect to cost-based temporal query optimization.
In relation to query evaluation, additional investigation of the algorithm space described in Sect. 5 is needed. Many optimizations originally developed for conventional databases, such as read-ahead and write-behind buffering, forecasting, eager and lazy evaluation, and hash filtering, should be applied and investigated. Cache size and input buffer allocation tuning is also an interesting issue.
All of our partitioning algorithms generate maximal partitions, of the main-memory size minus a few blocks, for the left-hand relation of the join and then apply that partitioning to the right-hand relation. In the join step, a full left-hand partition is brought into main memory and joined with successive blocks from the associated right-hand partition. Sitzmann and Stuckey term this a static buffer allocation strategy and instead advocate a dynamic buffer allocation strategy in which the left-hand and right-hand relations are partitioned in one step, so that two partitions, one from each relation, can simultaneously fit in the main memory buffer [49]. The advantage over the static strategy is that fewer seeks are required to read the right-hand side partition; the disadvantage is that this strategy results in smaller, and thus more numerous, partitions, which increases the number of seeks, and requires that the right-hand side also be sampled, which further increases the number of seeks. It might be useful to augment the timestamp partitioning to incorporate dynamic buffer allocation, though it is not clear at the outset that this would yield a performance benefit over our TP-H algorithm or over ETP-H.
Dynamic buffer allocation for conventional joins was first proposed by Harris and Ramamohanarao [22]. They built cost models for nested-loop and hash join algorithms with the size of the buffers as one of the parameters. Then, for each algorithm, they computed the optimal, or suboptimal but still good, buffer allocation that led to the minimum join cost. Finally, the optimal buffer allocation was used to perform the join. It would be interesting to see if this strategy can improve the performance of temporal joins. It would also be useful to develop cost models for the most promising temporal join algorithm(s), starting with ETP-H.
The next logical progression in future work is to extend this work to index-based temporal joins, again investigating the effectiveness of both explicit attribute indexing and timestamp indexing. While a large number of timestamp indexes have been proposed in the literature [44] and there has been some work on temporal joins that use temporal or spatial indexes [13,33,52,56], a comprehensive empirical comparison of these algorithms is needed.
Orthogonally, more sophisticated techniques for temporal database implementation should be considered. In particular, we expect specialized temporal database architectures to have a significant impact on query processing efficiency. It has been argued in previous work that incremental query evaluation is especially appropriate for temporal databases [24,34,41]. In this approach, a query result is materialized and stored back into the database if it is anticipated that the same query, or one similar to it, will be issued in the future. Updates to the contributing relations trigger corresponding updates to the stored result. The related topic of global query optimization, which attempts to exploit commonality between multiple queries when formulating a query execution plan, also has yet to be explored in a temporal setting.
Acknowledgements. This work was sponsored in part by National Science Foundation Grants IIS-0100436, CDA-9500991, EAE-0080123, IRI-9632569, and IIS-9817798, by the NSF Research Infrastructure Program Grants EIA-0080123 and EIA-9500991, by the Danish National Centre for IT-Research, and by grants from Amazon.com, the Boeing Corporation, and the Nykredit Corporation. We also thank Wei Li and Joseph Dunn for their help in implementing the temporal join algorithms.
References

3 Beckmann N, Kriegel HP, Schneider R, Seeger B (1990) The R*-tree: an efficient and robust access method for points and rectangles. In: Proceedings of the ACM SIGMOD conference, Atlantic City, NJ, 23–25 May 1990, pp 322–331
4 van den Bercken J, Seeger B (1996) Query processing techniques for multiversion access methods. In: Proceedings of the international conference on very large databases, Mumbai (Bombay), India, 3–6 September 1996, pp 168–179
5 Böhlen MH, Snodgrass RT, Soo MD (1997) Temporal coalescing. In: Proceedings of the international conference on very large databases, Athens, Greece, 25–29 August 1997, pp 180–191
6 Clifford J, Croker A (1987) The historical relational data model (HRDM) and algebra based on lifespans. In: Proceedings of the international conference on data engineering, Los Angeles, 3–5 February 1987, pp 528–537. IEEE Press, New York
7 Clifford J, Croker A (1993) The historical relational data model (HRDM) revisited. In: Tansel A, Clifford J, Gadia S, Jajodia S, Segev A, Snodgrass RT (eds) Temporal databases: theory, design, and implementation, ch 1. Benjamin/Cummings, Reading, MA, pp 6–27
8 Clifford J, Uz Tansel A (1985) On an algebra for historical relational databases: two views. In: Proceedings of the ACM SIGMOD international conference on management of data, Austin, TX, 28–31 May 1985, pp 1–8
9 DeWitt DJ, Katz RH, Olken F, Shapiro LD, Stonebraker MR, Wood D (1984) Implementation techniques for main memory database systems. In: Proceedings of the ACM SIGMOD international conference on management of data, Boston, 18–21 June 1984, pp 1–8
10 Dittrich JP, Seeger B, Taylor DS, Widmayer P (2002) Progressive merge join: a generic and non-blocking sort-based join algorithm. In: Proceedings of the conference on very large databases, Madison, WI, 3–6 June 2002, pp 299–310
11 Dunn J, Davey S, Descour A, Snodgrass RT (2002) Sequenced subset operators: definition and implementation. In: Proceedings of the IEEE international conference on data engineering, San Jose, 26 February–1 March 2002, pp 81–92
12 Dyreson CE, Snodgrass RT (1993) Timestamp semantics and representation. Inform Sys 18(3):143–166
13 Elmasri R, Wuu GTJ, Kim YJ (1990) The time index: an access structure for temporal data. In: Proceedings of the conference on very large databases, Brisbane, Queensland, Australia, 13–16 August 1990, pp 1–12
14 Etzion O, Jajodia S, Sripada S (1998) Temporal databases: research and practice. Lecture notes in computer science, vol 1399. Springer, Berlin Heidelberg New York
15 Gadia SK (1988) A homogeneous relational model and query languages for temporal databases. ACM Trans Database Sys 13(4):418–448
16 Gao D, Jensen CS, Snodgrass RT, Soo MD (2002) Join operations in temporal databases. TimeCenter TR-71. http://www.cs.auc.dk/TimeCenter/pub.htm
17 Gao D, Kline N, Soo MD, Dunn J (2002) TimeIT: the Time Integrated Testbed, v 2.0. Available via anonymous FTP at: ftp.cs.arizona.edu
18 Graefe G (1993) Query evaluation techniques for large databases. ACM Comput Surv 25(2):73–170
19 Graefe G, Linville A, Shapiro LD (1994) Sort vs hash revisited. IEEE Trans Knowl Data Eng 6(6):934–944
20 Gunadhi H, Segev A (1991) Query processing algorithms for temporal intersection joins. In: Proceedings of the IEEE conference on data engineering, Kobe, Japan, 8–12 April 1991, pp 336–344
21 Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the ACM SIGMOD conference, Boston, 18–21 June 1984, pp 47–57
22 Harris EP, Ramamohanarao K (1996) Join algorithm costs revisited. J Very Large Databases 5(1):64–84
23 Jensen CS (ed) (1998) The consensus glossary of temporal database concepts – February 1998 version. In [14], pp 367–405
24 Jensen CS, Mark L, Roussopoulos N (1991) Incremental implementation model for relational databases with transaction time. IEEE Trans Knowl Data Eng 3(4):461–473
25 Jensen CS, Snodgrass RT, Soo MD (1996) Extending existing dependency theory to temporal databases. IEEE Trans Knowl Data Eng 8(4):563–582
26 Jensen CS, Soo MD, Snodgrass RT (1994) Unifying temporal models via a conceptual model. Inform Sys 19(7):513–547
27 Leung TY, Muntz R (1990) Query processing for temporal databases. In: Proceedings of the IEEE conference on data engineering, Los Angeles, 6–10 February 1990, pp 200–208
28 Leung TYC, Muntz RR (1992) Temporal query processing and optimization in multiprocessor database machines. In: Proceedings of the conference on very large databases, Vancouver, BC, Canada, pp 383–394
29 Leung TYC, Muntz RR (1993) Stream processing: temporal query processing and optimization. In: Tansel A, Clifford J, Gadia S, Jajodia S, Segev A, Snodgrass RT (eds) Temporal databases: theory, design, and implementation, ch 14. Benjamin/Cummings, Reading, MA, pp 329–355
30 Li W, Gao D, Snodgrass RT (2002) Skew handling techniques in sort-merge join. In: Proceedings of the ACM SIGMOD conference on management of data, Madison, WI, 3–6 June 2002, pp 169–180
31 Lo ML, Ravishankar CV (1994) Spatial joins using seeded trees. In: Proceedings of the ACM SIGMOD conference, Minneapolis, MN, 24–27 May 1994, pp 209–220
32 Lo ML, Ravishankar CV (1996) Spatial hash-joins. In: Proceedings of the ACM SIGMOD conference, Montreal, 4–6 June 1996, pp 247–258
33 Lu H, Ooi BC, Tan KL (1994) On spatially partitioned temporal join. In: Proceedings of the conference on very large databases, Santiago de Chile, Chile, 12–15 September 1994, pp 546–557
34 McKenzie E (1988) An algebraic language for query and update of temporal databases. PhD dissertation, Department of Computer Science, University of North Carolina, Chapel Hill, NC
35 Mishra P, Eich M (1992) Join processing in relational databases. ACM Comput Surv 24(1):63–113
36 Navathe S, Ahmed R (1993) Temporal extensions to the relational model and SQL. In: Tansel A, Clifford J, Gadia S, Jajodia S, Segev A, Snodgrass RT (eds) Temporal databases: theory, design, and implementation. Benjamin/Cummings, Reading, MA, pp 92–109
37 Orenstein JA (1986) Spatial query processing in an object-oriented database system. In: Proceedings of the ACM SIGMOD conference, Washington, DC, 28–30 May 1986, pp 326–336
38 Orenstein JA, Manola FA (1988) PROBE spatial data modeling and query processing in an image database application. IEEE Trans Software Eng 14(5):611–629
39 Özsoyoğlu G, Snodgrass RT (1995) Temporal and real-time databases: a survey. IEEE Trans Knowl Data Eng 7(4):513–532
40 Patel JM, DeWitt DJ (1996) Partition based spatial-merge join. In: Proceedings of the ACM SIGMOD conference, Montreal, 4–6 June 1996, pp 259–270
41 Pfoser D, Jensen CS (1999) Incremental join of time-oriented data. In: Proceedings of the international conference on scientific and statistical database management, Cleveland, OH, 28–30 July 1999, pp 232–243
42 Ramakrishnan R, Gehrke J (2000) Database management systems. McGraw-Hill, New York
43 Rana S, Fotouhi F (1993) Efficient processing of time-joins in temporal data bases. In: Proceedings of the international symposium on DB systems for advanced applications, Daejeon, South Korea, 6–8 April 1993, pp 427–432
44 Salzberg B, Tsotras VJ (1999) Comparison of access methods for time-evolving data. ACM Comput Surv 31(2):158–221
45 Samet H (1990) The design and analysis of spatial data structures. Addison-Wesley, Reading, MA
46 Segev A (1993) Join processing and optimization in temporal relational databases. In: Tansel A, Clifford J, Gadia S, Jajodia S, Segev A, Snodgrass RT (eds) Temporal databases: theory, design, and implementation, ch 15. Benjamin/Cummings, Reading, MA, pp 356–387
47 Segev A, Gunadhi H (1989) Event-join optimization in temporal relational databases. In: Proceedings of the conference on very large databases, Amsterdam, 22–25 August 1989, pp 205–215
48 Sellis T, Roussopoulos N, Faloutsos C (1987) The R+-tree: a dynamic index for multidimensional objects. In: Proceedings of the conference on very large databases, Brighton, UK, 1–4 September 1987, pp 507–518
49 Sitzmann I, Stuckey PJ (2000) Improving temporal joins using histograms. In: Proceedings of the international conference on database and expert systems applications, London/Greenwich, UK, 4–8 September 2000, pp 488–498
50 Slivinskas G, Jensen CS, Snodgrass RT (2001) A foundation for conventional and temporal query optimization addressing duplicates and ordering. IEEE Trans Knowl Data Eng 13(1):21–49
51 Snodgrass RT, Ahn I (1986) Temporal databases. IEEE Comput 19(9):35–42
52 Son D, Elmasri R (1996) Efficient temporal join processing using time index. In: Proceedings of the conference on statistical and scientific database management, Stockholm, Sweden, 18–20 June 1996, pp 252–261
53 Soo MD, Jensen CS, Snodgrass RT (1995) An algebra for TSQL2. In: Snodgrass RT (ed) The TSQL2 temporal query language, ch 27. Kluwer, Amsterdam, pp 505–546
54 Soo MD, Snodgrass RT, Jensen CS (1994) Efficient evaluation of the valid-time natural join. In: Proceedings of the international conference on data engineering, Houston, TX, 14–18 February 1994, pp 282–292
55 Tsotras VJ, Kumar A (1996) Temporal database bibliography update. ACM SIGMOD Rec 25(1):41–51
56 Zhang D, Tsotras VJ, Seeger B (2002) Efficient temporal join processing using indices. In: Proceedings of the IEEE international conference on data engineering, San Jose, 26 February–1 March 2002, pp 103–113
57 Zurek T (1997) Optimisation of partitioned temporal joins. PhD dissertation, Department of Computer Science, Edinburgh University, Edinburgh, UK
Storing and querying XML data using denormalized relational databases
Andrey Balmin, Yannis Papakonstantinou
Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093
(e-mail:{abalmin,yannis}@cs.ucsd.edu)
Edited by A Halevy Received: December 21, 2001 / Accepted: July 1, 2003
Published online: June 23, 2004 – © Springer-Verlag 2004
Abstract XML database systems emerge as a result of the acceptance of the XML data model. Recent works have followed the promising approach of building XML database management systems on underlying RDBMSs. Achieving query processing performance reduces to two questions: (i) How should the XML data be decomposed into data that are stored in the RDBMS? (ii) How should the XML query be translated into an efficient plan that sends one or more SQL queries to the underlying RDBMS and combines the data into the XML result? We provide a formal framework for XML Schema-driven decompositions, which encompasses the decompositions proposed in prior work and extends them with decompositions that employ denormalized tables and binary-coded XML fragments. We provide corresponding query processing algorithms that translate the XML query conditions into conditions on the relational tables and assemble the decomposed data into the XML query result. Our key performance focus is the response time for delivering the first results of a query. The most effective of the described decompositions have been implemented in XCacheDB, an XML DBMS built on top of a commercial RDBMS, which serves as our experimental basis. We present experiments and analysis that point to a class of decompositions, called inlined decompositions, that improve query performance for full results and first results, without significant increase in the size of the database.
1 Introduction
The acceptance and expansion of the XML model creates a need for XML database systems [3,4,8,10,15,19,23,25,31,32,34,35,41]. One approach towards building XML DBMSs is based on leveraging an underlying RDBMS for storing and querying the XML data. This approach allows the XML database to take advantage of mature relational technology, which provides reliability, scalability, high-performance indices, concurrency control, and other advanced functionality.
Andrey Balmin has been supported by NSF IRI-9734548. The authors built the XCacheDB system while on leave at Enosys Software, Inc., during 2000.
[Fig 1 The XML database architecture. The XCacheDB loader (schema processor and data decomposer) reads an XML Schema and XML data, optionally guided by the user, and stores table definitions, schema information, and tuples in the RDBMS; the XCacheDB query processor translates XQuery against the exported XML view into SQL queries and assembles XML results from the returned tuple streams]
We provide a formal framework for XML Schema-driven decompositions of the XML data into relational data. The described framework encompasses the decompositions described in prior work on XML Schema-driven decompositions [3,34] and extends prior work with a wide range of decompositions that employ denormalized tables and binary-coded non-atomic XML fragments.

The most effective among the set of the described decompositions have been implemented in the presented XCacheDB, an XML DBMS built on top of a commercial RDBMS [5]. XCacheDB follows the typical architecture (see Fig 1) of an XML database built on top of an RDBMS [3,8,23,32,34]. First, XML data, accompanied by their XML Schema [38], are loaded into the database using the XCacheDB loader, which consists of two modules: the schema processor and the data decomposer. The schema processor inputs the XML Schema and creates in the underlying relational database the tables required to store any document conforming to the given XML schema. The conversion of the XML schema into a relational one may use optional user guidance. The mapping from the XML schema to the relational schema is called schema decomposition.¹ The data decomposer converts XML documents conforming to the XML schema into tuples that are inserted into the relational database.
XML data loaded into the relational database are queried by the XCacheDB query processor. The processor exports an XML view identical to the imported XML data. A client issues an XML query against the view. The processor translates the query into one or more SQL queries and combines the result tuples into the XML result. Notice that the underlying relational database is transparent to the query client.
The key challenges in XML databases built on relational systems are:
1. how to decompose the XML data into relational data;
2. how to translate the XML query into a plan that sends one or more SQL queries to the underlying RDBMS and constructs an XML result from the relational tuple streams.
A number of decomposition schemes have been proposed [3,8,11,34]. However, all prior works have adhered to decomposing into normalized relational schemas. Normalized decompositions convert an XML document into a typically large number of tuples of different relations. Performance is hurt when an XML query that asks for some parts of the original XML document results in an SQL query (or SQL queries) that has to perform a large number of joins to retrieve and reconstruct all the necessary information.
We provide a formal framework that describes a wide space of XML Schema-driven denormalized decompositions, and we explore this space to optimize query performance. Note that denormalized decompositions may involve a set of relational design anomalies, namely, non-atomic values, functional dependencies, and multivalued dependencies. Such anomalies introduce redundancy and impede the correct maintenance of the database [14]. However, given that the decomposition is transparent to the user, the introduced anomalies are irrelevant from a maintenance point of view. Moreover, XML databases today are mostly used in web-based query systems where datasets are updated relatively infrequently and query performance is crucial. Thus, in our analysis of the schema decompositions we focus primarily on their repercussions on query performance and secondarily on storage space and update speed.
The XCacheDB employs the most effective of the described decompositions. It employs two techniques that trade space for query performance by denormalizing the relational data:
• non-Normal Form (non-NF) tables eliminate many joins, along with the particularly expensive join start-up time;
• BLOBs are used to store pre-parsed XML fragments, hence facilitating the construction of XML results. BLOBs eliminate the joins and "order by" clauses that are needed for the efficient grouping of the flat relational data into nested XML structures, as was previously shown in [33].
Overall, both techniques have a positive impact on total query execution time in most cases. The results are most impressive when we measure the response time, i.e., the time it takes to output the first few fragments of the result. Response time is important for web-based query systems, where users tend to first issue under-constrained queries for purposes of information discovery. They want to quickly retrieve the first results and then issue a more precise query. At the same time, web interfaces do not need more than the first few results, since the limited monitor space does not allow the display of too much data. Hence it is most important to produce the first few results quickly.

¹ XCacheDB stores it in the relational database as well.
Our main contributions are:
• We provide a framework that organizes and formalizes a wide spectrum of decompositions of the XML data into relational databases.
• We classify the schema decompositions based on the dependencies in the produced relational schemas. We identify a class of mappings called inlined decompositions that allow us to considerably improve query performance by reducing the number of joins in a query, without a significant increase in the size of the database.
• We describe data decomposition, the conversion of an XML query into an SQL query to the underlying RDBMS, and the composition of the relational result into the XML result.
• We have built into the XCacheDB system the most effective of the possible decompositions.
• Our experiments demonstrate that under typical conditions certain denormalized decompositions provide significant improvements in query performance and especially in query response time. In some cases, we observed up to a 400% improvement in total time (Fig 23, Q1 with selectivity 0.1%) and a 2–100 times improvement in response time (Fig 23, Q1 with selectivity above 10%).
The rest of this paper is organized as follows. In Sect. 2 we discuss related work. In Sect. 3, we present definitions and the framework. Section 4 presents the decompositions of XML Schemas into sets of relations. In Sect. 5, we present algorithms for translating the XML queries into SQL and assembling the XML results. In Sect. 6, we discuss the architecture of XCacheDB along with interesting implementation aspects. In Sect. 7, we present the experimental results. We conclude and discuss directions for future work in Sect. 8.
2 Related work
The use of relational databases for storing and querying XML has been advocated before by [3,8,11,23,32,34]. Some of these works [8,11,23] did not assume knowledge of an XML schema. In particular, the Agora project employed a fixed relational schema, which stores a tuple per XML element. This approach is flexible, but it is less competitive than the other approaches because of the performance problems caused by the large number of joins in the resulting SQL queries. The STORED system [8] also employed a schema-less approach. However, STORED used data mining techniques to discover patterns in data and automatically generate XML-to-Relational mappings.
The works of [34] and [3] considered using DTDs and XML Schemas to guide the mapping of XML documents into relations. [34] considered a number of decompositions leading to normalized tables. The "hybrid" approach, which provides the best performance, is identical to our "minimal 4NF decomposition". The other approaches of [34] can also be modeled by our framework. In one respect our model is more restrictive, as we only consider DAG schemas while [34] also takes into account cyclic schemas. It is possible to extend our approach to arbitrary schema graphs by utilizing their techniques. [3] studies horizontal and vertical partitioning of the minimal 4NF schemas. Their results are directly applicable in our case. However, we chose not to experiment with those decompositions, since their effect, besides being already studied, tends to be less dramatic than the effect of producing denormalized relations. Note also that [3] uses a cost-based optimizer to find an optimal mapping for a given query mix. The query mix approach can benefit our work as well.
To the best of our knowledge, this is the first work to use denormalized decompositions to enhance query performance.
There are also other related works in the intersection of relational databases and XML. The construction of XML results from relational data was studied by [12,13,33]. [33] considered a variety of techniques for grouping and tagging results of the relational queries to produce the XML documents. It is interesting to note the comparison between the "sorted outer union" approach and BLOBs, which significantly improve query performance. SilkRoute [12,13] considered using multiple SQL queries to answer a single XML query and specified the optimal approach for various situations, which is applicable in our case as well.

Oracle 8i/9i, IBM DB2, and Microsoft SQL Server provide some basic XML support [4,19,31]. None of these products supported XQuery or any other full-featured XML query language as of May 2003.
Another approach towards storing and querying XML is based on native XML and OODB technologies [15,25,35]. The BLOBs resemble the common object-oriented technique of clustering together objects that are likely to be queried and retrieved jointly [2]. Also, the non-normal form relations that we use are similar to path indices, such as the "access support relations" proposed by Kemper and Moerkotte [20]. An important difference is that we store data together with an index, similarly to Oracle's "index organized tables" [4].
A number of commercial XML databases are available. Some of these systems [9,21,24] only support API data access and are effectively persistent implementations of the Document Object Model [36]. However, most of the systems [1,6,10,17,18,26,27,35,40–42,44] implement the XPath query language or its variations. Some vendors [10,26,35] have announced XQuery [39] support in upcoming versions; however, only the X-Hive 3.0 XQuery processor [41] and the Ipedo XML Database [18] were publicly available at the time of writing.

The majority of the above systems use native XML storage, but some [10,40,41] are implemented on top of object-oriented databases. Besides query processing, some of the commercial XML databases support full-text searches [18,41,44], transactional updates [6,10,18,26,40,42], and document versioning [18,40].
Even though XPath does not support heterogeneous joins, some systems [27,35] recognize their importance for data integration applications and provide facilities that enable this feature.
Our work concentrates on selection and join queries. Another important class of XML queries involves path expressions. A number of schemes [16,22] have been proposed recently that employ various node numbering techniques to facilitate the evaluation of path expressions. For instance, [22] proposes to use pairs of numbers (start position and subtree size) to identify nodes. The XSearch system [43] employs Dewey encoding of node IDs to quickly test for ancestor-descendant relationships. These techniques can be applied in the context of XCacheDB, since the only restriction that we place on node IDs is their uniqueness.
3 Framework
We use the conventional labeled tree notation to represent XML data. The nodes of the tree correspond to XML elements and are labeled with the elements' tags. Tags that start with the "@" symbol stand for attributes. Leaf nodes may also be labeled with values that correspond to the string content. Note that we treat XML as a database model that allows for rich structures that contain nesting, irregularities, and structural variance across the objects. We assume the presence of an XML Schema and expect the data to be accessed via an XML query language such as XQuery. We have excluded many document-oriented features of XML, such as mixed content, comments, and processing instructions.

Every node has a unique id invented by the system. The id's play an important role in the conversion of the tree to relational data, as well as in the reconstruction of the XML fragments from the relational query results.
Definition 1 (XML document) An XML document is a tree where:
1. Every node has a label l coming from the set of element tags L.
2. Every node has a unique id.
3. Every atomic node has an additional label v coming from the set of values V. Atomic nodes can only be leafs of the tree.² ♦
Figure 2 shows an example of an XML document tree. We will use this tree as our running example. We consider only unordered trees. We can extend our approach to ordered trees because the node id's are assigned by a depth-first traversal of the XML documents and can be used to order sibling nodes.
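To illustrate, the following Python sketch (names are ours) assigns ids in a depth-first traversal; earlier siblings always receive smaller ids, so document order among siblings can be recovered by comparing ids.

from itertools import count

# Minimal sketch: ids assigned by depth-first traversal, so that
# comparing sibling ids reproduces document order.
def assign_ids(element, counter=None):
    counter = counter or count(1)
    element["id"] = next(counter)          # the parent gets a smaller id...
    for child in element.get("children", []):
        assign_ids(child, counter)         # ...and so do earlier siblings

doc = {"tag": "Customers",
       "children": [{"tag": "Customer"}, {"tag": "Customer"}]}
assign_ids(doc)
# doc["children"][0]["id"] < doc["children"][1]["id"] holds by construction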
3.1 XML schema
We use schema graphs to abstract the syntax of XML Schema definitions [38]. The following example illustrates the connection between XML Schemas and schema graphs.

Example 1 Consider the XML Schema of Fig 3 and the corresponding schema graph of Fig 4. They both correspond to the TPC-H [7] data of Fig 2. The schema indicates that the XML data set has a root element named Customers, which contains one or more Customer elements. Each Customer contains (in some order) all of the atomic elements Name, Address, and MarketSegment, as well as zero or more complex elements Order and PreferredSupplier. These complex elements in turn contain other sets of elements.
² However, not every leaf has to be an atomic node. Leafs can also be empty elements.

[Fig 2 A sample TPCH-like XML data set. Id's and data values appear in brackets]
Notice that XML schemas and schema graphs are in some respects more powerful than DTDs [37]. For example, in the schema graph of Fig 4 both Customer and Supplier have Address subelements, but the customer's address is simply a string, while the supplier's address consists of Street and City elements. DTDs cannot contain elements with the same name but different content types.
Definition 2 (Schema graph) A schema is a directed graph where:
1. Every node has a label l that is one of "all" or "choice" or comes from the set of element tags L. Nodes labeled "all" and "choice" have at least two children.
2. Every leaf node has a label t coming from the set of types T.
3. Every edge is annotated with "minOccurs" and "maxOccurs" labels, which can be a non-negative integer or "unbounded".
4. A single node r is identified as the "root". Every node of the graph is reachable from r. ♦

Schema graph nodes labeled with element tags are called tag nodes; the rest of the nodes are called link nodes.
Since we use an unordered data model, we do not include "sequence" nodes in the schema graphs. Their treatment is identical to that of "all" nodes. We also modify the usual definition of a valid document to account for the unordered model. To do that, we first define the content type of a schema node, which defines the bags of sibling XML elements that are valid with respect to the schema node.
Definition 3 (Content type) Every node g of a schema graph has a content type T(g), which is a set of bags of schema nodes, defined by the following recursive rules.
• If g is a tag node, T(g) = {{g}}.
• If g is a "choice" node g = choice(g_1, ..., g_n), then T(g) is the union of the sets T^min_i,max_i(g_i), where T^min_i,max_i(g_i) is the union of all bags obtained by concatenating between min_i and max_i bags of T(g_i); if min_i = 0, T^min_i,max_i(g_i) also includes an empty bag.
• If g is an "all" node g = all(g_1, ..., g_n), then T(g) is the union of all bags obtained by the concatenation of n bags, one from each T^min_i,max_i(g_i). ♦
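The following Python sketch spells out these recursive rules, assuming every maxOccurs is a small integer rather than "unbounded"; the node representation and all names are ours.

from itertools import product

# A node is ("tag", label) or ("choice", children) or ("all", children),
# where children are (min_occurs, max_occurs, node) triples.
# Bags are represented as sorted tuples of labels.
def content_type(node):
    kind = node[0]
    if kind == "tag":
        return {(node[1],)}                 # T(g) = {{g}}
    bags_per_child = [repeated(c) for c in node[1]]
    if kind == "choice":
        result = set()
        for bags in bags_per_child:
            result |= bags                  # union over the alternatives
        return result
    # "all": concatenate one bag from each child, in every combination
    return {tuple(sorted(sum(combo, ()))) for combo in product(*bags_per_child)}

def repeated(child):
    # T^{min,max}(g): bags built from between min and max bags of T(g)
    lo, hi, node = child
    base = content_type(node)
    bags = {()} if lo == 0 else set()       # minOccurs = 0 admits the empty bag
    current = {()}
    for n in range(1, hi + 1):
        current = {tuple(sorted(a + b)) for a in current for b in base}
        if n >= lo:
            bags |= current
    return bags

# The "choice" of Fig 5: one Street or one PO Box.
choice = ("choice", [(1, 1, ("tag", "Street")), (1, 1, ("tag", "POBox"))])
print(content_type(choice))                 # {('Street',), ('POBox',)}, in some order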
Definition 4 (Document tree valid wrt schema graph) We say that a document tree T is valid with respect to a schema graph G if there is a mapping from the nodes of T to the tag nodes of G such that the root of T is mapped to the root of G and, for every node of T mapped to a tag node g, the bag of tag nodes to which its children are mapped belongs to T^min,max(g_c), where g_c is the child of g, and min and max are the annotations of the edge from g to g_c. ♦

[Fig 3 The XML Schema (XSD listing) for the TPC-H-like data of Fig 2, defining the elements customers, customer, number, name, address, market, orders, status, price, date, lineitem, part, supplier, quantity, discount, preferred_supplier, nation, and balance]
Figure 5 illustrates how the content types are assigned and used in document validation. The Address element on the right is valid with respect to the schema graph on the left. Each schema node is annotated with its content type.

[Fig 4 The schema graph corresponding to the XML Schema of Fig 3]

[Fig 5 Content types and document tree validation]
For example, the type of the "choice" node is {{Street}, {PO Box}}. The validating mapping maps the document tree nodes to the tag nodes of the schema graph (mappings are shown by the dashed lines) in such a way that the bag of tag nodes corresponding to the children of every XML node is a member of the content type of the child of the corresponding schema node. For example, the children of the Address element belong to the content type of the "all" node.
Normalized schema graphs. To simplify the presentation, we only consider normalized schema graphs, where all incoming edges of link nodes have maxOccurs = 1. Any schema graph can be converted into a, possibly less restrictive, normalized schema graph by a top-down breadth-first traversal of the schema graph that applies the following rules. For every link node N that has an incoming edge with minOccurs = inMin and maxOccurs = inMax, the maxOccurs of every outgoing edge of N is multiplied by inMax; the result of the product is "unbounded" if at least one parameter is "unbounded". Similarly, if inMin > 1, the minOccurs of the incoming edge is set to 1 and the minOccurs of every outgoing edge is multiplied by inMin. Also, if N is a "choice", it gets replaced with an "all" node with the same set of children, and for every outgoing edge the minOccurs is set to 0. For example, the schema graph of Fig 6a will be normalized into the graph of Fig 6c. Notice that the topmost "choice" node is replaced by "all", since a customer may contain any combination of the choice's children.

[Fig 6 A schema graph (a), an equivalent graph obtained by splitting the Address node (b), and the normalized graph (c)]
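A hedged sketch of these normalization rules follows; the graph representation (edges carrying minOccurs/maxOccurs, with "unbounded" encoded as infinity) and all names are ours.

from math import inf   # maxOccurs "unbounded"

# A node is ("tag", label, edges) or ("all", edges) or ("choice", edges);
# an edge is (min_occurs, max_occurs, node).
def normalize_edge(edge):
    lo, hi, node = edge
    if node[0] == "tag":
        return (lo, hi, ("tag", node[1], [normalize_edge(e) for e in node[2]]))
    out = []
    for c_lo, c_hi, child in node[1]:
        c_hi = c_hi * hi                 # multiply outgoing maxOccurs by inMax
        if lo > 1:
            c_lo = c_lo * lo             # ...and outgoing minOccurs by inMin
        if node[0] == "choice":
            c_lo = 0                     # "choice" becomes "all", edges optional
        out.append(normalize_edge((c_lo, c_hi, child)))
    # the incoming edge of the link node now carries no multiplicity
    return (min(lo, 1), 1, ("all", out))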
Without loss of generality for the decomposition algorithms described next, we only consider schemas where minOccurs is 0 or 1 and maxOccurs is either 1 or "unbounded". We use the symbols "1", "*", "?", and "+" to encode the "minOccurs"/"maxOccurs" pairs. For brevity, we omit "1" annotations in the figures. We also omit "all" nodes if their incoming edges are labeled "1", whenever this doesn't cause ambiguity.

We only consider acyclic schema graphs. Schema graph nodes that are pointed to by a "*" or a "+" will be called repeatable.
4 XML decompositions
We describe next the steps of decomposing an XML document into a relational database. First, we produce a schema decomposition, i.e., we use the schema graph to create a relational schema. Second, we decompose the XML data and load it into the corresponding tables. We use the schema decomposition to guide the data load.

The generation of an equivalent relational schema proceeds in two steps. First, we decompose the schema graph into fragments. Second, we generate a relational table definition for each fragment.
Definition 5 (Schema decomposition) A schema decomposition of a schema graph G is a set of fragments, where each fragment is a subset of the nodes of G that forms a connected DAG. Every tag node of G has to be a member of at least one fragment. ♦

Due to the acyclicity of the schema graphs, each fragment has at least one fragment root node, i.e., a node that does not have incoming edges from any other node of the fragment. Similarly, fragment leaf nodes are the nodes that do not have outgoing edges that lead to other nodes of the fragment. Note that a schema decomposition is not necessarily a partition of the schema graph – a node may be included in multiple fragments (Fig 7).

Some fragments may contain only "choice" and "all" nodes. We call these fragments trivial, since they correspond to empty data fragments. We only consider decompositions that contain connected, non-trivial fragments, where all fragment leafs are tag nodes.

[Fig 7 An XML schema decomposition]
DAG schemas offer an extra degree of freedom, since an equivalent schema can be obtained by "splitting" some of the nodes that have more than one ancestor. For example, the schema of Fig 6b can be obtained from the schema of Fig 6a by splitting at element Address. Such a split corresponds to a derived horizontal partitioning of a relational schema [28]. Similarly, element nodes may also be eliminated by "combining" nodes. For example, an all(a*, b, a*) may be reduced to all(a*, b) if the types of both a's are equal.³ Since we consider an unordered data model, the queries cannot distinguish between the "first" and "second" a's in the data. Thus, we do not need to differentiate between them. A similar DTD reduction process was used in [34]. However, unlike [34], our decompositions do not require reduction and offer the flexibility needed to support document order. Similar functionality is included in LegoDB [3].
Definition 6 (Path set, equivalent schema graphs) A path set of a schema graph G is the set of all possible paths in G that originate at the root of G. Two schema graphs are equivalent if they have the same path set. ♦

We define the set of generalized schema decompositions of a graph G to be the set of schema decompositions of all graphs G' that are equivalent to G (including the schema decompositions of G itself). Whenever it is obvious from the context, we will say "set of schema decompositions", implying the set of generalized schema decompositions.
Definition 7 (Root fragments, parent fragments) A root fragment is a fragment that contains the root of the schema graph. For each non-root fragment F we define its parent fragments in the following way: Let R be a root node of F, and let P be a parent of R in the schema graph. Any fragment that contains P is a parent fragment of F.⁴ ♦

³ We say that types A and B are equal if every element that is valid wrt A is also valid wrt B, and vice versa.
⁴ Note that a decomposition can have multiple root fragments, and a fragment can have multiple parent fragments.

[Fig 8 Loading data into fragment tables]
Definition 8 (Fragment table) A fragment table T_F of a fragment F contains an ID attribute A_N^ID for every tag node N of F. If N is an atomic node, the table T_F also has a value attribute A_N. In addition, T_F contains a parent reference column for each distinct path that leads to a root of F from a repeatable ancestor A and does not include any intermediate repeatable ancestors. The parent reference columns store the value of the ID attribute of the corresponding ancestor A. ♦

For example, consider the Address fragment table of Fig 8. Regardless of other fragments present in the decomposition, the Address table will have two parent reference columns. One column will refer to the Customer element and another to the Supplier. Since we consider only tree data, every tuple of the Address table will have exactly one non-null parent reference.
A fragment table is named after the left-most root of the corresponding fragment. Since multiple schema nodes can have the same name, name collisions are resolved by appending a unique integer.

We use null values in ID columns to represent missing optional elements. For example, the null value in the POBox_id of the first tuple of the Address table indicates that the Address element with id=2 does not have a POBox subelement. An empty XML element N is denoted by a non-null value in A_N^ID and a null in A_N.
[Fig 9a,b Alternative fragmentations of the data of Fig 8]

Data load. We use the following inductive definition of the fragment tables' content. First, we define the data content of a fragment consisting of a single tag node N. The fragment table T_N, called a node table, contains an ID attribute A_N^ID, a value attribute A_N, and one or more parent attributes. Let us
consider a typed document tree D, where each node of D is mapped to a node of the schema graph. A tuple is stored in T_N for each node d ∈ D such that (d → N) ∈ D. Assume that d is a child of the node p ∈ D such that (p → P) ∈ D. The table T_N will be populated with the following tuple: ⟨A_P^ID = p_id, A_N^ID = d_id, A_N = d⟩. If T_N contains parent attributes other than A_P^ID, they are set to null.

A table T corresponding to an internal node N is populated depending on the type of the node:
• If N is an "all" node, then T is the result of a join of all children tables on the parent reference attributes.
• If N is a "choice" node, then T is the result of an outer union of all children tables.
• If N is a tag node, which by definition has exactly one child node with a corresponding table T_C, then T = T_N ⋈ T_C.

The following example illustrates the above definition. Notice that the XCacheDB Loader does not use the brute-force implementation suggested in the example; we employ optimizations that eliminate the majority of the joins.
Example 2 Consider the schema graph fragment and the corresponding data fragment of Fig 8. The Address fragment table is built from the node tables Zip, Street, and POBox, according to the algorithm described above. A table corresponding to the "choice" node in the schema graph is built by taking an outer union⁶ of Street and POBox. The result is joined with Zip to obtain the table corresponding to the "all" node. The result of the join is, in turn, joined with the Address node table (not shown), which contains the three attributes "customer_ref", "supplier_ref", and "address_id".
Alternatively, the "Address" fragment of Fig 8 can be split in two as shown in Fig 9a and b. The dashed line in Fig 9b indicates that a horizontal partitioning of the fragment should occur along the "choice" node. This line indicates that the fragment table should be split into two. Each table projects out the attributes corresponding to one side of the "choice". The tuples of the original table are partitioned into the two tables based on the null values of the projected attributes. This operation is similar to the "union distribution" discussed in [3]. Horizontal partitioning improves the performance of queries that access either side of the union (e.g., either Street or POBox elements). However, performance may degrade for queries that access only Zip elements. Since we assume no knowledge of the query workload, we do not perform horizontal partitioning automatically but leave it as an option to the system administrator.

⁶ The outer union of two tables P and Q is a table T with the set of attributes attr(T) = attr(P) ∪ attr(Q). The table T contains all tuples of P and Q, extended with nulls in all the attributes that were not present in the original table.
The following example illustrates decomposing the TPCH-like XML schema of Fig 4 and loading it with the data of Fig 2.
Example 3 Consider the schema decomposition of Fig 10. The decomposition consists of three fragments rooted at the elements Customers, Order, and Address. Hence the corresponding relational schema has tables Customers, Order, and Address. The bottom part of Fig 10 illustrates the contents of each table for the dataset of Fig 2. Notice that the tables Customers and Order are not in BCNF. For example, the table Order has the non-key functional dependency "order_id → number_id", which introduces redundancy.

We use "(FK)" labels in Fig 10 to indicate parent references. Technically, these references are not foreign keys, since they do not necessarily refer to a primary key.
Alternatively, one could have decomposed the example schema as shown in Fig 7. In this case there is a non-FD multivalued dependency (MVD) in the Customers table, i.e., an MVD that is not implied by a functional dependency. Orders and preferred suppliers of every customer are independent of each other:

Customers(customers_id, customer_id, c_name_id, c_address_id, c_marketSegment_id, c_name, c_address, p_name_id, p_number_id, p_nation_id, p_name, p_number, p_nation, p_address_id, a_street_id, a_city_id, a_street, a_city)

The decompositions that contain non-FD MVDs are called MVD decompositions.
Vertical partitioning. In the schema of Fig 10 the Address element is not repeatable, which means that there is at most one address per supplier. Using a separate Address table is an example of vertical partitioning, because there is a one-to-one relationship between the Address table and its parent table Customers. The vertical partitioning of XML data was studied in [3], which suggests that partitioning can improve performance if the query workload is known in advance. Knowing the groups of attributes that get accessed together, vertical partitioning can be used to reduce table width without incurring a big penalty from the extra joins. We do not consider vertical partitioning in this paper, but the results of [3] carry over to our approach. We use the term minimal to refer to decompositions without vertical partitioning.
Definition 9 (Minimal decompositions) A decomposition is minimal if all edges connecting nodes of different fragments are repeatable, i.e., labeled "*" or "+". ♦

Figures 7 and 11 show two different minimal decompositions of the same schema. We call the decomposition of Fig 11 a 4NF decomposition because all its fragments are 4NF fragments (i.e., the fragment tables are in 4NF). Note that a fragment is 4NF if and only if it does not include any "*" or "+" labeled edges, i.e., no two nodes of the fragment are connected by a "*" or "+" labeled edge. We assume that the only dependencies present are those derived by the decomposition. Every XML Schema tree has exactly one minimal 4NF decomposition, which minimizes the space requirements. From here on, we only consider minimal decompositions.
con-Prior work [3,34] considers only 4NF decompositions.However we employ denormalized decompositions to im-prove query execution time as well as response time Par-ticularly important for performance purposes is the class ofinlined decompositions described below The inlined decom-positions improve query performance by reducing the number
of joins, and (unlike MVD decompositions) the space head that they introduce depends only on the schema and not
over-on the dataset
Definition 10 (Non-MVD decompositions and inlined decompositions) A non-MVD fragment is one where all "*" and "+" labeled edges appear in a single path. A non-MVD decomposition is one that has only non-MVD fragments. An inlined fragment is a non-MVD fragment that is not a 4NF fragment. An inlined decomposition is a non-MVD decomposition that contains at least one inlined fragment. ♦
The non-MVD fragment tables may have functional dependencies (FDs) that violate the BCNF condition (and also the 3NF condition [14]), but they have no non-FD MVDs. For example, the Customers table of Fig 10 contains a non-key functional dependency with determinant customer_id (e.g., customer_id → c_name) that breaks the BCNF condition, since the key is "c_preferredSupplier_id". However, the table has no non-FD MVDs.

From the point of view of the relational data, an inlined fragment table is the join of the fragment tables that correspond to a line of two or more 4NF fragments. For example, the fragment table Customers of Fig 10 is the join of the fragment tables that correspond to the 4NF fragments Customers and PreferredSupplier of Fig 11. The tables that correspond to inlined fragments are very useful because they reduce the number of joins while they keep the number of tuples in the fragment tables low.
[Fig 10 An inlined decomposition of the TPCH-like schema and the contents of its fragment tables Customers, Order, and Address for the data of Fig 2]

[Fig 11 Minimal 4NF XML schema decomposition]

[Fig 12 Classification of schema decompositions]

Lemma 1 (Space overhead as a function of schema size). Let F be an inlined fragment that consists of two 4NF fragments F1 and F2 connected by a schema tree edge.⁷ For any data set, the number of tuples of F is less than the total number of tuples of F1 and F2.

Proof. Let's consider the following three cases. First, if the schema tree edge that connects F1 and F2 is labeled with "1" or "?", the tuples of F2 will be inlined with F1. Thus F will have the same number of tuples as F1. Second, if the edge is labeled with "+", F will have the same number of tuples as F2, since F will be the result of the join of F1 and F2, and the schema implies that for every tuple in F2 there is exactly one matching tuple, but no more, in F1. Third, if the edge is labeled with "*", F will have fewer tuples than the total of F1 and F2, since F will be the result of the left outer join of F1 and F2.

⁷ A fragment consisting of two non-MVD fragments connected together is not guaranteed to be non-MVD.
We found that the inlined decompositions can provide significant query performance improvement. Noticeably, the storage space overhead of such decompositions is limited, even if the decomposition includes all possible non-MVD fragments.

Definition 11 (Complete non-MVD decompositions) A complete non-MVD decomposition, complete for short, is one that contains all possible non-MVD fragments. ♦

The complete non-MVD decompositions are only intended for illustrative purposes, and we are not advocating their practical use.
Note that a complete non-MVD decomposition includes all fragments of the 4NF decomposition. The other fragments of the complete decomposition consist of fragments of the 4NF decomposition connected together. In fact, a 4NF decomposition can be viewed as a tree of 4NF fragments, called the 4NF fragment tree. The fragments of a complete minimal non-MVD decomposition correspond to the set of paths in this tree. The space overhead of a complete decomposition is a function of the size of the 4NF fragment tree.
as a function of schema) Consider a schema graph G, its
of tuples of the complete decomposition is
|D C (G)| =k
i=1
where h is the height of the 4NF fragment tree of G, and n is
Proof Consider a record tree R constructed from an XML
document tree T in the following fashion A node of the record
tree is created for every tuple of the 4NF data decomposition
D 4NF (T ) Edges of the record tree denote child-parent
rela-tionships between tuples There is a one to one mapping frompaths in the record tree to paths in its 4NF fragment tree, and
the height of the record tree h equals to the height of the 4NF fragment tree Since any fragment of D C (G) maps to a path
in the 4NF fragment tree, every tuple of D C (T ) maps to a
path in the record tree The number of path’s in the record tree
P (R) can be computed by the following recursive expression:
P (R) = N(R) + P (R1) + + P (R n ), where N(R) is the
number of nodes in the record tree and stands for all the paths
that start at the root R i’s denote subtrees rooted at the children
of the root The maximum depth of the recursion is h At each
level of the recursion, after the first one, the total number of
added paths is less than N Thus P (R) < hN.
Multiple tuples of D C (T ) may map to the same path in the record tree, because each tuple of D C (T ) is a result of some outerjoin of tuples of D 4NF (T ), and the same tuple may be
a result of multiple outer joins (e.g A 1 B = A 1 B
1 C, if C is empty.) However the same tuple cannot be a
result of more than n distinct left outerjoins Thus |D C (G)| ≤
P (R) ∗ n By definition |D 4NF (G)| = N; hence |D C (G)| <
|D 4NF (G)| ∗ h ∗ n 4.1 BLOBs
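The path-counting recursion used in the proof can be stated compactly in Python; the (label, children) representation is ours.

# P(R) = N(R) + P(R_1) + ... + P(R_k) from the proof of Lemma 2.
def num_nodes(tree):
    label, children = tree
    return 1 + sum(num_nodes(c) for c in children)

def num_paths(tree):
    # all paths that start at the root, plus the paths inside each subtree
    label, children = tree
    return num_nodes(tree) + sum(num_paths(c) for c in children)

r = ("root", [("a", [("b", [])]), ("c", [])])
print(num_paths(r))  # 8: four paths start at the root, two at a, one each at b and c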
To speed up construction of the XML results from the relational result sets, XCacheDB stores a binary image of pre-parsed XML subtrees as Binary Large OBjects (BLOBs). The binary format is optimized for efficient navigation and printing of the XML fragments. The fragments are stored in special BLOBs tables that use node IDs as foreign keys to associate the XML fragments with the appropriate data elements.

By default, every subtree of the document except the trivial ones (the entire document and separate leaf elements) is stored in the BLOBs table. This approach may have unnecessarily high space overhead, because the data gets replicated up to H − 2 times, where H is the depth of the schema tree. We reduce the overhead by providing a graphical utility, the XCacheDB Loader, which allows the user to control which schema nodes get “BLOB-ed” by annotating the XML Schema. The user should BLOB only those elements that are likely to be returned by the queries.
Fig. 13. XML query notation
For example, in the decomposition of Fig. 10, only Order and PreferredSupplier elements were chosen to be BLOB-ed, as indicated by the boxes. Customer elements may be too large and too infrequently requested by a query, while LineItem is small and can be constructed quickly and efficiently without BLOBs.
We chose not to store BLOBs in the same tables as the data to avoid an unnecessary increase in table size, since BLOB structures can be fairly large. In fact, a BLOB has similar size to the XML subtree that it encodes. The size of an XML document (without the header and whitespace) can be computed as
XML_Size = E_N ∗ (2 ∗ E_Size + 5) + T_N ∗ T_Size,

where E_N is the number of elements, E_Size is the average size of the element tag, T_N is how many elements contain text (i.e., leaves), and T_Size is the average text size. The size of a BLOB is:

BLOB_Size = E_N ∗ (E_Size + 10) + T_N ∗ (T_Size + 3)
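Restated as code (the XML document estimate follows the formula reconstructed above, so its constants are an assumption; the BLOB estimate is as given in the text):

def xml_size(e_n, e_size, t_n, t_size):
    # per element: "<tag>" and "</tag>" = 2 * E_Size + 5 characters
    return e_n * (2 * e_size + 5) + t_n * t_size

def blob_size(e_n, e_size, t_n, t_size):
    # per the formula above: fixed 10-byte element overhead and
    # 3-byte text overhead
    return e_n * (e_size + 10) + t_n * (t_size + 3)

# 1000 elements with 8-character tags, 600 of them text leaves holding
# 20 characters each: the two sizes are indeed similar.
print(xml_size(1000, 8, 600, 20))   # 33000
print(blob_size(1000, 8, 600, 20))  # 31800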
The separate BLOBs table also gives us the option of using a separate SQL query to retrieve BLOBs, which improves the query response time.
5 XML query processing
We represent XML queries with a tree notation similar to loto-ql [29]. The query notation facilitates explanation of query processing and corresponds to FOR-WHERE-RETURN queries of the XQuery standard [39].
Definition 12 (Query) A query is a tuple ⟨C, E, R⟩, where C is called the condition tree, E is called the condition expression, and R is called the result tree. The condition tree C contains:

• Element nodes that are labeled with labels from L. Each element node n may also be labeled with a variable.

• Union nodes. The same set of variables must occur in all children subtrees of a Union node. Two nodes cannot be labeled with the same variable, unless their lowest common ancestor is a Union node.

The condition expression E is a Boolean expression built from logical connectives, constants, and variables that occur in C. In the result tree R, leaf nodes are labeled either with variables that occur in C or with constants; non-leaf nodes may be labeled with “group-by” labels consisting of one or more variables that occur in C, and a variable that labels a leaf l must occur in the group-by label of l or the group-by label of an ancestor of l. ♦
The query semantics are based on first matching the condition tree with the XML data to obtain bindings and then using the result tree to structure the bindings into the XML result.

The semantics of the condition tree are defined in two steps. First, we remove Union nodes and produce a forest of conjunctive condition trees by traversing the condition tree bottom-up and replacing each Union node nondeterministically by one of its children. This process is similar to producing a disjunctive normal form of a logical expression. The set of bindings produced by the condition tree is defined as the union of the sets of bindings produced by each of the conjunctive condition trees.
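This expansion can be sketched as follows (a simplified illustration with an invented node encoding, not the actual implementation):

# A condition node is ("union", children) or ("elem", label, var,
# children), with var = None for nodes that carry no variable.
def conjunctive_trees(node):
    """Return the forest of conjunctive condition trees of `node`,
    replacing every Union node by each of its children in turn."""
    if node[0] == "union":
        return [t for child in node[1] for t in conjunctive_trees(child)]
    _, label, var, children = node
    # combine the alternatives of the children, as in a DNF expansion
    forests = [[]]
    for c in children:
        forests = [f + [t] for f in forests for t in conjunctive_trees(c)]
    return [("elem", label, var, f) for f in forests]

# A condition tree whose Union node offers two alternatives for $X:
c = ("elem", "customer", None,
     [("union", [("elem", "name", "X", []), ("elem", "alias", "X", [])])])
print(len(conjunctive_trees(c)))  # 2 conjunctive condition trees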
Formally, let C be a condition tree of a query and t be the XML document tree. Let Var(C) be the set of variables in C. Let C_1, ..., C_l be the set of all conjunctive condition trees of C. Note that Var(C) = Var(C_i), ∀i ∈ [1, l]. A variable binding ˆβ maps each variable of Var(C) to a node of t. The set of variable bindings is computed based on the set of condition tree bindings. A condition tree binding β maps each node n of some conjunctive condition tree C_i to a node of t. The condition tree binding is valid if β(root(C_i)) = root(t) and, recursively, traversing C_i depth-first left-to-right, for each child c_j of a node c ∈ C_i, assuming c is mapped to x ∈ t, there exists a child x_j of x such that β(c_j) = x_j and label(c_j) = label(x_j). The set of variable bindings consists of all bindings ˆβ = [V_1 → x_1, ..., V_n → x_n] for which there exists a valid condition tree binding β = [c_1 → x_1, ..., c_n → x_n, ...] such that V_1 = Var(c_1), ..., V_n = Var(c_n).
The condition expression E is evaluated using the binding values, and if it evaluates to true, the variable binding is qualified. Notice that the variables bind to XML elements and not to their content values. In order to evaluate the condition expression, all variables are coerced to the content values of the elements to which they bind. For example, in Fig. 13 the variable P binds to an XML element “price”. However, when evaluating the condition expression we use the integer value of “price”.
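The following sketch illustrates these semantics for a conjunctive condition tree (a simplified illustration; the tuple encodings and the sample data are invented):

from itertools import product

# Document node: (label, content, children); condition node:
# (label, var, children), with var = None if the node has no variable.
def tree_bindings(cnode, xnode):
    """Yield variable bindings for matches of cnode at xnode."""
    clabel, var, cchildren = cnode
    xlabel, content, xchildren = xnode
    if clabel != xlabel:
        return
    base = {var: xnode} if var is not None else {}
    # every condition child must match some child of the document node
    options = [[b for x in xchildren for b in tree_bindings(c, x)]
               for c in cchildren]
    for combo in product(*options):
        merged = dict(base)
        for b in combo:
            merged.update(b)
        yield merged

doc = ("lineitem", None, [("price", 45000, []), ("price", 20000, [])])
cond = ("lineitem", None, [("price", "P", [])])

# Variables bind to elements; the condition expression coerces them to
# their content values (here: P > 30000).
qualified = [b for b in tree_bindings(cond, doc) if b["P"][1] > 30000]
print(qualified)  # one binding, for the 45000 price element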
Once a set of qualified bindings is identified, the resulting XML document tree is constructed by structural recursion on the result tree R as follows. The recursion starts at the root of R with the full set of qualified bindings B. Traversing R top-down, for each sub-tree R(n) rooted at node n, given a partial set of bindings B′ (we explain how B′ gets constructed next), we construct a forest F(n, B′) following one of the cases below; a small code sketch of this construction is given after the case list.

Label: If n consists of a tag label L without a group-by label, the result is an XML tree with root labeled L. The list of children of the root is the concatenation F(n_1, B′′)# ... #F(n_m, B′′), where n_1, ..., n_m are the children of n. For the children, the partial set of bindings is B′′ = B′.
Fig. 14. The XQuery equivalent to the query of Fig. 13
Group-By: If n is of the form L{V_1, ..., V_k}, where V_1, ..., V_k are variables, the result is a forest with one XML tree T_{v_1,...,v_k} for each distinct set v_1, ..., v_k of values of V_1, ..., V_k in B′. Each T_{v_1,...,v_k} has its root labeled L. The list of children of the root is the concatenation F(n_1, B′_1)# ... #F(n_m, B′_m), where B′_i is the set of bindings of B′ in which V_1, ..., V_k bind to v_1, ..., v_k, projected on Var(n_i), the set of variables that occur in the tree rooted at n_i.
Leaf Group-By: If n is a leaf node of the form V{V_1, ..., V_k}, the result is a list with one value of V for each distinct set v_1, ..., v_k of values of V_1, ..., V_k in B′.

Leaf Variable: If n is a single variable V, and V binds to an element E in B′, the result is E. If the query plan is valid, B′ will contain only a single tuple.
The result of the query is the forest F(r, B), where r is the root of the result tree and B is the set of bindings delivered by the condition tree and condition expression. However, since in our work we want to enforce that the result is a single XML tree, we require that r does not have a “group-by” label.
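A minimal sketch of this construction, covering the Label, Group-By, and Leaf Variable cases (the node and binding encodings are invented; Leaf Group-By is analogous):

# Result-tree node: (kind, name, groupby, children) with kind "label"
# or "var"; an output element is (tag, list of children or values).
def construct(node, bindings):
    kind, name, groupby, children = node
    if groupby:  # Group-By: one output tree per distinct value tuple
        forest = []
        for key in sorted({tuple(b[v] for v in groupby) for b in bindings}):
            part = [b for b in bindings
                    if tuple(b[v] for v in groupby) == key]
            forest.extend(construct((kind, name, [], children), part))
        return forest
    if kind == "var":               # Leaf Variable: a valid plan leaves
        return [bindings[0][name]]  # exactly one binding here
    # Label: one element; children are built from the same bindings
    return [(name, [e for c in children for e in construct(c, bindings)])]

# The Result{N,O} node of the query of Fig. 13, on two bindings:
bindings = [{"N": "Customer1", "O": "Order3"},
            {"N": "Customer1", "O": "Order13"}]
plan = ("label", "Result", ["N", "O"],
        [("var", "N", [], []), ("var", "O", [], [])])
print(construct(plan, bindings))
# [('Result', ['Customer1', 'Order13']), ('Result', ['Customer1', 'Order3'])]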
Example 4 The condition tree and expression of the query of Fig. 13 retrieve tuples ⟨N, O⟩, where N is the Name element of a Customer element with an Order O that has at least one LineItem that has Price greater than 30000. For each such pair, a Result element is created that contains the N and the O. This is essentially query number 18 of the TPC-H benchmark suite [7], modified not to aggregate across lineitems of the order. It is equivalent to the XQuery of Fig. 14.

For example, if the query is executed on the data of Fig. 2, the following set of bindings is produced, assuming that the Order elements are BLOB-ed.
Numbers in subscript indicate node IDs of the elements; square brackets denote values of atomic elements and subelements of complex elements. First, a single root element is created. Then, the group-by on the Result node partitions the bindings into two groups (for Order_3 and Order_13) and creates a Result element for each group. The second group-by creates two Order elements from the following two sets of bindings:

Result_102[Name_29[“Customer1”], Order_13[...]]]
We can extend our query semantics to an ordered XML model. To support order-preserving XML semantics, group-by operators will produce lists, given sorted lists of source bindings. In particular, the group-by operator will order the output elements according to the node IDs of the bindings of the group-by variables. For example, the group-by in the query of Fig. 13 will produce lists of pairs of names and orders, sorted by name ID and order ID.
5.1 Query processing
Figure 15 illustrates the typical query processing steps followed by XML databases built on relational databases; the architecture of XCacheDB is indeed based on the one of Fig. 15. The plan generator receives an XML query and a schema decomposition. It produces a plan, which consists of the condition tree, the condition expression, the plan decomposition,
Fig. 15. Query processing architecture