Managing Time in Relational Databases, Part 19


AND c.eff_beg_dt <= cl.row_crt_dt
AND c.eff_end_dt > cl.row_crt_dt
AND c.asr_beg_dt <= cl.row_crt_dt
AND c.asr_end_dt > cl.row_crt_dt
WHERE cl.claim_amt > p.copay_amt
ORDER BY cl.adjud_dt, c.client_nbr, p.policy_nbr, p.eff_beg_dt;

To conclude this section, we show what this query might look like if the SQL language supported PERIOD datatypes, and also our taxonomy of Allen relationships. We suppose that the taxonomy node [fills⁻¹] is represented by the reserved word INCLUDES. With a SQL language like this, the Asserted Versioning schema no longer has pairs of dates to represent its two time periods. Instead, it has the single columns asr_per and eff_per.

SELECT c.client_nbr, c.client_nm,
       p.policy_nbr, p.policy_type, p.copay_amt,
       cl.service_dt, cl.claim_amt, cl.adjud_dt
FROM Claim cl
INNER JOIN Policy_AV p
   ON p.policy_oid = cl.policy_oid
  AND p.eff_per INCLUDES cl.service_dt
  AND p.asr_per INCLUDES cl.adjud_dt
INNER JOIN Client_AV c
   ON c.client_oid = p.client_oid
  AND c.eff_per INCLUDES cl.row_crt_dt
  AND c.asr_per INCLUDES cl.row_crt_dt
WHERE cl.claim_amt > p.copay_amt
ORDER BY cl.adjud_dt, c.client_nbr, p.policy_nbr, p.eff_beg_dt;

In either form, what is striking about the query is its simplicity relative to the complexity of the bi-temporal semantics that underlies it. Unlike queries in the standard temporal model and, for that matter, uni-temporal queries in the alternative temporal model as well, this query does not assemble a collection of rows and then proceed to check for temporal gaps and temporal overlaps within sub-selected collections of those rows. Asserted Versioning enforces bi-temporal semantics once, as the data is being created and modified, rather than each time the data is queried.
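The equivalence between the two forms of the query can be sketched in a few lines of Python. This is only an illustration, not part of the AVF: the Period class and the includes function are hypothetical names, and the sample dates are invented. It assumes the closed-open convention, in which the begin date belongs to the period and the end date does not.

```python
from datetime import date
from typing import NamedTuple


class Period(NamedTuple):
    """A closed-open time period: beg is part of the period, end is not."""
    beg: date
    end: date


def includes(period: Period, point: date) -> bool:
    # Same test as the date-pair form: beg_dt <= point AND end_dt > point
    return period.beg <= point < period.end


# Hypothetical effective time period: January 2010 up to, but not including, January 2011.
eff_per = Period(date(2010, 1, 1), date(2011, 1, 1))

in_period = includes(eff_per, date(2010, 6, 15))  # a date inside the period
at_begin = includes(eff_per, date(2010, 1, 1))    # the begin date is included
at_end = includes(eff_per, date(2011, 1, 1))      # the end date is excluded
```

A single call such as includes(eff_per, service_dt) then stands in for the pair of comparisons that the date-pair form of the query spells out for every time period.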

In Other Words

With appropriate temporal extensions to the SQL language, the expression of all thirteen Allen relationships, and of this and other relationships which are combinations of those thirteen relationships, would be greatly simplified. The first thing that is needed to support predicates for these relationships is to provide a PERIOD datatype, as we discussed in Chapter 3. With that datatype available, SQL could express each of the relationships we have discussed with one binary predicate relating two time periods (not two pairs of dates).

For example, instead of having to request data associated with two time periods such that the first starts before the second and ends after the second starts but before the second ends, we could simply request data associated with two time periods such that the first [overlaps] the second.
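As a rough sketch of what such a binary predicate hides, the [overlaps] relationship described above can be written as one function over two periods. The Period class, the overlaps function and the sample dates are hypothetical and chosen only for illustration.

```python
from datetime import date
from typing import NamedTuple


class Period(NamedTuple):
    """A closed-open time period: beg is part of the period, end is not."""
    beg: date
    end: date


def overlaps(first: Period, second: Period) -> bool:
    # The first starts before the second, and ends after the second
    # starts but before the second ends.
    return first.beg < second.beg < first.end < second.end


jan_to_jun = Period(date(2010, 1, 1), date(2010, 6, 1))
mar_to_sep = Period(date(2010, 3, 1), date(2010, 9, 1))

forward = overlaps(jan_to_jun, mar_to_sep)
backward = overlaps(mar_to_sep, jan_to_jun)
```

Here forward is true and backward is false, because the relationship as stated is directional: it holds only when the first period is the one that starts earlier.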

Or, instead of having to request data associated with two time periods such that the first doesn’t start after the second and doesn’t end before the second, we could simply request data associated with two time periods such that the first [fills] the second.

It is clearly easier to think about what information one wants from the database at the higher level of abstraction provided by this new datatype and these new relationships, rather than at the level of abstraction in which begin and end dates have to be used, as they are in the original formulation of the example. And it is just as clearly easier to write the corresponding SQL.

But even with today’s SQL, which lacks these temporal extensions, Asserted Versioning manages assertion and effective time date pairs as user-defined PERIOD datatypes, and supports all the Allen relationships as well as the other relationships in our Allen relationship taxonomy. Asserted Versioning thus provides a migration path to the day when these extensions are supported in the SQL standard and in commercial DBMSs.

Glossary References

Glossary entries whose definitions form strong interdependencies are grouped together in the following list. The same glossary entries may be grouped together in different ways at the end of different chapters, each grouping reflecting the semantic perspective of each chapter. There will usually be several other, and often many other, glossary entries that are not included in the list, and we recommend that the Glossary be consulted whenever an unfamiliar term is encountered.

We note, in particular, that none of the nodes in the Asserted Versioning taxonomy of Allen relationships are included in this list. In general, we leave taxonomy nodes out of these lists since they are long enough without them.

Allen relationships
Asserted Versioning Framework (AVF)
episode
clock tick
closed-open
contiguous
granularity
effective begin date
effective end date
object
PERIOD datatype
point in time
time period
temporal entity integrity (TEI)
temporal referential integrity (TRI)
the alternative temporal model
the standard temporal model
version

OPTIMIZING ASSERTED VERSIONING DATABASES

Bi-Temporal, Conventional, and Non-Temporal Databases
Data Volumes in Bi-Temporal and in Conventional Databases
Response Times in Bi-Temporal and in Conventional Databases
The Optimization Drill: Modify, Monitor, Repeat
Performance Tuning Bi-Temporal Tables Using Indexes
General Considerations
Indexes to Optimize Queries
Indexes to Optimize Temporal Referential Integrity
Other Techniques for Performance Tuning Bi-Temporal Tables
Avoiding MAX(dt) Predicates
NULL vs. 12/31/9999
Partitioning
Clustering
Materialized Query Tables
Standard Tuning Techniques
Glossary References

One concern about Asserted Versioning is with how well it will perform. We believe that with recent improvements in technology, and with the use of the physical design techniques described in this chapter, Asserted Versioning databases can achieve performance very close to that of conventional databases. This is especially true for queries, which are usually the most frequent kind of access to any relational database. The AVF, our own implementation of Asserted Versioning, is designed to operate well with large data volume databases supporting a high volume of mixed-type data retrieval requests.

Managing Time in Relational Databases. DOI: 10.1016/B978-0-12-375041-9.00015-7

Copyright © 2010 Elsevier Inc. All rights of reproduction in any form reserved.

Bi-Temporal, Conventional, and Non-Temporal Databases

In this section, we compare data volumes and response times in bi-temporal and in conventional databases. We find that differences in both data volumes and response times are generally quite small, and are usually not good reasons for hesitating to implement bi-temporal data in even the largest databases of the world’s largest corporations.

Data Volumes in Bi-Temporal and in Conventional Databases

It might seem that a bi-temporal database will have a lot more data in it than a conventional database, and will consequently take a lot longer to process. It is true that the size of a bi-temporal database will be larger than that of an otherwise identical database which contains only current data about persistent objects. But in our consulting engagements, which span several decades and dozens of clients, we have found that in most mission-critical systems, temporal data is jury-rigged into ostensibly non-temporal databases.

There are any number of ways that this may happen. For example, in some systems a version date is added to the primary key of selected tables. In other systems, more advanced forms of best practice versioning (as described in Chapter 4) are employed. Sometimes, history will be captured by triggering an insert into a history table every time a particular non-temporal table is modified. Another approach is to generate a series of periodic snapshot tables that capture the state of a non-temporal table at regular intervals.

Of course, a database with no temporal data at all will certainly be smaller than the same database with temporal data. But adding up the overhead associated with embedded best practice versioning, or with triggered history, periodic snapshots or some combination of these and other techniques, the amount of data in a so-called non-temporal database may be as much or even more than the amount of data in a bi-temporal database.
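To make the triggered-history case concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column and trigger names, and the sample values, are all hypothetical; they only illustrate how an ostensibly non-temporal table can accumulate history in a companion table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE policy (
    policy_nbr TEXT PRIMARY KEY,
    copay_amt  INTEGER
);
CREATE TABLE policy_history (
    policy_nbr  TEXT,
    copay_amt   INTEGER,
    replaced_at TEXT
);
-- Every update copies the row being replaced into the history table.
CREATE TRIGGER policy_audit AFTER UPDATE ON policy
BEGIN
    INSERT INTO policy_history
    VALUES (OLD.policy_nbr, OLD.copay_amt, datetime('now'));
END;

INSERT INTO policy VALUES ('P138', 20);
UPDATE policy SET copay_amt = 25 WHERE policy_nbr = 'P138';
UPDATE policy SET copay_amt = 30 WHERE policy_nbr = 'P138';
""")

current_rows = conn.execute("SELECT count(*) FROM policy").fetchone()[0]
history = conn.execute(
    "SELECT policy_nbr, copay_amt FROM policy_history ORDER BY rowid"
).fetchall()
```

After two updates the policy table still holds a single row, but the database as a whole holds three rows about that one policy, which is the kind of hidden volume the comparison above has in mind.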

Throughout this book, we have been using the terms “non-temporal database” and “conventional database” as equivalent expressions. But now we have a reason to distinguish them. From now on, we will call a database “non-temporal” only if it contains no temporal data about persistent objects at all.[1] And from now on, we will use the term “conventional database” to refer to databases that may or may not contain temporal data about persistent objects (and that usually do), but that do not contain explicitly bi-temporal tables and instead incorporate temporal data by using variations on one or more of the ad hoc methods we have described.

Response Times in Bi-Temporal and in Conventional Databases

At the level of individual tables, a table lacking temporal data will clearly have less data than an otherwise identical table that also contains temporal data. But even if a table has more data than another table, it may perform nearly as well as that other table because response times are usually not linear to the amount of data in the target table.

Response times will be approximately linear to the amount of data in the table in the case of full table scans, but will almost never be linear for direct access reads. A direct (random) read to a table with five million rows will perform almost as well as a direct read to a table with only one million rows, provided that the table is indexed properly and that the number of non-leaf index levels is the same. And, in most cases, they will be the same, or very close to it.
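A back-of-the-envelope calculation shows why the number of index levels tends to stay the same. The sketch below assumes an idealized tree in which every index page holds the same number of entries; the fanout of 200 is an assumption chosen for illustration, not a figure from any particular DBMS, and the function counts all levels of the tree, leaf level included.

```python
def index_levels(row_count: int, fanout: int) -> int:
    """Levels needed by an idealized tree in which every page holds `fanout` entries."""
    levels, capacity = 1, fanout
    while capacity < row_count:
        capacity *= fanout  # each extra level multiplies the reachable rows
        levels += 1
    return levels


FANOUT = 200  # assumed entries per index page (hypothetical)

one_million = index_levels(1_000_000, FANOUT)
five_million = index_levels(5_000_000, FANOUT)
```

Under these assumptions both tables need three levels, so a direct read walks the same number of index pages whether the table has one million rows or five million.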

In addition, when adding in the overhead of triggers of an exponentially growing number of dependents, and of the often inefficient SQL used to access and maintain data in conventional databases, it is likely that using the AVF to manage temporal data in an Asserted Versioning database will prove to be a more efficient method of managing temporal data than directly invoking DBMS methods to manage temporal data in a conventional database.

[1] The point of adding “about persistent objects”, of course, is to distinguish between objects and events, as we did in our taxonomy in Chapter 2. So a “non-temporal database”, in this new sense, may contain event tables, i.e. tables of transactions. And it may also contain fact-dimension data marts. What it may not contain is data about any historical (or future) states of persistent objects.

The Optimization Drill: Modify, Monitor, Repeat

Performance optimization, also known as “performance tuning”, is usually an iterative approach to making and then monitoring modifications to an application and its database. It could involve adjusting the configuration of the database and server, or making changes to the applications and the SQL that maintain and query the database. As authors of this book, we can’t participate in the specific modify and monitor iterative processes being carried on by any of our readers and their IT organizations. But we can describe factors that are likely to apply to any Asserted Versioning implementation.

These factors include the number of users, the complexity of the application and the SQL, the volatility of the data, and the DBMS and server platform. The major DBMSs may optimize varying configurations differently, and may have extensions that can be used to simplify and improve a “plain vanilla” implementation of Asserted Versioning.

In this chapter, we will take a broad brush approach and, in general, discuss optimization techniques that apply to the temporalization of any relational database, regardless of what industry its owning organization is part of, and regardless of what types of applications it supports. Each reader will need to review these recommendations and determine if and how they apply to specific databases and applications that she may be responsible for.

To repeat once more as we read the following sections, although we use the term “date” in this book to describe the delimiters of assertion and effective time periods, those delimiters can actually be of any time duration, such as a day, minute, second or microsecond. We use a month as the clock tick granularity in many of our examples. But in most cases, a finer level of granularity will be chosen, such as a timestamp representing the smallest clock tick supported by the DBMS.

Performance Tuning Bi-Temporal Tables Using Indexes

Many indexes are designed using something similar to a B-tree (balanced tree) structure, in which each node points to its next-level child nodes, and the leaf nodes contain pointers to the desired data. These indexes are used by working down from the top of the hierarchy until the leaf node containing the desired pointer is reached. Each pointer is a specific index value paired with the physical address, page or row id of the row that matches that value. From that point, the DBMS can do a direct read and retrieve the I/O page that contains the desired data.

B-tree indexes for bi-temporal tables work no differently than B-tree indexes for non-temporal tables. Knowing how these indexes work, our design objective is to construct indexes that will optimize the speed of access to the most frequently accessed data. In bi-temporal tables, we believe, that will almost always be the currently asserted current versions of the objects represented in those tables. As index designers, our task is two-fold. First, we need to determine the best columns to index on. Then we need to arrange those columns in the best sequence.

General Considerations

The physical sequence of columns within an index has a significant impact on the performance of queries that use that index. Our objective is to get to the desired row in a table with the minimum amount of I/O activity against the index, followed by a single direct read to the table itself. So in determining the sequence of columns in an index, a good idea is to put the most frequently used lookup columns in the leftmost (initial) nodes of the index. These columns are often the columns that make up the business key, or perhaps some other identifier such as the primary key, or a foreign key.

Against asserted version tables, most queries will be similar to queries against non-temporal tables except that a few temporal predicates will be added to the queries. These temporal predicates eliminate rows whose assertion time periods and/or effective time periods are not what the query is looking for.

An object that is represented by exactly one row in a non-temporal table may be represented by any number of rows in a temporal table. But for normal business use, the one current row in the temporal table, i.e. the row which corresponds to that one row in the non-temporal table, is likely to be accessed much more frequently than any of the other rows. Unless we properly combine temporal columns with non-temporal columns in the index, access to that current row may require us to scan through many past or future rows to get to it.

Of course, we are talking about both a scan of index leaf pages, as well as the more expensive scan of the table itself. When specific rows are being searched for, and when they may or may not be clustered close to one another in physical storage, we want to minimize any type of scan.

Another important consideration in determining the optimal sequence of columns in an index is that optimizers may decide not to use a column in an index unless values have been provided for all the columns to its left, those being the columns that help to more directly trace a path through the higher levels of the index tree, using the columns that match supplied predicates. So if we design an index with its temporal columns too far to the right, and with unqualified columns prior to them, a scan might still be triggered whenever the optimizer looks for the one current row for the object being queried. On the other hand, as we will see, the solution is not to simply make the temporal columns left-most in the index.

There will usually be many more non-current rows for an object, in an asserted version table, than the one current row for that object. The table may contain any number of rows representing the history of the object, and any number of rows representing anticipated future states of the object. The table may contain any number of no longer asserted rows for that object, as well as rows that we are not yet prepared to assert.

So what we want the optimizer to do is to jump as directly as possible to the one currently asserted current version for an object, without having to scan through a potentially large number of non-current rows.

Indexes to Optimize Queries

Let’s look at an example. We will assume that it is currently September 2011. So the next time the clock ticks, according to the clock tick granularity used in this book, it will be October 2011.

In the table shown in Figure 15.1, there are nine rows representing the object whose object identifier is 55. Three of those rows are historical versions. Their effectivity periods are past. They represent past states of the object they refer to. We designate them with “pe” (past effective) in the state column of the table.[2]

Another three of those rows are no longer asserted. Their assertion periods are past. They represent claims that we once made, claims that the statements which those rows made about the objects which they represented were true statements. But now we no longer make those claims. They exist in the assertion time past. We designate these rows with “pa” (past asserted) in the state column of the table.

Two of those rows are not yet asserted. They are deferred assertions. We are not yet willing to claim that the statements made by those rows are true statements. We designate these rows with “fa” (future asserted) in the state column of the table.

[2] The state and row # columns are not columns of the table itself. They are metadata about the rows of the table, just like the row # column in the tables shown in other chapters in this book.

There is one current row representing the object whose identifier is 55. This row is currently asserted and, within current assertion time, became effective in August 2009 and will remain in effect until further notice. Note, however, that it will remain asserted only until October 2012. At that time, if nothing in the data changes, the database will cease to say that the data for object 55 is Kiwi from August 2009 until further notice. Instead, it will say that data for object 55 is Kiwi from August 2009 to December 2013, and that from December 2013 until further notice, it will be Grapes. We designate this earlier, but current, row with “cc” (currently asserted current version) in the state metadata column of the table.

The SQL to retrieve the one current row for object 55 is:

SELECT data
FROM mytable
WHERE oid = 55
AND eff_beg_dt <= Now() AND eff_end_dt > Now()
AND asr_beg_dt <= Now() AND asr_end_dt > Now()

Most optimizers will use the index tree to locate the row id (rid) of the qualifying row or rows using, first of all, the columns that have direct matching predicates, such as EQUALS or IN, columns which are sometimes called match columns. These optimizers will also use the index tree for a column with a range predicate, such as BETWEEN or LESS THAN OR EQUAL TO (<=), provided that it is the first column in the index or the first column following the direct match columns.
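One way to observe this behavior is to ask a DBMS for its access plan. The sketch below uses SQLite through Python's sqlite3 module, purely as a convenient stand-in for the commercial DBMSs discussed in this chapter. The index name ix_demo and its column sequence (one match column followed by one range column) are assumptions made for illustration and are not a recommended index design; a bound parameter takes the place of Now().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (
    oid        INTEGER NOT NULL,
    eff_beg_dt TEXT NOT NULL,
    eff_end_dt TEXT NOT NULL,
    asr_beg_dt TEXT NOT NULL,
    asr_end_dt TEXT NOT NULL,
    data       TEXT
);
-- Illustrative index only: the match column first, then one range column.
CREATE INDEX ix_demo ON mytable (oid, eff_beg_dt);
""")

plan = conn.execute(
    """
    EXPLAIN QUERY PLAN
    SELECT data
    FROM mytable
    WHERE oid = :oid
      AND eff_beg_dt <= :now AND eff_end_dt > :now
      AND asr_beg_dt <= :now AND asr_end_dt > :now
    """,
    {"oid": 55, "now": "2011-09-01"},
).fetchall()

# The human-readable description is the last column of each plan row.
plan_details = [row[-1] for row in plan]
```

The plan text names the index that SQLite chose; the predicates on the remaining temporal columns are then applied as filters on the rows that the index search returns. The exact wording of the plan differs between SQLite versions, and other optimizers report their plans in their own formats.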

state  row #  oid  eff-beg  eff-end  asr-beg  asr-end  data
pa     1      55   Jan09    9999     Jan09    Feb09    Apples
pe     2      55   Jan09    Mar09    Feb09    9999     Apples
pa     3      55   Mar09    9999     Feb09    Jun09    Berries
pe     4      55   Mar09    Jun09    Jun09    9999     Berries
pa     5      55   Jun09    9999     Jun09    Aug09    Cherries
pe     6      55   Jun09    Aug09    Aug09    9999     Cherries
cc     7      55   Aug09    9999     Aug09    Oct12    Kiwi
fa     8      55   Aug09    Dec13    Oct12    9999     Kiwi
fa     9      55   Dec13    9999     Oct12    9999     Grapes

Figure 15.1 A Bi-Temporal Table
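The nine rows of Figure 15.1 and the query for the one current row can be exercised together in a small script. This is a sketch under stated assumptions: it uses SQLite through Python's sqlite3 module, stores each month as the ISO date of its first day, stores 9999 as 9999-12-31, and passes the current clock tick as a parameter in place of Now(). The function name current_data is hypothetical.

```python
import sqlite3

# (oid, eff_beg_dt, eff_end_dt, asr_beg_dt, asr_end_dt, data), rows 1 to 9
ROWS = [
    (55, "2009-01-01", "9999-12-31", "2009-01-01", "2009-02-01", "Apples"),
    (55, "2009-01-01", "2009-03-01", "2009-02-01", "9999-12-31", "Apples"),
    (55, "2009-03-01", "9999-12-31", "2009-02-01", "2009-06-01", "Berries"),
    (55, "2009-03-01", "2009-06-01", "2009-06-01", "9999-12-31", "Berries"),
    (55, "2009-06-01", "9999-12-31", "2009-06-01", "2009-08-01", "Cherries"),
    (55, "2009-06-01", "2009-08-01", "2009-08-01", "9999-12-31", "Cherries"),
    (55, "2009-08-01", "9999-12-31", "2009-08-01", "2012-10-01", "Kiwi"),
    (55, "2009-08-01", "2013-12-01", "2012-10-01", "9999-12-31", "Kiwi"),
    (55, "2013-12-01", "9999-12-31", "2012-10-01", "9999-12-31", "Grapes"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (oid INTEGER, eff_beg_dt TEXT, eff_end_dt TEXT,"
    " asr_beg_dt TEXT, asr_end_dt TEXT, data TEXT)"
)
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?, ?, ?)", ROWS)


def current_data(now: str) -> list:
    """Return the data of the currently asserted current version(s) of object 55."""
    rows = conn.execute(
        """
        SELECT data
        FROM mytable
        WHERE oid = 55
          AND eff_beg_dt <= :now AND eff_end_dt > :now
          AND asr_beg_dt <= :now AND asr_end_dt > :now
        """,
        {"now": now},
    ).fetchall()
    return [data for (data,) in rows]


september_2011 = current_data("2011-09-01")
october_2012 = current_data("2012-10-01")
december_2013 = current_data("2013-12-01")
```

With the clock at September 2011 the query returns only row 7 (Kiwi). Moving the clock to October 2012 returns Kiwi again, now from row 8, the deferred assertion that has become current, and moving it to December 2013 returns Grapes from row 9.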
