Document: Managing Time in Relational Databases, P20


SELECT data
FROM mytable
WHERE SSN = :my-ssn
AND eff_beg_dt <= :my-as-of-dt
AND eff_end_dt > :my-as-of-dt
AND asr_beg_dt <= :my-as-of-dt
AND asr_end_dt > :my-as-of-dt
AND circa_asr_flag IN ('Y', 'N')

In processing this query, a DB2 optimizer will first match on SSN. After that, still using the index tree rather than a scan, it will look aside for the effective end date under the 'Y' value for the circa flag, and then repeat the process for the 'N' value. This uses a matchcols of three, whereas without the IN clause an index scan would begin right after the SSN match. However, we recommend this only for SQL where :my-as-of-dt is not guaranteed to be Now(). When that as-of date is Now(), using the EQUALS predicate ({circa_asr_flag = 'Y'}) will perform much better, since the 'N's do not need to be analyzed.
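For the case where the as-of date is always the current moment, the query might be narrowed as in the following sketch. It reuses the table and column names from the query above; the SSN literal is made up, and CURRENT_DATE stands in for Now() on the assumption that the temporal columns are dates rather than timestamps.

```sql
-- Sketch only: the currently asserted, currently effective version for one SSN.
-- Assumes the mytable columns used in the query above.
SELECT data
FROM mytable
WHERE SSN = '123-45-6789'          -- hypothetical value
  AND circa_asr_flag = 'Y'         -- equality predicate: 'N' rows are not examined
  AND eff_beg_dt <= CURRENT_DATE
  AND eff_end_dt >  CURRENT_DATE
  AND asr_beg_dt <= CURRENT_DATE
  AND asr_end_dt >  CURRENT_DATE;
```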

Query-enhancing indexes like these are not always needed. For the most part, as we said earlier, these indexes are specifically designed to improve the performance of queries that are looking for the currently asserted current versions of the objects they are interested in, and in systems that require extremely high read performance.

Indexes to Optimize Temporal Referential Integrity

Temporal referential integrity (TRI) is enforced in two directions. On the insert or temporal expansion of a child managed object, or on a change in the parent object designated by its temporal foreign key, we must ensure that the parent object is present in every clock tick in which the child object is about to be present. On the deletion or temporal contraction of a parent managed object, we must RESTRICT, CASCADE or SET NULL that transformation so that it does not leave any “temporal orphans” after the transaction is complete.

In this section, we will discuss the performance considerations involved in creating indexes that support TRI checks on both parent and child managed objects.

Asserted Versioning’s Non-Unique Primary Keys

First, and most obviously, each parent table needs an index whose initial column will be that table’s object identifier (oid). The object identifier is also the initial column of the primary key (PK) of all asserted version tables. It is followed by two other primary key components, the effective begin date and the assertion begin date.

We need to remember that these physical PKs do not explicitly define the logical primary keys used by the AVF, because the AVF uses date ranges and not specific dates or pairs of dates. Because of this, a unique index on the primary key of an asserted version table does not guarantee temporal entity integrity. These primary keys guarantee physical uniqueness; they guarantee that no two rows will have identical primary key values. But they do not guarantee semantic uniqueness, because they do not prevent multiple rows with the same object identifier from specifying [overlapping] or otherwise [intersecting] time periods.
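A small sketch may make the gap between physical and semantic uniqueness concrete. The table name, the policy_type column and the data values are all hypothetical; only the three-part primary key follows the text. Date literal syntax varies by DBMS, so the standard DATE '...' form used here is also an assumption.

```sql
-- Hypothetical asserted version table; all names are illustrative.
CREATE TABLE policy_av (
    policy_oid  INTEGER NOT NULL,
    eff_beg_dt  DATE    NOT NULL,
    eff_end_dt  DATE    NOT NULL,
    asr_beg_dt  DATE    NOT NULL,
    asr_end_dt  DATE    NOT NULL,
    policy_type CHAR(3) NOT NULL,
    PRIMARY KEY (policy_oid, eff_beg_dt, asr_beg_dt)
);

-- Both inserts are accepted, because their primary key values differ
-- (the effective begin dates are not the same) ...
INSERT INTO policy_av
VALUES (55, DATE '2010-01-01', DATE '2010-07-01',
            DATE '2010-01-01', DATE '9999-12-31', 'HMO');
INSERT INTO policy_av
VALUES (55, DATE '2010-03-01', DATE '2010-09-01',
            DATE '2010-01-01', DATE '9999-12-31', 'PPO');

-- ... yet within the same assertion period, the two effective periods
-- overlap from 2010-03-01 up to 2010-07-01. The PK alone does not
-- prevent this; code such as the AVF has to.
```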

The PK of an asserted version table can be any column or combination of columns that physically distinguishes each row from all the other rows in the table. For example, the PK could be the object identifier plus a sequence number. It could be a single surrogate identity key column. It could be a business key plus the row create date. We have this freedom of choice because asserted version tables more clearly distinguish between semantically unique identifiers and physically unique identifiers than do conventional tables.

But this very freedom of choice poses a serious risk to any business deciding to implement its own Asserted Versioning framework. It is the risk of implementing Asserted Versioning’s concepts one project at a time, one database at a time, one set of queries and maintenance transactions at a time. It is the risk of proliferating point solutions, each of which may work correctly, but which together pose serious difficulties for queries that range across two or more of those databases. It is the risk of failing to create an enterprise implementation of bi-temporal data management.

The semantically unique identifier for any asserted version table is the combination of the table’s object identifier and its two time periods. And to emphasize this very important point once again: two pairs of dates are indeed used to represent two time periods, but they are not equivalent to two time periods. What turns those pairs of dates into time periods is the Asserted Versioning code which guarantees that they are treated as the begin and end delimiters for time periods.
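As an illustration of what treating date pairs as periods means in practice, the following sketch lists violations of semantic uniqueness after the fact. It is not the AVF’s own code. It assumes a hypothetical table policy_av with columns policy_oid, eff_beg_dt, eff_end_dt, asr_beg_dt and asr_end_dt, and it assumes closed-open periods (the begin date is included, the end date is not).

```sql
-- Sketch: pairs of rows for the same object whose effective periods
-- and assertion periods both intersect.
SELECT a.policy_oid,
       a.eff_beg_dt AS a_eff_beg, a.asr_beg_dt AS a_asr_beg,
       b.eff_beg_dt AS b_eff_beg, b.asr_beg_dt AS b_asr_beg
FROM policy_av a
JOIN policy_av b
  ON  a.policy_oid = b.policy_oid
  -- report each pair once, using the physical PK to order the two rows
  AND (a.eff_beg_dt < b.eff_beg_dt
       OR (a.eff_beg_dt = b.eff_beg_dt AND a.asr_beg_dt < b.asr_beg_dt))
WHERE a.eff_beg_dt < b.eff_end_dt
  AND b.eff_beg_dt < a.eff_end_dt      -- effective periods intersect
  AND a.asr_beg_dt < b.asr_end_dt
  AND b.asr_beg_dt < a.asr_end_dt;     -- assertion periods intersect
```

An empty result means that, for this table, every combination of object identifier, effective clock tick and assertion clock tick is claimed by at most one row.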

Given that there should be one enterprise-wide approach for Asserted Versioning primary keys, what should it be? First of all, an enterprise approach requires that the PK of an asserted version table must not contain any business data. The reason is that if business data were used, we could not guarantee that the same number of columns would be used as the PK from one asserted version table to the next, or that column datatypes would even be the same. These differences would completely eliminate the interoperability benefits which are one of the objectives of an enterprise implementation. But beyond that restriction, the choice of an enterprise standard for Asserted Versioning primary keys, in a proprietary implementation of Asserted Versioning concepts, is up to the organization implementing it.

We have now shown how the choice of columns beyond the object identifier (the effective end date and the assertion end date, and optionally a circa flag) is used to minimize scan costs in both indexes and the tables they index. We next consider, more specifically, indexes whose main purpose is to support the checks which enforce temporal referential integrity.

Indexes on TRI Parents

As we have explained, a temporal foreign key (TFK) never contains a full PK value. So it never points to a specific parent row. This is the principal way in which it is different from a conventional foreign key (FK), and the reason that current DBMSs cannot enforce temporal referential integrity.

A complete Asserted Versioning temporal foreign key is a combination of a column of data and a function. That column of data contains the object identifier of the object on which the child object is existence-dependent. That function interprets pairs of dates on the child row being created (by either an insert or an update transaction) as time periods, and pairs of dates on the parent episode as time periods. With that information, the AVF enforces TRI, ensuring that any transformation of the database will leave it in a state in which the full extent of a child version’s time periods is included within its parent episode’s time periods. It also enforces temporal entity integrity (TEI), ensuring that no two rows representing the same object ever share a pair of assertion time and effective time clock ticks.

The AVF needs an index on the parent table to boost the performance of its TRI enforcement code. We do not want to perform scans while trying to determine if a parent object identifier exists, and if the effective period of the dependent is included within a single episode of the parent. The most important part of this index on the parent table is that it starts with the object identifier.

The AVF uses the object identifier and three temporal dates. First, it uses the parent table’s episode begin date, rather than its effective begin date, because all TRI time period comparisons are between a child version and a parent episode. So we will consider the index sequence as described earlier to reduce scans, but then add the episode begin date.
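The parent-side lookup might be sketched as follows. This is our reading of the description above, not the AVF’s actual code. It assumes a hypothetical parent table client_av in which every version row carries its episode’s begin date (epis_beg_dt), that the versions of an episode are contiguous, and that periods are closed-open. It is also simplified to current assertions only. The literals stand in for the TFK value and the effective period of the child row being inserted.

```sql
-- Sketch: is the child's effective period [2012-03-01, 2012-09-01)
-- fully included in one currently asserted episode of parent 8094?
SELECT COUNT(*) AS covering_rows
FROM client_av p
WHERE p.client_oid   = 8094                  -- TFK value on the child row
  AND p.eff_end_dt  >= DATE '2012-09-01'     -- a version reaching the child's end ...
  AND p.epis_beg_dt <= DATE '2012-03-01'     -- ... whose episode began by the child's begin
  AND p.asr_beg_dt  <= CURRENT_DATE
  AND p.asr_end_dt  >  CURRENT_DATE;
-- A count of zero means the insert would create a temporal orphan.
```

Because the episode begin date is compared directly, no walk back through the earlier versions of the episode is needed, which is presumably why the text adds that column to the index.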

Instead of creating a separate index for TRI parent-side tables, we could try to minimize the number of indexes by re-using the primary key index to:

(i) Support uniqueness for a row, because some DBMS applications require a unique PK index for single-row identification;
(ii) Help the AVF perform well when an object is queried by its object identifier; and
(iii) Improve performance for the AVF during TRI enforcement.

So we recommend an index whose first column is the object identifier of the parent table. Our proposed index is now {oid, ...}. Next, we need to determine if we expect current data reads to the table to outnumber non-current reads or updates.

If we expect current data reads to dominate, then the next column we might choose to use is the circa flag. If this flag is used as a higher-level node in the index, then TRI maintenance in the AVF can use the {circa_asr_flag = 'Y'} predicate to ignore most of the rows in past assertion time. This could significantly help the performance of TRI maintenance. Using the circa flag, our proposed index is now {oid, circa_asr_flag, ...}. The assumption here is that the DBMS allows updates to a PK value with no physical foreign key dependents, because the circa flag will be updated.

Just as in any physical data modeling effort, the DBA or Data Architect will need to analyze the tradeoffs of indexing for reads vs. indexing for updates. The decision might be to replace a single multi-use index with several indexes, each supporting a different pattern of access. But in constructing an index to help the performance of TRI enforcement, the next column should be the effective end date, for the reasons described earlier in this chapter. Our proposed index is now {oid, circa_asr_flag, eff_end_dt, ...}.

After that, the sequence of the columns doesn’t matter much because the effective end date is used with a range predicate, so direct index matching stops there. However, other columns are needed for uniqueness, and the optimizer will still likely use any additional columns that are in the index and qualified as criteria, filtering on everything it got during the index scan rather than during the more expensive table scan.

If the circa flag is not included in the index, and the DBMS allows the update of a primary key (with no physical dependents), then the next column should be the assertion end date. Otherwise, the next column should be the assertion begin date. In either case, we now have a unique index, which can be used as the PK index, for queries and also for TRI enforcement.

Finally, to help with TRI enforcement, we recommend adding the episode begin date. This is because the parent managed object in any TRI relationship is always an episode.

Depending on whether or not the circa flag is included, this unique index is either

{oid, circa_asr_flag, eff_end_dt, asr_beg_dt, epis_beg_dt}

or

{oid, eff_end_dt, asr_end_dt, epis_beg_dt}

Let’s be sure we understand why both indexes are unique. The unique identifier of any object is the combination of its oid, assertion time period and effective time period. In the primary key of asserted version tables, those two time periods are represented by their respective begin dates. But because the AVF enforces temporal entity integrity, no two rows for the same object can share both an assertion clock tick and an effective clock tick. So in the case of these two indexes, while the assertion begin date (or, in the second index, the assertion end date) represents the assertion time period, the effective end date represents the effective time period. Both indexes contain an object identifier and one delimiter date representing each of the two time periods, and so both indexes are unique.
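In DDL, the two alternatives might be declared as in the following sketch. The table client_av and the index names are hypothetical; the column order is the one given above. Only one of the two indexes would be created in practice.

```sql
-- Hypothetical parent table; only the columns relevant to the index are shown.
CREATE TABLE client_av (
    client_oid     INTEGER NOT NULL,
    circa_asr_flag CHAR(1) NOT NULL,   -- 'Y' or 'N'
    eff_beg_dt     DATE    NOT NULL,
    eff_end_dt     DATE    NOT NULL,
    asr_beg_dt     DATE    NOT NULL,
    asr_end_dt     DATE    NOT NULL,
    epis_beg_dt    DATE    NOT NULL
);

-- Alternative 1: with the circa flag.
CREATE UNIQUE INDEX ix_client_av_circa
    ON client_av (client_oid, circa_asr_flag, eff_end_dt, asr_beg_dt, epis_beg_dt);

-- Alternative 2: without the circa flag.
CREATE UNIQUE INDEX ix_client_av_nocirca
    ON client_av (client_oid, eff_end_dt, asr_end_dt, epis_beg_dt);
```

Note that the UNIQUE keyword only rejects rows whose indexed values are identical. It relies on the AVF having already enforced temporal entity integrity, for the reasons just given.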

Indexes on TRI Children

Some DBMSs automatically create indexes for foreign keys declared to the DBMS, but others do not. Regardless, since Asserted Versioning does not declare its temporal foreign keys using SQL’s Data Definition Language (DDL), we must create our own indexes to improve the performance of TRI enforcement on TFKs.

Each table that contains a TFK should have an index on the TFK columns, primarily to assist with delete rule enforcement, such as ON DELETE RESTRICT, CASCADE or SET NULL. These indexes can be multi-purpose as well, also being used to assist with general queries that use the oid value of the TFK. We should try to design these indexes to support both cases in order to minimize the system overhead otherwise required to maintain multiple indexes.

When a temporal delete rule is fired from the parent, it will look at every dependent table that uses the parent’s oid. It will also use the four temporal dates to find rows that fall within the assertion and effective periods of the related parent.

The predicate to find dependents in any contained clock tick would look something like this:

WHERE parent_oid = :parent-oid
AND eff_beg_dt < :parent-eff-end-dt
AND eff_end_dt > :parent-eff-beg-dt
AND circa_asr_flag = 'Y'   -- if used
AND asr_end_dt >= Now()
-- (might have deferred assertion criteria, too)

In this SQL, the columns designated as parent dates are the effective begin and end dates specified on the delete transaction.

In an index designed to enhance the performance of the search for TRI parent–child relationships, the first column should be the TFK. This is the oid value that relates a child to a parent.

Temporal referential integrity checks are never concerned with withdrawn assertions, so this is another index in which the circa flag will help performance. If we use this flag, it should be the next column in the index. However, if this is the column that will be used for clustering or partitioning, the circa flag should be listed first, before the oid.

For TRI enforcement, the AVF does not use a simple BETWEEN predicate because it needs to find dependents with any overlapping clock ticks. Instead, it uses an [intersects] predicate.

Two rules used during TRI delete enforcement are that the effective begin date on the episode must be less than the effective end date specified on the delete transaction, and that the effective end date on the episode must be greater than the effective begin date on the transaction.
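A worked example of the [intersects] test may help. The child table, its rows and the delete range are invented for illustration; the predicate is the one shown earlier in this section, without the optional circa flag, and with CURRENT_DATE standing in for Now().

```sql
-- Hypothetical child table whose TFK is client_oid.
CREATE TABLE policy_av (
    policy_oid INTEGER NOT NULL,
    client_oid INTEGER NOT NULL,       -- the temporal foreign key
    eff_beg_dt DATE    NOT NULL,
    eff_end_dt DATE    NOT NULL,
    asr_beg_dt DATE    NOT NULL,
    asr_end_dt DATE    NOT NULL
);

INSERT INTO policy_av VALUES
    (101, 8094, DATE '2012-01-01', DATE '2012-05-01', DATE '2012-01-01', DATE '9999-12-31'),
    (102, 8094, DATE '2012-07-01', DATE '2012-12-01', DATE '2012-01-01', DATE '9999-12-31'),
    (103, 8094, DATE '2012-05-01', DATE '2012-06-01', DATE '2012-01-01', DATE '9999-12-31');

-- Temporal delete of parent 8094 over the effective range [2012-04-01, 2012-07-01).
SELECT policy_oid
FROM policy_av
WHERE client_oid = 8094
  AND eff_beg_dt < DATE '2012-07-01'   -- child begins before the delete range ends
  AND eff_end_dt > DATE '2012-04-01'   -- child ends after the delete range begins
  AND asr_end_dt >= CURRENT_DATE;
-- Policies 101 and 103 qualify. Policy 102 begins exactly where the
-- delete range ends, shares no clock tick with it, and is not returned.
```

Policy 101 is the case a containment test would miss: it starts before the delete range yet still shares clock ticks with it, so the delete rule has to act on it.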

Earlier, we pointed out that for current data queries, there are usually many more historical rows than current and future rows, and for that reason we made the next column the effective end date rather than the effective begin date. These same considerations hold true for indexes assisting with temporal delete transactions.

Therefore, our recommended index structure for TFK indexes, which can be used for both TRI enforcement by the AVF, and also for any queries looking for parent object and child object relationships, where the oid mentioned is the TFK value, is either {parent_oid, circa_asr_flag, eff_end_dt, ...} or {parent_oid, eff_end_dt, asr_end_dt, ...}.

Other temporal columns could be added, depending on application-specific uses for the index.
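Declared in DDL, the two recommended TFK indexes might look like the sketch below. The table and index names are hypothetical, and the table shows only the columns the indexes need. These indexes are not declared unique, since many child rows may reference the same parent over the same period.

```sql
-- Hypothetical child table; client_oid plays the role of parent_oid.
CREATE TABLE policy_av (
    policy_oid     INTEGER NOT NULL,
    client_oid     INTEGER NOT NULL,   -- the temporal foreign key
    circa_asr_flag CHAR(1) NOT NULL,
    eff_beg_dt     DATE    NOT NULL,
    eff_end_dt     DATE    NOT NULL,
    asr_beg_dt     DATE    NOT NULL,
    asr_end_dt     DATE    NOT NULL
);

-- With the circa flag:
CREATE INDEX ix_policy_av_tfk_circa
    ON policy_av (client_oid, circa_asr_flag, eff_end_dt);

-- Without the circa flag:
CREATE INDEX ix_policy_av_tfk_nocirca
    ON policy_av (client_oid, eff_end_dt, asr_end_dt);
```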


Other Techniques for Performance Tuning Bi-Temporal Tables

In an Asserted Versioning database, most of the activity is row insertion. No rows are physically deleted; and except for the update of the assertion end date when an assertion is withdrawn, or the update of the assertion begin date when far future deferred assertions are moved into the near future, there are no physical updates either. On the other hand, there are plenty of reads, usually to current data. We need to consider these types of access, and their relative frequencies, when we decide which optimization techniques to use.

Avoiding MAX(dt) Predicates

Even if Asserted Versioning did not support logical gap versioning, we would keep both effective end dates and assertion end dates in the Asserted Versioning bi-temporal schema. The reason is that, without them, most accesses to these tables would require finding the MAX(dt) of the designated object in assertion time, or in effective time within a specified period of assertion time. The performance problem with a MAX(dt) is that it needs to be evaluated for each row that is looked at, so the cost climbs steeply with the number of rows reviewed. Experience with the AVF and our Asserted Versioning databases has shown us that eliminating MAX(dt) subqueries, and having effective and assertion end dates on asserted version tables, dramatically improves performance.
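To make the contrast concrete, here is a sketch of the same as-of lookup written both ways. It is simplified to effective time only. Both tables are hypothetical: policy_begin_only stores just a begin date per version, while policy_av also stores the end date.

```sql
-- Without an end date: the version in effect on 2012-06-15 is the one with
-- the latest begin date on or before that day, found by a correlated subquery.
SELECT v.policy_oid, v.policy_type
FROM policy_begin_only v
WHERE v.policy_oid = 55
  AND v.eff_beg_dt = (SELECT MAX(x.eff_beg_dt)
                      FROM policy_begin_only x
                      WHERE x.policy_oid = v.policy_oid
                        AND x.eff_beg_dt <= DATE '2012-06-15');

-- With an end date: two plain range predicates and no subquery.
SELECT v.policy_oid, v.policy_type
FROM policy_av v
WHERE v.policy_oid = 55
  AND v.eff_beg_dt <= DATE '2012-06-15'
  AND v.eff_end_dt >  DATE '2012-06-15';
```

The first form also cannot represent a gap between versions: without an end date, every version is taken to run until the next one begins.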

NULL vs. 12/31/9999

Some readers might wonder why we do not use nulls to stand in for unknown future dates, whether effective end dates or assertion end dates. From a logical point of view, NULL, which is a marker representing the absence of information, is what we should use in these date columns whenever we do not know what those future dates will be.

But experience with the AVF and with Asserted Versioning databases has shown that using real dates rather than nulls helps the optimizer to consistently choose better, more efficient access paths, and matches on index keys more directly.

Without using NULL, the predicate to find versions that are still in effect is:

eff_end_dt > Now()


Using NULL, the semantically identical predicate is:

(eff_end_dt > Now() OR eff_end_dt IS NULL)

The OR in the second example causes the optimizer to try one path and then another. It might use index look-aside, or it might scan. Either of these is less efficient than a single GREATER THAN comparison.

Another approach we considered is to coalesce NULL to the latest date recognizable by the DBMS, giving us the following predicate:

COALESCE(eff_end_dt, '12/31/9999') > Now()

But functions normally cannot be resolved in standard indexes, and so the COALESCE function will normally cause a scan. Worse yet, some DBMSs will not resolve functions until all data is read and joined. So frequently, a lot of extra data will be assembled into a preliminary result set before this COALESCE function is ever applied.

The last of our three options is a simple range predicate (such as GREATER THAN) without an OR, and without a function. If the end date is unknown, and the value we use to represent that unknown condition is the highest date (or timestamp) which the DBMS can recognize, then this simple range predicate will return the same results as the other two predicates. And given that the highest date a DBMS can recognize is likely to be far into the future, it is unlikely that business applications will ever need to use that date to represent that far-off time. In SQL Server, for example, that highest date is 12/31/9999. So as long as our business applications do not need to designate that specific New Year’s Eve nearly 8000 years from now, we are free to use it to represent the fact that a value is unknown. Using it, we can use the simple range predicate shown earlier in this section, and reap the benefits of the excellent performance of that kind of predicate.
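One way to apply this convention, sketched here with hypothetical table and column names, is to declare the end dates NOT NULL and let them default to the highest date. The DEFAULT clause and the DATE literal form are standard SQL, but exact syntax varies by DBMS.

```sql
-- Sketch: end dates can never be NULL and default to the highest date.
CREATE TABLE policy_av (
    policy_oid  INTEGER NOT NULL,
    eff_beg_dt  DATE    NOT NULL,
    eff_end_dt  DATE    DEFAULT DATE '9999-12-31' NOT NULL,
    asr_beg_dt  DATE    NOT NULL,
    asr_end_dt  DATE    DEFAULT DATE '9999-12-31' NOT NULL,
    policy_type CHAR(3) NOT NULL,
    PRIMARY KEY (policy_oid, eff_beg_dt, asr_beg_dt)
);

-- An open-ended version: neither end date is supplied.
INSERT INTO policy_av (policy_oid, eff_beg_dt, asr_beg_dt, policy_type)
VALUES (55, DATE '2012-01-01', DATE '2012-01-01', 'HMO');

-- The simple range predicate finds it, with no OR and no function.
SELECT policy_oid, policy_type
FROM policy_av
WHERE eff_end_dt > CURRENT_DATE;
```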

Partitioning

Another technique that can help with performance and database maintenance, such as backups, recoveries and reorganizations, is partitioning. There are several basic approaches to partitioning. One is to partition by a date, or something similar, so that the more current and active data is grouped together, and is more likely to be found in cache. This is a common partitioning strategy for on-line transaction processing systems.

Another is to partition by some known field that could keep commonly accessed smaller groups of data together, such as a low cardinality foreign key. The benefit of this approach is that it directs a search to a small well-focused collection of data located on the same or on adjacent I/O pages. This strategy improves performance by taking advantage of sequential prefetch algorithms.

A third approach is to partition by some random field to take advantage of the parallelism in data access that some DBMSs support. For these DBMSs, the partitions define parallel access paths. This is a good strategy for applications such as reporting and business intelligence (BI), where typically large scans could benefit from the parallel processing made possible by the partitioning.

Some DBMSs require that the partitioning index also be the clustering index. This limits options because it forces a trade-off between optimizing for sequential prefetch and optimizing for parallel access. Fortunately, DBMS vendors are starting to separate the implementation of these two requirements.

Another limitation of some DBMSs, but one that is gradually being addressed by their vendors, is that whenever a row is moved between partitions, those entire partitions are both locked. This forces application developers to design their processes so that they never update a partitioning key value on a row during prime time, because doing so locks the source and destination partitions until the move is complete. As we noted, more recent releases of DBMSs reduce the locking required to move a row from one partition to another.

A good partitioning strategy for an Asserted Versioning database is to partition by one of the temporal columns, such as the assertion end date, in order to keep the most frequently accessed data in cache. As we have pointed out, that will normally be currently asserted current versions of the objects of interest to the query.

For an optimizer to know which partition(s) to access, it needs to know the high order of the key. For direct access to the other search criteria, it needs direct access to the higher nodes in the key, higher than the search key. Therefore, while one of the temporal dates is good for partitioning, it reduces the effectiveness of other search criteria. To avoid this problem, we might want to define two indexes, one for partitioning, and another for searching.

The better solution for defining partitions that optimize access to currently asserted versions is to use the circa flag as the first column in the partitioning index. The best predicate would be {circa_asr_flag = 'Y'} for current assertions. For DBMSs which support index-look-aside processing for IN predicates, the best predicate might be {circa_asr_flag IN ('Y', 'N')} when it is uncertain if the version is currently asserted. With this predicate, the index can support searches for past assertions as well as searches for current ones. Otherwise, it will require a separate index to support searches for past assertions.
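Partitioning DDL differs considerably from one DBMS to another. Purely as an illustration, the following sketch uses PostgreSQL’s declarative partitioning syntax, which partitions on the column itself rather than through a partitioning index; DB2 and other products use different statements. The table and partition names are hypothetical.

```sql
-- Sketch (PostgreSQL syntax): currently asserted rows in one partition,
-- withdrawn or superseded assertions in another.
CREATE TABLE policy_av (
    policy_oid     INTEGER NOT NULL,
    circa_asr_flag CHAR(1) NOT NULL,   -- 'Y' or 'N'
    eff_beg_dt     DATE    NOT NULL,
    eff_end_dt     DATE    NOT NULL,
    asr_beg_dt     DATE    NOT NULL,
    asr_end_dt     DATE    NOT NULL
) PARTITION BY LIST (circa_asr_flag);

CREATE TABLE policy_av_current PARTITION OF policy_av FOR VALUES IN ('Y');
CREATE TABLE policy_av_past    PARTITION OF policy_av FOR VALUES IN ('N');

-- A query carrying the equality predicate needs only the 'Y' partition.
SELECT policy_oid
FROM policy_av
WHERE circa_asr_flag = 'Y'
  AND eff_end_dt > CURRENT_DATE;
```

With this layout, changing a row’s circa flag from 'Y' to 'N' moves the row between partitions, which is exactly the kind of update whose locking cost was discussed above.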

Clustering

Clustering and partitioning often go together, depending on the reason for partitioning and the way in which specific DBMSs support it. Whether or not partitioning is used, choosing the best clustering sequence can dramatically reduce I/O and improve performance.

The general concept behind clustering is that as the database is modified, the DBMS will attempt to keep the data on physical pages in the same order as that specified in the clustering index. But each DBMS does this a little differently. One DBMS will cluster each time an insert or update is processed. Another will make a valiant attempt to do that. A third will only cluster when the table is reorganized. But regardless of the approach, the result is to reduce physical I/O by locating data that is frequently accessed together as physically close together as possible.

Early DBMSs only allowed one clustering index, but newer releases often support multiple clustering sequences, sometimes called indexed views or multi-dimensional clustering.

It is important to determine the most frequently used access paths to the data. Often the most frequently used access paths are ones based on one or more foreign keys. For asserted version tables, currently asserted current versions are usually the most frequently queried data.

Sometimes, the right combination of foreign keys can provide good clustering for more than one access path. For example, suppose that a policy table has two low cardinality TFKs, product type and market segment, and that each TFK value has thousands of related policies (see note 3). We might then create this clustering index:

{circa_asr_flag, product_type_oid, market_segment_oid, eff_end_dt, policy_oid}
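How a clustering index is declared depends on the DBMS. As one illustration only, the sketch below uses PostgreSQL syntax, where an ordinary index is created and the table is then physically reordered by it with the CLUSTER command. PostgreSQL does not maintain that order on later inserts, so it matches the case described above of a DBMS that clusters only when the table is reorganized. The table and index names are hypothetical.

```sql
-- Hypothetical policy table with the two low cardinality TFKs.
CREATE TABLE policy_av (
    policy_oid         INTEGER NOT NULL,
    product_type_oid   INTEGER NOT NULL,
    market_segment_oid INTEGER NOT NULL,
    circa_asr_flag     CHAR(1) NOT NULL,
    eff_beg_dt         DATE    NOT NULL,
    eff_end_dt         DATE    NOT NULL,
    asr_beg_dt         DATE    NOT NULL,
    asr_end_dt         DATE    NOT NULL
);

-- The clustering sequence from the text, declared as an ordinary index.
CREATE INDEX ix_policy_av_cluster
    ON policy_av (circa_asr_flag, product_type_oid, market_segment_oid,
                  eff_end_dt, policy_oid);

-- One-time physical reordering of the table by that index.
CLUSTER policy_av USING ix_policy_av_cluster;
```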

The circa flag would cluster most of the currently asserted rows together, keeping them physically co-located under the lower cardinality columns. Clustering would continue based on

3 Low cardinality means that there are fewer distinct values for the field in the table, which results in more rows sharing a single value.
