SELECT data
FROM mytable
WHERE SSN = :my-ssn
AND eff_beg_dt <= :my-as-of-dt
AND eff_end_dt > :my-as-of-dt
AND asr_beg_dt <= :my-as-of-dt
AND asr_end_dt > :my-as-of-dt
AND circa_asr_flag IN ('Y', 'N')
In processing this query, a DB2 optimizer will first match on SSN. After that, still using the index tree rather than a scan, it will look aside for the effective end date under the 'Y' value for the circa flag, and then repeat the process for the 'N' value. This uses a matchcols of three, whereas without the IN clause, an index scan would begin right after the SSN match. However, we recommend this only for SQL where :my-as-of-dt is not guaranteed to be Now(). When that as-of date is Now(), using the EQUALS predicate ({circa_asr_flag = 'Y'}) will perform much better, since the 'N' rows do not need to be analyzed.
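The relationship between the two predicates can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: SQLite stands in for DB2 and does not reproduce DB2's index look-aside behavior, dates are stored as ISO-8601 text, and the index name ix_mytable, the helper as_of and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mytable (
        ssn TEXT, eff_beg_dt TEXT, eff_end_dt TEXT,
        asr_beg_dt TEXT, asr_end_dt TEXT, circa_asr_flag TEXT, data TEXT
    )
""")
# Hypothetical index: match column, then circa flag, then effective end date.
conn.execute("CREATE INDEX ix_mytable ON mytable (ssn, circa_asr_flag, eff_end_dt)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?, ?, ?, ?)", [
    # An assertion withdrawn on 2010-01-01, so its circa flag is 'N'.
    ("123", "2009-01-01", "9999-12-31", "2009-01-01", "2010-01-01", "N", "old claim"),
    # The assertion that replaced it and is still current, so its flag is 'Y'.
    ("123", "2009-01-01", "9999-12-31", "2010-01-01", "9999-12-31", "Y", "current claim"),
])

def as_of(dt, flags):
    """Run the as-of query for one date, restricted to the given circa flag values."""
    marks = ", ".join("?" for _ in flags)
    sql = f"""SELECT data FROM mytable
               WHERE ssn = ? AND circa_asr_flag IN ({marks})
                 AND eff_beg_dt <= ? AND eff_end_dt > ?
                 AND asr_beg_dt <= ? AND asr_end_dt > ?"""
    return [r[0] for r in conn.execute(sql, ("123", *flags, dt, dt, dt, dt))]

# A past as-of date needs both flag values to find the withdrawn assertion.
print(as_of("2009-06-01", ["Y", "N"]))
print(as_of("2009-06-01", ["Y"]))
# For a current as-of date, the 'Y' rows alone give the same answer.
print(as_of("2020-01-01", ["Y"]), as_of("2020-01-01", ["Y", "N"]))
```

With a current as-of date, the two forms return the same rows, which is why the equality predicate is safe there and the IN form is only needed when the as-of date may lie in the past.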
Query-enhancing indexes like these are not always needed. For the most part, as we said earlier, these indexes are specifically designed to improve the performance of queries that are looking for the currently asserted current versions of the objects they are interested in, and in systems that require extremely high read performance.
Indexes to Optimize Temporal Referential Integrity
Temporal referential integrity (TRI) is enforced in two directions. On the insert or temporal expansion of a child managed object, or on a change in the parent object designated by its temporal foreign key, we must ensure that the parent object is present in every clock tick in which the child object is about to be present. On the deletion or temporal contraction of a parent managed object, we must RESTRICT, CASCADE or SET NULL that transformation so that it does not leave any “temporal orphans” after the transaction is complete.
In this section, we will discuss the performance considerations involved in creating indexes that support TRI checks on both parent and child managed objects.
Asserted Versioning’s Non-Unique Primary Keys
First, and most obviously, each parent table needs an index whose initial column will be that table’s object identifier (oid). The object identifier is also the initial column of the primary key (PK) of all asserted version tables. It is followed by two other primary key components, the effective begin date and the assertion begin date.
We need to remember that these physical PKs do not explicitly define the logical primary keys used by the AVF, because the AVF uses date ranges and not specific dates or pairs of dates. Because of this, a unique index on the primary key of an asserted version table does not guarantee temporal entity integrity. These primary keys guarantee physical uniqueness; they guarantee that no two rows will have identical primary key values. But they do not guarantee semantic uniqueness, because they do not prevent multiple rows with the same object identifier from specifying [overlapping] or otherwise [intersecting] time periods.
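A minimal sketch of this point, using Python's built-in sqlite3 module rather than a production DBMS. The table name policy_av, the sample rows and the overlap-detecting query are illustrative assumptions and are not part of the AVF; dates are stored as ISO-8601 text so that string comparison follows date order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical asserted version table with the physical PK described above.
conn.execute("""
    CREATE TABLE policy_av (
        oid        INTEGER NOT NULL,
        eff_beg_dt TEXT    NOT NULL,
        eff_end_dt TEXT    NOT NULL,
        asr_beg_dt TEXT    NOT NULL,
        asr_end_dt TEXT    NOT NULL,
        PRIMARY KEY (oid, eff_beg_dt, asr_beg_dt)
    )
""")

rows = [
    (1, "2010-01-01", "2010-07-01", "2010-01-01", "9999-12-31"),
    # A different effective begin date, so the PK accepts this row, yet its
    # effective period (Apr-Oct) overlaps the first row's (Jan-Jul).
    (1, "2010-04-01", "2010-10-01", "2010-01-01", "9999-12-31"),
]
conn.executemany("INSERT INTO policy_av VALUES (?, ?, ?, ?, ?)", rows)

# Pairs of rows for the same object whose effective periods AND assertion
# periods both intersect: the violations the unique PK index let through.
overlaps = conn.execute("""
    SELECT a.oid, a.eff_beg_dt, b.eff_beg_dt
      FROM policy_av a
      JOIN policy_av b
        ON a.oid = b.oid
       AND a.rowid < b.rowid
       AND a.eff_beg_dt < b.eff_end_dt
       AND b.eff_beg_dt < a.eff_end_dt
       AND a.asr_beg_dt < b.asr_end_dt
       AND b.asr_beg_dt < a.asr_end_dt
""").fetchall()
print(overlaps)
```

Both inserts succeed because the two primary key values differ, and only a period-aware check of the kind shown in the final query reveals the conflict.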
The PK of an asserted version table can be any column or combination of columns that physically distinguish each row from all the other rows in the table. For example, the PK could be the object identifier plus a sequence number. It could be a single surrogate identity key column. It could be a business key plus the row create date. We have this freedom of choice because asserted version tables more clearly distinguish between semantically unique identifiers and physically unique identifiers than do conventional tables.
But this very freedom of choice poses a serious risk to any business deciding to implement its own Asserted Versioning framework. It is the risk of implementing Asserted Versioning’s concepts one project at a time, one database at a time, one set of queries and maintenance transactions at a time. It is the risk of proliferating point solutions, each of which may work correctly, but which together pose serious difficulties for queries which range across two or more of those databases. It is the risk of failing to create an enterprise implementation of bi-temporal data management.
The semantically unique identifier for any asserted version table is the combination of the table’s object identifier and its two time periods. And to emphasize this very important point once again: two pairs of dates are indeed used to represent two time periods, but they are not equivalent to two time periods. What turns those pairs of dates into time periods is the Asserted Versioning code which guarantees that they are treated as the begin and end delimiters for time periods.
Given that there should be one enterprise-wide approach for Asserted Versioning primary keys, what should it be? First of all, an enterprise approach requires that the PK of an asserted version table must not contain any business data. The reason is that if business data were used, we could not guarantee that the same number of columns would be used as the PK from one asserted version table to the next, or that column datatypes would even be the same. These differences would completely eliminate the interoperability benefits which are one of the objectives of an enterprise implementation. But beyond that restriction, the choice of an enterprise standard for Asserted Versioning primary keys, in a proprietary implementation of Asserted Versioning concepts, is up to the organization implementing it.
We have now shown how the choice of columns beyond the object identifier (the effective end date and the assertion end date, and optionally a circa flag) is used to minimize scan costs in both indexes and the tables they index. We next consider, more specifically, indexes whose main purpose is to support the checks which enforce temporal referential integrity.
Indexes on TRI Parents
As we have explained, a temporal foreign key (TFK) never contains a full PK value. So it never points to a specific parent row. This is the principal way in which it is different from a conventional foreign key (FK), and the reason that current DBMSs cannot enforce temporal referential integrity.
A complete Asserted Versioning temporal foreign key is a combination of a column of data and a function. That column of data contains the object identifier of the object on which the child object is existence-dependent. That function interprets pairs of dates on the child row being created (by either an insert or an update transaction) as time periods, and pairs of dates on the parent episode as time periods. With that information, the AVF enforces TRI, ensuring that any transformation of the database will leave it in a state in which the full extent of a child version’s time periods is included within its parent episode’s time periods. It also enforces temporal entity integrity (TEI), ensuring that no two rows representing the same object ever share a pair of assertion time and effective time clock ticks.
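The "function" half of a TFK can be pictured with a short Python sketch. It is not the AVF's code: the function names includes and tri_satisfied, the dictionary layout and the sample dates are all hypothetical. It assumes that periods are half-open (begin date inclusive, end date exclusive), which is consistent with the {beg <= as-of AND end > as-of} predicates used in this chapter.

```python
from datetime import date

def includes(outer_beg, outer_end, inner_beg, inner_end):
    """True if half-open period [inner_beg, inner_end) lies within [outer_beg, outer_end)."""
    return outer_beg <= inner_beg and inner_end <= outer_end

def tri_satisfied(child, parent_episodes):
    """True if one parent episode covers the child in both effective and assertion time."""
    return any(
        includes(p["eff_beg"], p["eff_end"], child["eff_beg"], child["eff_end"])
        and includes(p["asr_beg"], p["asr_end"], child["asr_beg"], child["asr_end"])
        for p in parent_episodes
    )

HIGH = date(9999, 12, 31)  # stand-in for "until further notice"

# One parent episode, in effect and asserted from 2010-01-01 onward.
parent_episodes = [
    {"eff_beg": date(2010, 1, 1), "eff_end": HIGH,
     "asr_beg": date(2010, 1, 1), "asr_end": HIGH},
]
child_ok = {"eff_beg": date(2010, 3, 1), "eff_end": date(2010, 9, 1),
            "asr_beg": date(2010, 3, 1), "asr_end": HIGH}
# Starts in effect before its parent does, so it would be a temporal orphan.
child_orphan = {"eff_beg": date(2009, 6, 1), "eff_end": date(2010, 9, 1),
                "asr_beg": date(2010, 3, 1), "asr_end": HIGH}

print(tri_satisfied(child_ok, parent_episodes))
print(tri_satisfied(child_orphan, parent_episodes))
```

The check is made against a whole parent episode rather than against a single parent row, which is why an index supporting it benefits from carrying the episode begin date.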
The AVF needs an index on the parent table to boost the performance of its TRI enforcement code. We do not want to perform scans while trying to determine if a parent object identifier exists, and if the effective period of the dependent is included within a single episode of the parent. The most important part of this index on the parent table is that it starts with the object identifier.
The AVF uses the object identifier and three temporal dates. First, it uses the parent table’s episode begin date, rather than its effective begin date, because all TRI time period comparisons are between a child version and a parent episode. So we will consider the index sequence as described earlier to reduce scans, but then add the episode begin date.
Instead of creating a separate index for TRI parent-side
tables, we could try to minimize the number of indexes by
re-using the primary key index to:
(i) Support uniqueness for a row, because some DBMS
applications require a unique PK index for single-row
identification;
(ii) Help the AVF perform well when an object is queried by its
object identifier; and
(iii) Improve performance for the AVF during TRI enforcement.
So we recommend an index whose first column is the object identifier of the parent table. Our proposed index is now {oid, ...}. Next, we need to determine if we expect current data reads to the table to outnumber non-current reads or updates.
If we expect current data reads to dominate, then the next column we might choose to use is the circa flag. If this flag is used as a higher-level node in the index, then TRI maintenance in the AVF can use the {circa_asr_flag = 'Y'} predicate to ignore most of the rows in past assertion time. This could significantly help the performance of TRI maintenance. Using the circa flag, our proposed index is now {oid, circa_asr_flag, ...}. The assumption here is that the DBMS allows updates to a PK value with no physical foreign key dependents, because the circa flag will be updated.
Just as in any physical data modeling effort, the DBA or Data Architect will need to analyze the tradeoffs of indexing for reads vs. indexing for updates. The decision might be to replace a single multi-use index with several indexes, each supporting a different pattern of access. But in constructing an index to help the performance of TRI enforcement, the next column should be the effective end date, for the reasons described earlier in this chapter. Our proposed index is now {oid, circa_asr_flag, eff_end_dt, ...}.
After that, the sequence of the columns doesn’t matter much because the effective end date is used with a range predicate, so direct index matching stops there. However, other columns are needed for uniqueness, and the optimizer will still likely use any additional columns that are in the index and qualified as criteria, filtering on everything it got during the index scan rather than during the more expensive table scan.
If the circa flag is not included in the index, and the DBMS allows the update of a primary key (with no physical dependents), then the next column should be the assertion end date. Otherwise, the next column should be the assertion begin date. In either case, we now have a unique index, which can be used as the PK index, for queries and also for TRI enforcement. Finally, to help with TRI enforcement, we recommend adding the episode begin date. This is because the parent managed object in any TRI relationship is always an episode.
Depending on whether or not the circa flag is included, this unique index is either
{oid, circa_asr_flag, eff_end_dt, asr_beg_dt, epis_beg_dt}
or
{oid, eff_end_dt, asr_end_dt, epis_beg_dt}
Let’s be sure we understand why both indexes are unique. The unique identifier of any object is the combination of its oid, assertion time period and effective time period. In the primary key of asserted version tables, those two time periods are represented by their respective begin dates. But because the AVF enforces temporal entity integrity, no two rows for the same object can share both an assertion clock tick and an effective clock tick. So in the case of these two indexes, while the assertion begin date represents the assertion time period, the effective end date represents the effective time period. Both indexes contain an object identifier and one delimiter date representing each of the two time periods, and so both indexes are unique.
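As a rough illustration, the first of these two indexes can be declared and exercised with Python's built-in sqlite3 module. SQLite is only a stand-in here: its optimizer differs from DB2's, so the plan it reports says nothing about matchcols or look-aside behavior. The table name parent_av and index name ix_parent_tri are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical parent table carrying the columns named in the text.
conn.execute("""
    CREATE TABLE parent_av (
        oid            INTEGER NOT NULL,
        circa_asr_flag TEXT    NOT NULL,
        eff_beg_dt     TEXT    NOT NULL,
        eff_end_dt     TEXT    NOT NULL,
        asr_beg_dt     TEXT    NOT NULL,
        asr_end_dt     TEXT    NOT NULL,
        epis_beg_dt    TEXT    NOT NULL
    )
""")
# The unique index recommended when the circa flag is included.
conn.execute("""
    CREATE UNIQUE INDEX ix_parent_tri
        ON parent_av (oid, circa_asr_flag, eff_end_dt, asr_beg_dt, epis_beg_dt)
""")

# A parent-side lookup: equality on oid and the circa flag, then a range
# predicate on the effective end date.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT epis_beg_dt, eff_end_dt
      FROM parent_av
     WHERE oid = ? AND circa_asr_flag = 'Y' AND eff_end_dt > ?
""", (1, "2010-03-01")).fetchall()

# The human-readable plan text is the last column of each plan row.
plan_details = [row[-1] for row in plan]
print(plan_details)
```

The plan text varies between SQLite versions, so it is best read as a sanity check that the leading columns of the index line up with the predicates, not as a performance measurement.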
Indexes on TRI Children
Some DBMSs automatically create indexes for foreign keys declared to the DBMS, but others do not. Regardless, since Asserted Versioning does not declare its temporal foreign keys using SQL’s Data Definition Language (DDL), we must create our own indexes to improve the performance of TRI enforcement on TFKs.
Each table that contains a TFK should have an index on the TFK columns, primarily to assist with delete rule enforcement, such as ON DELETE RESTRICT, CASCADE or SET NULL. These indexes can be multi-purpose as well, also being used to assist with general queries that use the oid value of the TFK. We should try to design these indexes to support both cases in order to minimize the system overhead otherwise required to maintain multiple indexes.
When a temporal delete rule is fired from the parent, it will look at every dependent table that uses the parent’s oid. It will also use the four temporal dates to find rows that fall within the assertion and effective periods of the related parent. The predicate to find dependents in any contained clock tick would look something like this:
WHERE parent_oid = :parent-oid
AND eff_beg_dt < :parent-eff-end-dt
AND eff_end_dt > :parent-eff-beg-dt
AND circa_asr_flag = 'Y' (if used)
AND asr_end_dt >= Now()
(might have deferred assertion criteria, too)
In this SQL, the columns designated as parent dates are the effective begin and end dates specified on the delete transaction.
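The predicate above can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions, not the AVF's own SQL: the child table claim_av, the index ix_claim_tfk, the parameter names and the sample rows are hypothetical, Now() is passed in as a parameter, and dates are stored as ISO-8601 text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE claim_av (
        claim_oid      INTEGER NOT NULL,
        parent_oid     INTEGER NOT NULL,  -- the temporal foreign key
        eff_beg_dt     TEXT    NOT NULL,
        eff_end_dt     TEXT    NOT NULL,
        asr_end_dt     TEXT    NOT NULL,
        circa_asr_flag TEXT    NOT NULL
    )
""")
# Hypothetical child-side index led by the TFK.
conn.execute(
    "CREATE INDEX ix_claim_tfk ON claim_av (parent_oid, circa_asr_flag, eff_end_dt)"
)
conn.executemany("INSERT INTO claim_av VALUES (?, ?, ?, ?, ?, ?)", [
    (101, 1, "2010-01-01", "2010-06-01", "9999-12-31", "Y"),  # overlaps the delete range
    (102, 1, "2010-09-01", "9999-12-31", "9999-12-31", "Y"),  # begins after the range ends
    (103, 1, "2010-02-01", "2010-04-01", "2010-03-15", "N"),  # withdrawn assertion
    (104, 2, "2010-01-01", "9999-12-31", "9999-12-31", "Y"),  # belongs to another parent
])

# Effective period named on the temporal delete, plus a stand-in for Now().
params = {
    "parent_oid": 1,
    "del_eff_beg": "2010-03-01",
    "del_eff_end": "2010-08-01",
    "now": "2011-01-01",
}
dependents = [row[0] for row in conn.execute("""
    SELECT claim_oid
      FROM claim_av
     WHERE parent_oid = :parent_oid
       AND eff_beg_dt < :del_eff_end
       AND eff_end_dt > :del_eff_beg
       AND circa_asr_flag = 'Y'
       AND asr_end_dt >= :now
     ORDER BY claim_oid
""", params)]
print(dependents)
```

Only the first sample row qualifies: it shares at least one effective clock tick with the delete range, belongs to the parent being deleted, and is still asserted.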
In an index designed to enhance the performance of the search for TRI parent-child relationships, the first column should be the TFK. This is the oid value that relates a child to a parent.
Temporal referential integrity checks are never concerned with withdrawn assertions, so this is another index in which the circa flag will help performance. If we use this flag, it should be the next column in the index. However, if this is the column that will be used for clustering or partitioning, the circa flag should be listed first, before the oid.
For TRI enforcement, the AVF does not use a simple BETWEEN predicate because it needs to find dependents with any overlapping clock ticks. Instead, it uses an [intersects] predicate.
Two rules used during TRI delete enforcement are that the effective begin date on the episode must be less than the effective end date specified on the delete transaction, and that the effective end date on the episode must be greater than the effective begin date on the transaction.
Earlier, we pointed out that for current data queries, there are usually many more historical rows than current and future rows, and for that reason we made the next column the effective end date rather than the effective begin date. These same considerations hold true for indexes assisting with temporal delete transactions.
Therefore, our recommended index structure for TFK indexes, which can be used both for TRI enforcement by the AVF and for any queries looking for parent object and child object relationships, where the oid mentioned is the TFK value, is either {parent_oid, circa_asr_flag, eff_end_dt, ...} or {parent_oid, eff_end_dt, asr_end_dt, ...}.
Other temporal columns could be added, depending on application-specific uses for the index.
Other Techniques for Performance Tuning Bi-Temporal Tables
In an Asserted Versioning database, most of the activity is row insertion. No rows are physically deleted; and except for the update of the assertion end date when an assertion is withdrawn, or the update of the assertion begin date when far future deferred assertions are moved into the near future, there are no physical updates either. On the other hand, there are plenty of reads, usually to current data. We need to consider these types of access, and their relative frequencies, when we decide which optimization techniques to use.
Avoiding MAX(dt) Predicates
Even if Asserted Versioning did not support logical gap versioning, we would keep both effective end dates and assertion end dates in the Asserted Versioning bi-temporal schema. The reason is that, without them, most accesses to these tables would require finding the MAX(dt) of the designated object in assertion time, or in effective time within a specified period of assertion time. The performance problem with a MAX(dt) is that it needs to be evaluated for each row that is looked at, causing performance to degrade steeply as the number of rows reviewed grows. Experience with the AVF and our Asserted Versioning databases has shown us that eliminating MAX(dt) subqueries, and having effective and assertion end dates on asserted version tables, dramatically improves performance.
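To make the contrast concrete, here is a small sketch using Python's built-in sqlite3 module. The table cust_ver and its rows are hypothetical, only effective time is modeled, and the versions of each object are contiguous (no gaps), which is the case in which the two queries are interchangeable. It compares results only; it does not measure performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cust_ver (
        oid        INTEGER NOT NULL,
        eff_beg_dt TEXT    NOT NULL,
        eff_end_dt TEXT    NOT NULL,
        data       TEXT    NOT NULL
    )
""")
conn.executemany("INSERT INTO cust_ver VALUES (?, ?, ?, ?)", [
    (1, "2009-01-01", "2010-01-01", "v1"),
    (1, "2010-01-01", "2011-01-01", "v2"),
    (1, "2011-01-01", "9999-12-31", "v3"),
    (2, "2010-06-01", "9999-12-31", "w1"),
])
as_of = "2010-08-15"

# Without end dates: a correlated MAX subquery, re-evaluated for each
# candidate row, finds the latest version begun on or before the as-of date.
with_max = conn.execute("""
    SELECT oid, data
      FROM cust_ver v
     WHERE eff_beg_dt = (SELECT MAX(m.eff_beg_dt)
                           FROM cust_ver m
                          WHERE m.oid = v.oid
                            AND m.eff_beg_dt <= ?)
     ORDER BY oid
""", (as_of,)).fetchall()

# With end dates: two simple range predicates and no subquery.
with_end = conn.execute("""
    SELECT oid, data
      FROM cust_ver
     WHERE eff_beg_dt <= ?
       AND eff_end_dt > ?
     ORDER BY oid
""", (as_of, as_of)).fetchall()

print(with_max)
print(with_end)
```

Both queries return the same versions here, but only the second can be answered directly from an index on the end date.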
NULL vs. 12/31/9999
Some readers might wonder why we do not use nulls to stand in for unknown future dates, whether effective end dates or assertion end dates. From a logical point of view, NULL, which is a marker representing the absence of information, is what we should use in these date columns whenever we do not know what those future dates will be.
But experience with the AVF and with Asserted Versioning databases has shown that using real dates rather than nulls helps the optimizer to consistently choose better, more efficient access paths, and matches on index keys more directly.
Without using NULL, the predicate to find versions that are still in effect is:
eff_end_dt > Now()
Using NULL, the semantically identical predicate is:
(eff_end_dt > Now() OR eff_end_dt IS NULL)
The OR in the second example causes the optimizer to try one path and then another. It might use index look-aside, or it might scan. Either of these is less efficient than a single GREATER THAN comparison.
Another considered approach is to coalesce NULL and the latest
date recognizable by the DBMS, giving us the following predicate:
COALESCE(eff_end_dt, ‘12/31/9999’) > Now()
But functions normally cannot be resolved in standard indexes, and so the COALESCE function will normally cause a scan. Worse yet, some DBMSs will not resolve functions until all data is read and joined. So frequently, a lot of extra data will be assembled into a preliminary result set before this COALESCE function is ever applied.
The last of our three options is a simple range predicate (such as GREATER THAN) without an OR, and without a function. If the end date is unknown, and the value we use to represent that unknown condition is the highest date (or timestamp) which the DBMS can recognize, then this simple range predicate will return the same results as the other two predicates. And given that the highest date a DBMS can recognize is likely to be far into the future, it is unlikely that business applications will ever need to use that date to represent that far-off time. In SQL Server, for example, that highest date is 12/31/9999. So as long as our business applications do not need to designate that specific New Year’s Eve nearly 8000 years from now, we are free to use it to represent the fact that a value is unknown. Using it, we can use the simple range predicate shown earlier in this section, and reap the benefits of the excellent performance of that kind of predicate.
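The claim that all three predicates return the same rows can be checked with a short sketch using Python's built-in sqlite3 module. The tables ver_null and ver_high, the helper ids and the sample dates are hypothetical. Because SQLite stores dates as text, the high date is written in ISO-8601 form as 9999-12-31 rather than 12/31/9999. The sketch compares results only, not access paths.

```python
import sqlite3

HIGH_DATE = "9999-12-31"  # highest date, standing in for "unknown"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ver_null (id INTEGER, eff_end_dt TEXT)")           # NULL means unknown
conn.execute("CREATE TABLE ver_high (id INTEGER, eff_end_dt TEXT NOT NULL)")  # high date means unknown

end_dates = [(1, "2010-01-01"), (2, "2030-01-01"), (3, None)]
conn.executemany("INSERT INTO ver_null VALUES (?, ?)", end_dates)
conn.executemany(
    "INSERT INTO ver_high VALUES (?, ?)",
    [(i, d if d is not None else HIGH_DATE) for i, d in end_dates],
)

now = "2020-06-30"  # stand-in for Now()

def ids(sql, params):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql, params)]

# Option 1: range predicate plus OR ... IS NULL.
with_or = ids(
    "SELECT id FROM ver_null WHERE (eff_end_dt > ? OR eff_end_dt IS NULL) ORDER BY id",
    (now,),
)
# Option 2: COALESCE the NULL to the high date inside the predicate.
with_coalesce = ids(
    "SELECT id FROM ver_null WHERE COALESCE(eff_end_dt, ?) > ? ORDER BY id",
    (HIGH_DATE, now),
)
# Option 3: store the high date and use a plain range predicate.
simple_range = ids(
    "SELECT id FROM ver_high WHERE eff_end_dt > ? ORDER BY id",
    (now,),
)
print(with_or, with_coalesce, simple_range)
```

The third form gets the same answer with a single comparison on the stored column, which is the property that lets an optimizer match it directly against an index key.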
Partitioning
Another technique that can help with performance and database maintenance, such as backups, recoveries and reorganizations, is partitioning. There are several basic approaches to partitioning. One is to partition by a date, or something similar, so that the more current and active data is grouped together, and is more likely to be found in cache. This is a common partitioning strategy for on-line transaction processing systems.
Another is to partition by some known field that could keep commonly accessed smaller groups of data together, such as a low cardinality foreign key. The benefit of this approach is that it directs a search to a small well-focused collection of data located on the same or on adjacent I/O pages. This strategy improves performance by taking advantage of sequential prefetch algorithms.
A third approach is to partition by some random field to take advantage of the parallelism in data access that some DBMSs support. For these DBMSs, the partitions define parallel access paths. This is a good strategy for applications such as reporting and business intelligence (BI) where typically large scans could benefit from the parallel processing made possible by the partitioning.
Some DBMSs require that the partitioning index also be the clustering index. This limits options because it forces a trade-off between optimizing for sequential prefetch and optimizing for parallel access. Fortunately, DBMS vendors are starting to separate the implementation of these two requirements.
Another limitation of some DBMSs, but one that is gradually being addressed by their vendors, is that whenever a row is moved between partitions, both partitions are locked in their entirety. This forces application developers to design their processes so that they never update a partitioning key value on a row during prime time, because doing so locks the source and destination partitions until the move is complete. As we noted, more recent releases of DBMSs reduce the locking required to move a row from one partition to another.
A good partitioning strategy for an Asserted Versioning database is to partition by one of the temporal columns, such as the assertion end date, in order to keep the most frequently accessed data in cache. As we have pointed out, that will normally be currently asserted current versions of the objects of interest to the query.
For an optimizer to know which partition(s) to access, it needs to know the high order of the key. For direct access to the other search criteria, it needs direct access to the higher nodes in the key, higher than the search key. Therefore, while one of the temporal dates is good for partitioning, it reduces the effectiveness of other search criteria. To avoid this problem, we might want to define two indexes, one for partitioning, and another for searching.
The better solution for defining partitions that optimize access to currently asserted versions is to use the circa flag as the first column in the partitioning index. The best predicate would be {circa_asr_flag = 'Y'} for current assertions. For DBMSs which support index-look-aside processing for IN predicates, the best predicate might be {circa_asr_flag IN ('Y', 'N')} when it is uncertain if the version is currently asserted. With this predicate, the index can support searches for past assertions as well as searches for current ones. Otherwise, it will require a separate index to support searches for past assertions.
Clustering
Clustering and partitioning often go together, depending on the reason for partitioning and the way in which specific DBMSs support it. Whether or not partitioning is used, choosing the best clustering sequence can dramatically reduce I/O and improve performance.
The general concept behind clustering is that as the database is modified, the DBMS will attempt to keep the data on physical pages in the same order as that specified in the clustering index. But each DBMS does this a little differently. One DBMS will cluster each time an insert or update is processed. Another will make a valiant attempt to do that. A third will only cluster when the table is reorganized. But regardless of the approach, the result is to reduce physical I/O by locating data that is frequently accessed together as physically close together as possible.
Early DBMSs only allowed one clustering index, but newer releases often support multiple clustering sequences, sometimes called indexed views or multi-dimensional clustering.
It is important to determine the most frequently used access paths to the data. Often the most frequently used access paths are ones based on one or more foreign keys. For asserted version tables, currently asserted current versions are usually the most frequently queried data.
Sometimes, the right combination of foreign keys can provide good clustering for more than one access path. For example, suppose that a policy table has two low cardinality TFKs, product type and market segment, and that each TFK value has thousands of related policies.3 We might then create this clustering index:
{circa_asr_flag, product_type_oid, market_segment_oid,
eff_end_dt, policy_oid}
The circa flag would cluster most of the currently asserted rows together, keeping them physically co-located under the lower cardinality columns. Clustering would continue based on
3 Low cardinality means that there are fewer distinct values for the field in the table
which results in more rows having a single value.