Document: Managing Time in Relational Databases - P3



Whenever we can specify the semantics of what we need, without having to specify the steps required to fulfill our requests, those requests are satisfied at lower cost, in less time, and more reliably. SCDs stand on the wrong side of that what vs. how divide.

Some IT professionals refer to a type 1.5 SCD. Others describe types 0, 4, 5 and 6. Suffice it to say that none of these variations overcome these two fundamental limitations of SCDs. SCDs do have their place, of course. They are one tool in the data manager’s toolkit. Our point here is, first of all, that they are not bi-temporal. In addition, even for accessing uni-temporal data, SCDs are cumbersome and costly. They can, and should, be replaced by a declarative way of requesting what data is needed without having to provide explicit directions to that data.

Real-Time Data Warehouses

As for the third of these developments, it muddles the data warehousing paradigm by blurring the line between regular, periodic snapshots of tables or entire databases, and irregular as-needed before-images of rows about to be changed. There is value in the regularity of periodic snapshots, just as there is value in the regular mileposts along interstate highways. Before-images of individual rows, taken just before they are updated, violate this regular snapshot paradigm and, while not destroying it, certainly erode the milepost value of regular snapshots.

On the other hand, periodic snapshots fail to capture changes that are overwritten by later changes, and also fail to capture inserts that are cancelled by deletes, and vice versa, when these actions all take place between one snapshot and the next. As-needed row-level warehousing (real-time warehousing) will capture all of these database modifications.
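The difference between the two capture styles can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a warehouse design: the function name `replay`, the shape of the event tuples and the snapshot interval are all hypothetical and chosen only for the example.

```python
def replay(events, snapshot_every):
    """Apply (op, key, value) events in order, capturing history two ways."""
    table = {}            # the current state of a toy "table"
    snapshots = []        # periodic copies of the whole table
    before_images = []    # row-level images taken just before a change

    for n, (op, key, value) in enumerate(events, start=1):
        if op in ("update", "delete") and key in table:
            before_images.append((key, table[key]))   # as-needed capture
        if op == "delete":
            table.pop(key, None)
        else:  # "insert" or "update"
            table[key] = value
        if n % snapshot_every == 0:
            snapshots.append(dict(table))             # regular milepost

    return snapshots, before_images


events = [
    ("insert", "A", 1),
    ("update", "A", 2),
    ("update", "A", 3),      # overwrites the value 2 before any snapshot
    ("insert", "B", 9),
    ("delete", "B", None),   # B is inserted and deleted between snapshots
    ("update", "A", 4),
]
snapshots, before_images = replay(events, snapshot_every=3)
```

With a snapshot after every third event, the snapshots never show the values 1 or 2 for A, and never show B at all, while the before-images record each of them. The before-images, in turn, arrive at irregular moments, which is the loss of milepost regularity described above.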

Both kinds of historical data have value when collected and managed properly. But what we actually have, in all too many historical data warehouses today, is an ill-understood and thus poorly managed mish-mash of the two kinds of historical data. As a result, these warehouses provide the best of neither world.

The Future of Databases: Seamless Access to Temporal Data

Let’s say that this brief history has shown a progression in making temporal data “readily available”. But what does “readily available” really mean, with respect to temporal data?

20 Chapter 1 A BRIEF HISTORY OF TEMPORAL DATA MANAGEMENT


One thing it might mean is “more available than by using backups and logfiles”. And the most salient feature of the advance from backups and logfiles to these other methods of managing historical data is that backups and logfiles require the intervention of IT Operations to restore desired data from off-line media, while history tables, warehouses and data marts do not. When IT Operations has to get involved, emails and phone calls fly back and forth. The Operations manager complains that his personnel are already overloaded with the work of keeping production systems running, and don’t have time for these one-off requests, especially as those requests are being made more and more frequently.

What is going on is that the job of Operations, as its management sees it, is to run the IT production schedule and to complete that scheduled work on time. Anything else is extra. Anything else is outside what their annual reviews, salary increases and bonuses are based on.

And so it is frequently necessary to bump the issue up a level, and for Directors or even VPs within IT to talk to one another. Finally, when Operations at last agrees to restore a backup and apply a logfile (and do the clean-up work afterwards, the manager is sure to point out), it is often a few days or a few weeks after the business use for that data led to the request being made in the first place. Soon enough, data consumers learn what a headache it is to get access to backed-up historical data. They learn how long it takes to get the data, and so learn to do a quick mental calculation to figure out whether or not the answer they need is likely to be available quickly enough to check out a hunch about next year’s optimum product mix before production schedules are finalized, or support a position they took in a meeting which someone else has challenged. They learn, in short, to do without a lot of the data they need, to not even bother asking for it.

But instead of the comparative objective of making temporal data “more available” than it is, given some other way of managing it, let’s formulate the absolute objective for availability of temporal data. It is, simply, for temporal data to be as quickly and easily accessible as it needs to be. We will call this the requirement to have seamless, real-time access to what we once believed, currently believe, or may come to believe is true about what things of interest to us were like, are like, or may come to be like in the future.

This requirement has two parts. First, it means access to non-current states of persistent objects which is just as available to the data consumer as is access to current states. The temporal data must be available on-line, just as current data is. Transactions to maintain temporal data must be as easy to write as are transactions to maintain current data. Queries to retrieve temporal data, or a combination of temporal and current data, must be as easy to write as are queries to retrieve current data only. This is the usability aspect of seamless access.

Second, it means that queries which return temporal data, or a mix of temporal and current data, must return equivalent-sized results in an equivalent amount of elapsed time. This is the performance aspect of seamless access.

Closing In on Seamless Access

Throughout the history of computerized data management, file access methods (e.g. VSAM) and database management systems (DBMSs) have been designed and deployed to manage current data. All of them have a structure for representing types of objects, a structure for representing instances of those types, and a structure for representing properties and relationships of those instances. But none of them have structures for representing objects as they exist within periods of time, let alone structures for representing objects as they exist within two periods of time.

The earliest DBMSs supported sequential (one-to-one) and hierarchical (one-to-many) relationships among types and instances, and the main example was IBM’s IMS. Later systems more directly supported network (many-to-many) relationships than did IMS. Important examples were Cincom’s TOTAL, ADR’s DataCom, and Cullinet’s IDMS (the latter two now Computer Associates’ products).

Later, beginning with IBM’s System R, and Dr. Michael Stonebraker’s Ingres, Dr. Ted Codd’s relational paradigm for data management began to be deployed. Relational DBMSs could do everything that network DBMSs could do, but less well understood is the fact that they could also do nothing more than network DBMSs could do. Relational DBMSs prevailed over CODASYL network DBMSs because they simplified the work required to maintain and access data by supporting declaratively specified set-at-a-time operations rather than procedurally specified record-at-a-time operations.

Those record-at-a-time operations work like this. Network DBMSs require us to retrieve or update multiple rows in tables by coding a loop. In doing so, we are writing a procedure; we are telling the computer how to retrieve the rows we are interested in. So we wrote these loops, and retrieved (or updated) one row at a time. Sometimes we wrote code that produced infinite loops when confronted with unanticipated combinations of data. Sometimes we wrote code that contained “off by one” errors. But SQL, issued against relational databases, allows us to simply specify what results we want, e.g. to say that we want all rows where the customer status is XYZ. Using SQL, there are no infinite loops, and there are no off-by-one errors.
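The contrast between the two styles can be made concrete with a small sketch. It uses Python's standard sqlite3 module and a hypothetical `customer` table; the table, its columns and the sample rows are assumptions made for illustration, and the Python loop only stands in for the record-at-a-time navigation a network DBMS would require.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO customer (id, name, status) VALUES (?, ?, ?)",
    [(1, "Acme", "XYZ"), (2, "Baker", "ABC"), (3, "Crane", "XYZ")],
)


def procedural_select(rows, status):
    """Record-at-a-time: we spell out HOW to walk the records."""
    result = []
    i = 0
    while i < len(rows):        # a wrong bound here is an off-by-one error
        if rows[i][2] == status:
            result.append(rows[i])
        i += 1                  # omitting this line gives an infinite loop
    return result


all_rows = conn.execute(
    "SELECT id, name, status FROM customer ORDER BY id"
).fetchall()
looped = procedural_select(all_rows, "XYZ")

# Set-at-a-time: we state WHAT we want and leave the how to the DBMS.
declared = conn.execute(
    "SELECT id, name, status FROM customer WHERE status = ? ORDER BY id",
    ("XYZ",),
).fetchall()
```

Both approaches return the same two rows. The difference is that the loop contains two places where a slip produces a wrong or non-terminating program, while the declarative request has none.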

For the most part, today’s databases are still specialized for managing current data, data that tells us what we currently believe things are currently like. Everything else is an exception. Nonetheless, we can make historical data accessible to queries by organizing it into specialized databases, or into specialized tables within databases, or even into specialized rows within tables that also contain current data.

But each of these ways of accommodating historical data requires extra work on the part of IT personnel. Each of these ways of accommodating historical data goes beyond the basic paradigm of one table for every type of object, and one row for every instance of a type. And so DBMSs don’t come with built-in support for these structures that contain historical data. We developers have to design, deploy and manage these structures ourselves. In addition, we must design, deploy and manage the code that maintains historical data, because this code goes beyond the basic paradigm of inserting a row for a new object, retrieving, updating and rewriting a row for an object that has changed, and deleting a row for an object no longer of interest to us.

We developers must also design, deploy and maintain code to simplify the retrieval of instances of historical data. SQL, and the various reporting and querying tools that generate it, supports the basic paradigm used to access current data. This is the paradigm of choosing one or more rows from a target table by specifying selection criteria, projecting one or more columns by listing the columns to be included in the query’s result set, and joining from one table to another by specifying match or other qualifying criteria from selected rows to other rows.

When different rows represent objects at different periods of time, transactions to insert, update and delete data must specify not just the object, but also the period of time of interest. When different rows represent different statements about what was true about the same object at a specified period of time, those transactions must specify two periods of time in addition to the object.
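To give a sense of the extra work involved, here is a minimal sketch of a hand-written update against a uni-temporal version table. Everything in it is an assumption made for illustration: the `customer_version` table, its column names, the half-open date periods, the `9999-12-31` end marker and the `temporal_update` helper. It is not the mechanism of any particular product or method.

```python
import sqlite3

OPEN_END = "9999-12-31"  # hypothetical marker meaning "until further notice"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_version ("
    " customer_id TEXT, eff_begin TEXT, eff_end TEXT, status TEXT)"
)
conn.execute(
    "INSERT INTO customer_version VALUES ('C1', '2010-01-01', ?, 'ABC')",
    (OPEN_END,),
)


def temporal_update(conn, customer_id, new_status, effective_from):
    """Change status from effective_from onward, keeping the earlier version."""
    with conn:  # one transaction: both statements commit or neither does
        old = conn.execute(
            "SELECT eff_end FROM customer_version"
            " WHERE customer_id = ? AND eff_begin < ? AND ? < eff_end",
            (customer_id, effective_from, effective_from),
        ).fetchone()
        if old is None:
            raise ValueError("no version spans that date")
        # Shorten the existing version so that it ends where the new one begins.
        conn.execute(
            "UPDATE customer_version SET eff_end = ?"
            " WHERE customer_id = ? AND eff_begin < ? AND ? < eff_end",
            (effective_from, customer_id, effective_from, effective_from),
        )
        # Add the new version, covering the rest of the original period.
        conn.execute(
            "INSERT INTO customer_version VALUES (?, ?, ?, ?)",
            (customer_id, effective_from, old[0], new_status),
        )


temporal_update(conn, "C1", "XYZ", "2010-06-01")
versions = conn.execute(
    "SELECT eff_begin, eff_end, status FROM customer_version ORDER BY eff_begin"
).fetchall()
```

A single logical update has become a lookup, an update and an insert, all of which the developer must write and keep consistent. A bi-temporal table would add a second pair of period columns and correspondingly more logic.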

Queries also become more complex. When different rows represent objects at different points in time, queries must specify not just the object, but also the point in time of interest. When different rows represent different statements about what was true about the same object at the same point in time, queries must specify two points in time in addition to the criteria which designate the object or objects of interest.
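A query that names two points in time might look like the following sketch. The `policy_version` table, its column names (`eff_*` for the period in which a version is in effect, `asr_*` for the period in which the statement is asserted), the sample rows and the `copay_as_of` helper are all hypothetical, and periods are assumed to be half-open ISO date ranges.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE policy_version ("
    " policy_id TEXT,"
    " eff_begin TEXT, eff_end TEXT,"   # when the version is in effect
    " asr_begin TEXT, asr_end TEXT,"   # when we asserted this statement
    " copay INTEGER)"
)
conn.executemany(
    "INSERT INTO policy_version VALUES (?, ?, ?, ?, ?, ?)",
    [
        # What we first said about policy P1 ...
        ("P1", "2010-01-01", "9999-12-31", "2010-01-01", "2010-03-01", 20),
        # ... and the correction we asserted from March onward.
        ("P1", "2010-01-01", "9999-12-31", "2010-03-01", "9999-12-31", 25),
    ],
)


def copay_as_of(conn, policy_id, effective_on, asserted_on):
    """Return the copay in effect on one date, as asserted on another."""
    row = conn.execute(
        "SELECT copay FROM policy_version"
        " WHERE policy_id = ?"
        "   AND eff_begin <= ? AND ? < eff_end"
        "   AND asr_begin <= ? AND ? < asr_end",
        (policy_id, effective_on, effective_on, asserted_on, asserted_on),
    ).fetchone()
    return None if row is None else row[0]


believed_in_february = copay_as_of(conn, "P1", "2010-02-01", "2010-02-15")
believed_in_april = copay_as_of(conn, "P1", "2010-02-01", "2010-04-01")
```

Both calls ask about the same object on the same effective date. They differ only in the second point in time, and so return what was believed in February and what was believed in April, respectively.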

We believe that the relational model, with its supporting theory and technology, is now in much the same position that the CODASYL network model, with its supporting theory and technology, was three decades ago. It is in the same position, in the following sense.

Relational DBMSs were never able to do anything with data that network DBMSs could not do. Both supported sequential, hierarchical and network relationships among instances of types of data. The difference was in how much work was required on the part of IT personnel and end users to maintain and access the managed data.

And now we have the relational model, a model invented by Dr. E. F. Codd. An underlying assumption of the relational model is that it deals with current data only. But temporal data can be managed with relational technology. Dr. Snodgrass’s book describes how current relational technology can be adapted to handle temporal data, and indeed to handle data along two orthogonal temporal dimensions. But in the process of doing so, it also shows how difficult it is to do.

In today’s world, the assumption is that DBMSs manage current data. But we are moving into a world in which DBMSs will be called on to manage data which describes the past, present or future states of objects, and the past, present or future assertions made about those states. Of this two-dimensional temporalization of data describing what we believe about how things are in the world, currently true and currently asserted data will always be the default state of data managed in a database and retrieved from it. But overrides to those defaults should be specifiable declaratively, simply by specifying points in time other than right now for versions of objects and also for assertions about those versions.

Asserted Versioning provides seamless, real-time access to bi-temporal data, and provides mechanisms which support the declarative specification of bi-temporal parameters on both maintenance transactions and on queries against bi-temporal data.

Glossary References

Glossary entries whose definitions form strong inter-dependencies are grouped together in the following list. The same glossary entries may be grouped together in different ways at the end of different chapters, each grouping reflecting the semantic perspective of each chapter. There will usually be several other, and often many other, glossary entries that are not included in the list, and we recommend that the Glossary be consulted whenever an unfamiliar term is encountered.

effective time

valid time

event

state

external pipeline dataset, history table

transaction table

version table

instance

type

object

persistent object

thing

seamless access

seamless access, performance aspect

seamless access, usability aspect


Chapter 2 A TAXONOMY OF BI-TEMPORAL DATA MANAGEMENT METHODS

CONTENTS

Taxonomies

Partitioned Semantic Trees

Jointly Exhaustive

Mutually Exclusive

A Taxonomy of Methods for Managing Temporal Data

The Root Node of the Taxonomy

Queryable Temporal Data: Events and States

State Temporal Data: Uni-Temporal and Bi-Temporal Data

Glossary References

In Chapter 1, we presented an historical account of various ways that temporal data has been managed with computers. In this chapter, we will develop a taxonomy, and situate those methods described in Chapter 1, as well as several variations on them, in this taxonomy.

A taxonomy is a special kind of hierarchy. It is a hierarchy which is a partitioning of the instances of its highest-level node into different kinds, types or classes of things. While an historical approach tells us how things came to be, and how they evolved over time, a taxonomic approach tells us what kinds of things we have come up with, and what their similarities and differences are. In both cases, i.e. in the previous chapter and in this one, the purpose is to provide the background for our later discussions of temporal data management and, in particular, of how Asserted Versioning supports non-temporal, uni-temporal and bi-temporal data by means of physical bi-temporal tables.1

1 Because Asserted Versioning directly manages bi-temporal tables, and supports uni-temporal tables as views on bi-temporal tables, we sometimes refer to it as a method of bi-temporal data management and at other times refer to it as a method of temporal data management. The difference in terminology, then, reflects simply a difference in emphasis which may vary depending on context.

Managing Time in Relational Databases. DOI: 10.1016/B978-0-12-375041-9.00002-9

Copyright © 2010 Elsevier Inc. All rights of reproduction in any form reserved.


Taxonomies

Originally, the word “taxonomy” referred to a method of classification used in biology, and introduced into that science in the 18th century by Carl Linnaeus. Taxonomy in biology began as a system of classification based on morphological similarities and differences among groups of living things. But with the modern synthesis of Darwinian evolutionary theory, Mendelian genetics, and the Watson–Crick discovery of the molecular basis of life and its foundations in the chemistry of DNA, biological taxonomy has, for the most part, become a system of classification based on common genetic ancestry.

Partitioned Semantic Trees

As borrowed by computer scientists, the term “taxonomy” refers to a partitioned semantic tree. A tree structure is a hierarchy, which is a set of non-looping (acyclic) one-to-many relationships. In each relationship, the item on the “one” side is called the parent item in the relationship, and the one or more items on the “many” side are called the child items. The items that are related are often called nodes of the hierarchy. Continuing the arboreal metaphor, a tree consists of one root node (usually shown at the top of the structure, and not, as the metaphor would lead one to expect, at the bottom), zero or more branch nodes, and zero or more leaf nodes on each branch. This terminology is illustrated in Figure 2.1.

Tree structure. Each taxonomy is a hierarchy. Therefore, except for the root node, every node has exactly one parent node. Except for the leaf nodes, unless the hierarchy consists of the root node only, every node has at least one child node. Each node except the root node has as ancestors all the nodes from its direct parent node, in linear ascent from child to parent, up to and including the root node. No node can be a parent to any of its ancestors.

[Figure 2.1 An Illustrative Taxonomy: a tree with Party as the root node, Organization and Person beneath it, Organization as a branch node, and leaf nodes at the bottom.]
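These structural rules are easy to state in code. The sketch below stores the Party hierarchy discussed in this chapter as child-to-parent links; the `PARENT` dictionary and the helper names `ancestors` and `leaves` are assumptions made for illustration. Because a dictionary key can hold only one value, the rule that every non-root node has exactly one parent is enforced by the representation itself.

```python
# Child -> parent links for the chapter's illustrative Party hierarchy.
PARENT = {
    "Person": "Party",
    "Organization": "Party",
    "Supplier": "Organization",
    "Self": "Organization",
    "Customer": "Organization",
}


def ancestors(node, parent=PARENT):
    """Return the nodes from the direct parent up to and including the root."""
    chain = []
    while node in parent:
        node = parent[node]
        if node in chain:
            # A node that is a parent of one of its own ancestors forms a loop.
            raise ValueError("cycle detected: not a hierarchy")
        chain.append(node)
    return chain


def leaves(parent=PARENT):
    """Return the nodes that are not the parent of any other node."""
    nodes = set(parent) | set(parent.values())
    return sorted(nodes - set(parent.values()))
```

For example, `ancestors("Supplier")` climbs through Organization to Party, and the root node has no ancestors at all.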

Partitioned. The child nodes under a given parent node are jointly exhaustive and mutually exclusive. Being jointly exhaustive means that every instance of a parent node is also an instance of one of its child nodes. Being mutually exclusive means that no instance of a parent node is an instance of more than one of its child nodes. A corollary is that every instance of the root node is also an instance of one and only one leaf node.
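Both conditions can be checked mechanically once the instances of each node are known. The following sketch assumes instances are held in plain Python sets; the function name `check_partition` and the sample organizations (Acme, Bolt, Us) are hypothetical.

```python
def check_partition(parent_instances, children):
    """Check the two partitioning rules for one parent node.

    children maps each child node name to the set of its instances.
    Returns (missing, overlapping): instances of the parent found in no
    child, and instances found in more than one child.
    """
    covered = set().union(*children.values())
    missing = set(parent_instances) - covered     # breaks joint exhaustiveness

    seen, overlapping = set(), set()
    for members in children.values():
        overlapping |= seen & members             # breaks mutual exclusivity
        seen |= members
    return missing, overlapping


organizations = {"Acme", "Bolt", "Us"}
children = {
    "Supplier": {"Acme"},
    "Self": {"Us"},
    "Customer": {"Acme", "Bolt"},   # Acme is both a supplier and a customer
}
missing, overlapping = check_partition(organizations, children)
```

A set of child nodes partitions its parent only when both returned sets are empty. In this sample data nothing is missing, but Acme appears under two child nodes, so the children are jointly exhaustive without being mutually exclusive.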

Semantic. The relationships between nodes are often called links. The links between nodes, and between instances and nodes, are based on the meaning of those nodes. Conventionally, node-to-node relationships are called KIND-OF links, because each child node can be said to be a kind of its parent node. In our illustrative taxonomy, shown in Figure 2.1, for example, Supplier is a kind of Organization.

A leaf node, and only a leaf node, can be the direct parent of an instance. Instances are individual things of the type indicated by that node. The relationship between individuals and the (leaf and non-leaf) nodes they are instances of is called an IS-A relationship, because each instance is an instance of its node. Our company may have a supplier, let us suppose, called the Acme Company. In our illustrative taxonomy shown in Figure 2.1, therefore, Acme is a direct instance of a Supplier, and indirectly an instance of an Organization and of a Party. In ordinary conversation, we usually drop the “instance of” phrase, and would say simply that Acme is a supplier, an organization and a party.

Among IT professionals, taxonomies have been used in data models for many years. They are the exclusive subtype hierarchies defined in logical data models, and the (single-inheritance) class hierarchies defined in object-oriented models. An example familiar to most data modelers is the entity Party. Under it are the two entities Person and Organization. The business rule for this two-level hierarchy is: every party is either a person or an organization (but not both). This hierarchy could be extended for as many levels as are useful for a specific modeling requirement. For example, Organization might be partitioned into Supplier, Self and Customer. This particular taxonomy is shown in Figure 2.1.

We note that most data modelers, on the assumption that this taxonomy would be implemented as a subtype hierarchy in a logical data model, will recognize right away that it is not a very good taxonomy. For one thing, it says that persons are not customers. But many companies do sell their goods or services to people; so for them, this is a bad taxonomy. Either the label “customer” is being used in a specialized (and misleading) way, or else the taxonomy is simply wrong.

A related mistake is that, for most companies, Supplier, Self and Customer are not mutually exclusive. For example, many companies sell their goods or services to other companies who are also suppliers to them. If this is the case, then this hierarchy is not a taxonomy, because an instance (a company that is both a supplier and a customer) belongs to more than one leaf node. As a data modeling subtype hierarchy, it is a non-exclusive hierarchy, not an exclusive one.

This specific taxonomy has nothing to do with temporal data management; but it does give us an opportunity to make an important point that most data modelers will understand. That point is that even very bad data models can be, and often are, put into production. And when that happens, the price that is paid is confusion: confusion about what the entities of the model really represent and thus where data about something of interest can be found within the database, what sums and averages over a given entity really mean, and so on.

In this case, for example, some organizations may be represented by a row in only one of these three tables, but other organizations may be represented by rows in two or even in all three of them. Queries which extract statistics from this hierarchy must now be written very carefully, to avoid the possibility of double- or triple-counting organizational metrics.
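The double-counting risk is easy to reproduce. The sketch below is a simplified illustration using Python's sqlite3 module: it uses only two of the three subtype tables, and the table layouts, the `employees` metric and the sample rows are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE organization (org_id TEXT PRIMARY KEY, employees INTEGER);
    CREATE TABLE supplier (org_id TEXT);
    CREATE TABLE customer (org_id TEXT);
    INSERT INTO organization VALUES ('Acme', 100), ('Bolt', 50);
    INSERT INTO supplier VALUES ('Acme');
    INSERT INTO customer VALUES ('Acme'), ('Bolt');
    """
)

# Naive: UNION ALL keeps Acme twice, once per subtype table it appears in.
naive_total = conn.execute(
    """
    SELECT SUM(o.employees)
    FROM (SELECT org_id FROM supplier
          UNION ALL
          SELECT org_id FROM customer) AS s
    JOIN organization AS o ON o.org_id = s.org_id
    """
).fetchone()[0]

# Careful: UNION removes duplicates, so each organization counts once.
careful_total = conn.execute(
    """
    SELECT SUM(o.employees)
    FROM (SELECT org_id FROM supplier
          UNION
          SELECT org_id FROM customer) AS s
    JOIN organization AS o ON o.org_id = s.org_id
    """
).fetchone()[0]
```

The naive query counts Acme's employees twice because Acme is both a supplier and a customer. The two queries differ by a single keyword, which is exactly the kind of detail that is easy to get wrong when the hierarchy is non-exclusive.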

As well as all this, the company may quite reasonably want to keep a row in the Customer table for every customer, whether it be an organization or a person. This requires an even more confusing use of the taxonomy, because while an organization might be represented multiple times in this taxonomy, at least it is still possible to find additional information about organizational customers in the parent node. But this is not possible when those customers are persons.

So the data modeler will want to modify the hierarchy so that persons can be included as customers. There are various ways to do this, but if the hierarchy is already populated and in use, none of them are likely to be implemented. The cost is just too high. Queries and code, and the result sets and reports based on them, have already been written, and are already in production. If the hierarchy is modified, all those queries and all that code will have to be modified. The path of least resistance is an unfortunate one. It is to leave the whole mess alone,
