If the head of a rule contains constants, or if a variable is repeated in the rule head, this can easily be rectified: a constant c is replaced by a variable X, and a predicate equal(X, c) is added to the rule body. Similarly, if a variable Y appears twice in a rule head, one of those occurrences is replaced by another variable Z, and a predicate equal(Y, Z) is added to the rule body.
The evaluation of a nonrecursive query can be expressed as a tree whose leaves are the base relations. What is needed is appropriate application of the relational operations of SELECT, PROJECT, and JOIN, together with the set operations of UNION and SET DIFFERENCE, until the predicate in the query gets evaluated. An outline of an inference algorithm GET_EXPR(Q) that generates a relational expression for computing the result of a Datalog query Q = p(arg1, arg2, ..., argn) can informally be stated as follows:
1. Locate all rules S whose head involves the predicate p. If there are no such rules, then p is a fact-defined predicate corresponding to some database relation Rp; in this case, one of the following expressions is returned and the algorithm terminates (we use the notation $i to refer to the name of the ith attribute of relation Rp):
a. If all arguments are distinct variables, the relational expression returned is Rp.
b. If some arguments are constants or if the same variable appears in more than one argument position, the expression returned is SELECT<condition>(Rp), where the selection <condition> is a conjunctive condition made up of a number of simple conditions connected by AND, and constructed as follows:
i. If a constant c appears as argument i, include a simple condition ($i = c) in the conjunction.
ii. If the same variable appears in both argument positions j and k, include a condition ($j = $k) in the conjunction.
c. For an argument that is not present in any predicate, a unary relation containing values that satisfy all conditions is constructed. Since the rule is assumed to be safe, this unary relation must be finite.
2. At this point, one or more rules Si, i = 1, 2, ..., n, n > 0, exist with predicate p as their head. For each such rule Si, generate a relational expression as follows:
a. Apply selection operations on the predicates in the RHS of each such rule, as discussed in Step 1.
b. A natural join is constructed among the relations that correspond to the predicates in the body of the rule Si over the common variables. For arguments that gave rise to the unary relations in Step 1(c), the corresponding relations are brought as members into the natural join. Let the resulting relation from this join be Rs.
c. If any built-in predicate X θ Y was defined over the arguments X and Y, the result of the join is subjected to an additional selection: SELECT X θ Y (Rs).
d. Repeat Step 2(c) until no more built-in predicates apply.
3. Take the UNION of the expressions generated in Step 2 (if more than one rule exists with predicate p as its head).
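To make these steps concrete, the following small Python sketch (not the book's GET_EXPR algorithm itself, and with invented relation and sample data) evaluates a single nonrecursive rule by joining the body predicates over their shared variables, applying the constant and repeated-variable conditions of Step 1, and projecting onto the head arguments; with several rules for the same head predicate, Step 3 would simply union their results.

# A minimal illustrative sketch, not the book's GET_EXPR itself: evaluate one
# nonrecursive Datalog rule bottom-up by joining the body predicates over
# shared variables and projecting onto the head arguments.
def eval_rule(head_args, body, edb):
    """body is a list of (predicate_name, argument_list); arguments are
    variable names (strings starting with an uppercase letter) or constants."""
    def is_var(a):
        return isinstance(a, str) and a[:1].isupper()

    bindings = [{}]                      # start with one empty variable binding
    for pred, args in body:              # natural join, one body literal at a time
        new_bindings = []
        for b in bindings:
            for tup in edb[pred]:
                b2 = dict(b)
                ok = True
                for a, v in zip(args, tup):
                    if is_var(a):
                        if a in b2 and b2[a] != v:   # repeated variable: $j = $k
                            ok = False
                            break
                        b2[a] = v
                    elif a != v:                     # constant argument: $i = c
                        ok = False
                        break
                if ok:
                    new_bindings.append(b2)
        bindings = new_bindings
    # PROJECT the join result onto the head arguments
    return {tuple(b[a] for a in head_args) for b in bindings}

# Example (hypothetical data): grandparent(X,Y) :- parent(X,Z), parent(Z,Y)
edb = {"parent": {("bert", "alice"), ("alice", "derek"), ("derek", "frank")}}
print(eval_rule(["X", "Y"],
                [("parent", ["X", "Z"]), ("parent", ["Z", "Y"])],
                edb))
# {('bert', 'derek'), ('alice', 'frank')}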
25.5.4 Concepts for Recursive Query Processing in Datalog
Naive Strategy
Seminaive Strategy
The Magic Set Rule Rewriting Technique
Query processing can be separated into two approaches:
• Pure evaluation approach: Creating a query evaluation plan that produces an answer to the query.
• Rule rewriting approach: Rewriting the rules so that they can be evaluated by a more efficient plan.
Many approaches have been presented for both recursive and nonrecursive queries. We discussed an approach to nonrecursive query evaluation earlier. Here we first define some terminology for recursive queries, then discuss the naive and seminaive approaches to query evaluation—which generate simple plans—and then present the magic set approach, which is an optimization based on rule rewriting.
We have already seen examples involving recursive rules, where the same predicate occurs in the head and in the body of a rule. Another example is
ancestor(X,Y) :- ancestor(X,Z), parent(Z,Y)
which states that Y is an ancestor of X if Z is an ancestor of X and Y is a parent of Z. It is used in conjunction with the rule
ancestor(X,Y) :- parent(X,Y)
which states that if Y is a parent of X, then Y is an ancestor of X.
A rule is said to be linearly recursive if the recursive predicate appears once and only once in the RHS of the rule. For example,
sg(X,Y) :- parent(X,XP), parent(Y,YP), sg(XP,YP)
is a linear rule in which the predicate sg (same-generation cousins) is used only once in the RHS. The rule states that X and Y are same-generation cousins if their parents are same-generation cousins. The rule
ancestor(X,Y) :- ancestor(X,Z), parent(Z,Y)
is called left linearly recursive, while the rule
ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y)
is called right linearly recursive.
Notice that the rule
ancestor(X,Y) :- ancestor(X,Z), ancestor(Z,Y)
is not linearly recursive. It is believed that most "real-life" rules can be described as linearly recursive rules; algorithms have been defined to execute linear sets of rules efficiently. The preceding definitions become more involved when a set of rules with predicates that occur on both the LHS and the RHS of rules is considered.
A predicate whose relation is stored in the database is called an extensional database (EDB)
predicate, while a predicate for which the corresponding relation is defined by logical rules is called an intensional database (IDB) predicate. Given a Datalog program with relations corresponding to the predicates, the "if" symbol, :-, may be replaced by an equality to form Datalog equations, without any loss of meaning. The resulting set of Datalog equations could potentially have many solutions. Given a set of relations for the EDB predicates, say R1, R2, ..., Rn, a fixed point of the Datalog equations is a solution for the relations corresponding to the IDB predicates of those equations.
The fixed point with respect to the given EDB relations, along with those relations, forms a model of the rules from which the Datalog equations were derived. However, it is not true that every model of a set of Datalog rules is a fixed point of the corresponding Datalog equations, because the model may have "too many" facts. It turns out that Datalog programs each have a unique minimal model containing any given EDB relations, and this also corresponds to the unique minimal fixed point with respect to those EDB relations.
Formally, given a family of solutions Si = (P1(i), ..., Pm(i)) to a given set of equations, the least fixed point of the set of equations is the solution whose corresponding relations are smallest, for all relations. For example, we say S1 ⊆ S2 if relation Pk(1) is a subset of relation Pk(2) for all k, 1 ≤ k ≤ m. Fixpoint theory was first developed in the field of recursion theory as a tool for explaining recursive functions. Since Datalog has the ability to express recursion, fixpoint theory is well suited for describing the semantics of recursive rules.
For example, if we represent a directed graph by the predicate edge(X,Y) such that edge (X,Y) is true if and only if there is an edge from node X to node Y in the graph, the paths in the graph may be
expressed by the following rules:
path(X,Y) :- edge(X,Y)
path(X,Y) :- path(X,Z), path(Z,Y)
Notice that there are other ways of defining paths recursively. Let us assume that relations P and A correspond to the predicates path and edge in the preceding rules. Relation P, the transitive closure of A, contains all possible pairs of nodes that have a path between them, and it corresponds to the least fixed-point solution of the equations that result from the preceding rules (Note 6). These rules can be turned into a single equation for the relation P corresponding to the predicate path:
P(X,Y) = A(X,Y) ∪ πX,Y(P(X,Z) ⋈ P(Z,Y))
Suppose that the nodes are 3, 4, 5 and A = {(3,4), (4,5)}. From the first and second rules we can infer that (3,4), (4,5), and (3,5) are in P. We need not look for any other paths, because P = {(3,4),(4,5),(3,5)} is a solution of the above equation:
{(3,4),(4,5),(3,5)} = {(3,4),(4,5)} ∪ πX,Y({(3,4),(4,5),(3,5)} ⋈ {(3,4),(4,5),(3,5)})
This solution constitutes a proof-theoretic meaning of the rules, as it was derived from the EDB relation A using just the rules. It is also the minimal model of the rules, or the least fixed point of the equation.
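As a concrete check, the following short Python sketch (names invented) iterates the equation P = A ∪ πX,Y(P ⋈ P) from the empty relation until nothing changes, arriving at exactly this least fixed point; the same idea underlies the naive strategy described next.

# Iterate the path equation from P = ∅ until a fixed point is reached.
A = {(3, 4), (4, 5)}          # the EDB relation for edge
P = set()                     # start from the empty relation
while True:
    new_P = A | {(x, y) for (x, z1) in P for (z2, y) in P if z1 == z2}
    if new_P == P:            # fixed point reached
        break
    P = new_P
print(P)                      # {(3, 4), (4, 5), (3, 5)} -- the least fixed point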
For evaluating a set of Datalog rules (equations) that may contain recursive rules, a large number of strategies have been proposed, details of which are beyond our scope. Here we illustrate three important techniques: the naive strategy, the seminaive strategy, and the use of magic sets.
Naive Strategy
The naive evaluation method is a pure evaluation, bottom-up strategy that computes the least model of a Datalog program. It is an iterative strategy: at each iteration all rules are applied to the set of tuples produced thus far to generate all implicit tuples. This iterative process continues until no more new tuples can be generated.
The naive evaluation process does not take into account query patterns. As a result, a considerable amount of redundant computation is done. We present two versions of the naive method, called the Jacobi and Gauss-Seidel solution methods; these methods get their names from well-known algorithms for the iterative solution of systems of equations in numerical analysis.
Assume the following system of relational equations, formed by replacing the :- symbol by an equality sign in a Datalog program:
Ri = Ei(R1, R2, ..., Rn)
The Jacobi method proceeds as follows. Initially, the variable relations Ri are set equal to the empty set. Then the computation Ri = Ei(R1, R2, ..., Rn), i = 1, ..., n, is iterated until none of the Ri changes between two consecutive iterations (i.e., until the Ri reach a fixpoint).
Algorithm 25.1 Jacobi naive strategy
Input: A system of algebraic equations and an EDB
Output: The values of the variable relations R1, R2, ..., Rn
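The body of Algorithm 25.1 is not reproduced in this copy, so the following Python sketch is only an illustration of the Jacobi iteration, under the assumption that each equation Ri = Ei(R1, ..., Rn) is supplied as a function; it is not the book's exact pseudocode.

# Hedged sketch of the Jacobi naive strategy (an assumption, not the book's
# pseudocode): all variable relations start empty, and at each iteration every
# equation is evaluated on the snapshot from the previous iteration.
def jacobi_naive(equations, n):
    """equations[i] is a function E_i taking the list [R_1, ..., R_n]."""
    R = [set() for _ in range(n)]            # initialize all variable relations to empty
    while True:
        S = [set(r) for r in R]              # snapshot of the previous iteration
        R = [equations[i](S) for i in range(n)]
        if R == S:                           # no relation changed: fixpoint reached
            return R

# Usage with the single ancestor equation A = πX,Y(A ⋈ P) ∪ P and the EDB below
P = {("bert", "alice"), ("bert", "george"), ("alice", "derek"),
     ("alice", "pat"), ("derek", "frank")}

def E_ancestor(S):
    A = S[0]
    return {(x, y) for (x, z1) in A for (z2, y) in P if z1 == z2} | P

print(jacobi_naive([E_ancestor], 1)[0])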
The convergence of the Jacobi method can be slightly improved if, at each step k, in order to compute the new value Ri(k), we substitute in Ei the values Rj(k) that have just been computed in the same iteration instead of the old values Rj(k-1). This variant of the Jacobi method is called the Gauss-Seidel method, and it produces the same result as the Jacobi algorithm. Consider the following example, where ancestor(X, Y) means X is an ancestor of Y and parent(X, Y) means X is a parent of Y:
ancestor(X,Y) :- parent(X,Y)
ancestor(X,Y) :- ancestor(X,Z), parent(Z,Y)
If we define a relation A for the predicate ancestor and a relation P for the predicate parent, the Datalog equation for the above rules can be written in the form
A(X,Y) = πX,Y(A(X,Z) ⋈ P(Z,Y)) ∪ P(X,Y)
Suppose the EDB is given as P = {(bert, alice), (bert, george), (alice, derek), (alice, pat), (derek, frank)}. Let us follow the Jacobi algorithm. The parent tree is shown in Figure 25.09.
Initially, we set A(0) = ∅, enter the repeat loop, and set condition = true. We then initialize S1 = A = ∅ and compute the first value of A. Since the first join involves an empty relation, we get
A(1) = P = {(bert, alice),(bert, george),(alice, derek),(alice, pat),(derek, frank)}
A(1) includes parents as ancestors. Since A(1) ≠ S1, condition is set to false. We therefore enter the second iteration with S1 set to A(1). Computing the value of A again, we get
A(2) = {(bert,alice), (bert,george), (alice,derek), (alice,pat), (derek,frank), (bert,derek), (bert,pat), (alice,frank)}
It can be seen that A(2) = A(1) ∪ {(bert,derek), (bert,pat), (alice,frank)}. Note that A(2) now includes grandparents as ancestors besides parents. Since A(2) ≠ S1, we iterate again, setting S1 to A(2):
A(3) = {(bert,alice), (bert,george), (alice,derek), (alice,pat), (derek,frank), (bert,derek), (bert,pat), (alice,frank), (bert,frank)}
Now A(3) = A(2) ∪ {(bert,frank)}; A(3) includes great-grandparents among the ancestors. Since A(3) is different from S1, we enter the next iteration, setting S1 = A(3). We now get
A(4) = {(bert,alice), (bert,george), (alice,derek), (alice,pat), (derek,frank), (bert,derek), (bert,pat), (alice,frank), (bert,frank)}
Finally, A(4) = A(3) = S1, so the evaluation is finished. Intuitively, from the above parental hierarchy, it is obvious that all ancestors have been computed.
Seminaive Strategy
Seminaive evaluation is a bottom-up technique designed to eliminate redundancy in the evaluation of tuples at different iterations. This method does not use any information about the structure of the program. There are two possible settings of the seminaive algorithm: the (pure) seminaive and the pseudo rewriting seminaive.
Consider the Jacobi algorithm, and let Ri(k) be the temporary value of relation Ri at iteration step k. The differential of Ri at step k of the iteration is defined as
Di(k) = Ri(k) − Ri(k−1)
When the whole system is linear, Di can be substituted for Ri in the Jacobi or Gauss-Seidel algorithms: the result is obtained by the union of the newly obtained term and the old one.
Algorithm 25.2 Seminaive strategy
Input: A system of algebraic equations and an EDB
Output: The values of the variable relations R1, R2, ..., Rn
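As with Algorithm 25.1, the algorithm body is not reproduced in this copy; the following Python sketch is an assumed illustration of the seminaive idea for the single linear ancestor equation, joining only the differential with P at each iteration, and is not the book's pseudocode.

# Hedged sketch of seminaive evaluation for A = πX,Y(A ⋈ P) ∪ P: only the
# differential D (the tuples that were new at the previous iteration) is
# joined with P at each step.
def seminaive_ancestor(P):
    A = set()
    D = set(P)                               # D(1): the exit rule contributes P
    while D:                                 # iterate until no new tuples appear
        A |= D
        derived = {(x, y) for (x, z1) in D for (z2, y) in P if z1 == z2}
        D = derived - A                      # keep only genuinely new tuples
    return A

P = {("bert", "alice"), ("bert", "george"), ("alice", "derek"),
     ("alice", "pat"), ("derek", "frank")}
print(seminaive_ancestor(P))                 # same result as the naive strategy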
For the ancestor example above, the seminaive computation proceeds as follows:
D(0) = ∅, A(0) = ∅
D(1) = P = {(bert,alice), (bert,george), (alice,derek), (alice,pat), (derek,frank)}
Hence, A(1) = A(0) ∪ D(1) = P, and the next differential is obtained by joining only D(1) with P and discarding the tuples already in A(1):
D(2) = {(bert,derek), (bert,pat), (alice,frank)} − A(1)
= {(bert,derek), (bert,pat), (alice,frank)}
Continuing in the same way, D(3) = {(bert,frank)} and D(4) = ∅, at which point A contains exactly the tuples computed by the naive strategy, but each iteration has joined only the newly derived tuples with P.
The Magic Set Rule Rewriting Technique
The problem addressed by the magic sets rule rewriting technique is that frequently a query asks not for the entire relation corresponding to an intensional predicate but for a small subset of this relation. Consider the following program:
sg(X,Y) :- flat(X,Y)
sg(X,Y) :- up(X,U), sg(U,V), down(V,Y)
Here, sg is a predicate ("same-generation cousin"), and the head of each of the two rules is the atomic formula sg(X, Y). The other predicates found in the rules are flat, up, and down. These are presumably stored extensionally as facts, while the relation for sg is intensional—that is, defined only by the rules. For a query like sg(john, Z)—that is, "who are the same-generation cousins of John?"—the evaluation should examine only the part of the database that is relevant, namely the part that involves individuals somehow connected to John.
A top-down, or backward-chaining, search would start from the query as a goal and use the rules from head to body to create more goals; none of these goals would be irrelevant to the query, although some might cause us to explore paths that happen to "dead-end." On the other hand, a bottom-up or forward-chaining search, working from the bodies of the rules to the heads, would cause us to infer sg facts that would never even be considered in the top-down search. Yet bottom-up evaluation is desirable because it avoids the problems of looping and repeated computation that are inherent in the top-down approach, and it allows us to use set-at-a-time operations, such as relational joins.
Magic sets rule rewriting is a technique that allows us to rewrite the rules as a function of the query form only—that is, it considers which arguments of the predicate are bound to constants and which are variable—so that the advantages of the top-down and bottom-up methods are combined. The technique focuses on the goal inherent in top-down evaluation but combines this with the looping freedom, easy termination testing, and efficient evaluation of bottom-up evaluation. Instead of giving the method, of which many variations are known and used in practice, we explain the idea with an example.
Given the previously stated rules and the query sg(john, Z), a typical magic sets transformation of the rules would be
sg(X,Y) :- magic-sg(X), flat(X,Y)
sg(X,Y) :- magic-sg(X), up(X,U), sg(U,V), down(V,Y)
magic-sg(U) :- magic-sg(X), up(X,U)
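To see how these rewritten rules restrict the computation, the following Python sketch (with invented up, flat, and down facts) first seeds the magic predicate with the query constant john, computes its fixpoint from the third rule, and then evaluates sg bottom-up only for first arguments in the magic set; it is an illustration of the idea, not of any particular system.

# Invented EDB facts for illustration only.
up   = {("john", "mary"), ("mary", "anne")}
flat = {("anne", "bob")}
down = {("bob", "carl"), ("carl", "dave")}

# magic-sg(U) :- magic-sg(X), up(X,U), seeded with the query constant john
magic = {"john"}
while True:
    new = {u for x in magic for (x2, u) in up if x2 == x} - magic
    if not new:
        break
    magic |= new                              # magic = {john, mary, anne}

# sg is now computed bottom-up only for first arguments in the magic set
sg = set()
while True:
    step = {(x, y) for (x, y) in flat if x in magic}
    step |= {(x, y) for (x, u) in up if x in magic
                    for (u2, v) in sg if u2 == u
                    for (v2, y) in down if v2 == v}
    if step <= sg:                            # nothing new was derived
        break
    sg |= step
print({z for (x, z) in sg if x == "john"})    # answers to sg(john, Z)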
While the magic sets technique was originally developed to deal with recursive queries, it is applicable to nonrecursive queries as well. Indeed, it has been adapted to deal with SQL queries (which contain features such as grouping, aggregation, arithmetic conditions, and multiset relations that are not present in pure logic queries), and it has been found to be useful for evaluating nonrecursive "nested" SQL queries.
25.5.5 Stratified Negation
A deductive database query language can be enhanced by permitting negated literals in the bodies of rules in programs. However, the important minimal-model property of rules, which we discussed earlier, no longer holds. In the presence of negated literals, a program may not have a minimal or least model. For example, the program
p(a) :- not p(b)
has two minimal models: {p(a)} and {p(b)} (each is a model, and neither is contained in the other).
A detailed analysis of the concept of negation is beyond our scope. But for practical purposes, we next discuss stratified negation, an important notion used in deductive system implementations.
The meaning of a program with negation is usually given by some "intended" model. The challenge is to develop algorithms for choosing an intended model that does the following:
1. Makes sense to the user of the rules.
2. Allows us to answer queries about the model efficiently.
In particular, it is desirable that the model work well with the magic sets transformation, in the sense that we can modify the rules by some suitable generalization of magic sets, and the resulting rules allow (only) the relevant portion of the selected model to be computed efficiently. (Alternatively, other efficient evaluation techniques must be developed.)
One important class of negation that has been extensively studied is stratified negation. A program is stratified if there is no recursion through negation. Programs in this class have a very intuitive semantics and can be efficiently evaluated. The example that follows describes a stratified program. Consider the following program P2:
r1: ancestor(X,Y) :- parent(X,Y)
r2: ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y)
r3: nocyc(X,Y) :- ancestor(X,Y), not(ancestor(Y,X))
Notice that the third rule has a negative literal in its body. This program is stratified because the definition of the predicate nocyc depends (negatively) on the definition of ancestor, but the definition of ancestor does not depend on the definition of nocyc. We are not equipped to give a more formal definition without introducing additional notation and definitions. A bottom-up evaluation of P2 would first compute a fixed point of rules r1 and r2 (the rules defining ancestor). Rule r3 is applied only when all the ancestor facts are known.
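The following Python sketch (with invented parent facts) illustrates this two-stratum evaluation: the ancestor fixpoint is computed first, and the rule with the negated literal is applied only afterwards.

# Stratified evaluation of P2 on invented data (note the cycle a-b-c).
parent = {("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")}

# Stratum 1: ancestor = fixpoint of r1 and r2
ancestor = set()
while True:
    new = parent | {(x, y) for (x, z1) in parent for (z2, y) in ancestor if z1 == z2}
    if new == ancestor:
        break
    ancestor = new

# Stratum 2: nocyc(X,Y) :- ancestor(X,Y), not ancestor(Y,X)
nocyc = {(x, y) for (x, y) in ancestor if (y, x) not in ancestor}
print(nocyc)     # {('a', 'd'), ('b', 'd'), ('c', 'd')}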
A natural extension of stratified programs is the class of locally stratified programs. Intuitively, a program P is locally stratified for a given database if, when we substitute constants for variables in all possible ways, the resulting instantiated rules do not have any recursion through negation.
25.6 Deductive Database Systems
25.6.1 The LDL System
25.6.2 NAIL!
25.6.3 The CORAL System
The founding event of the deductive database field can be considered to be the Toulouse workshop on "Logic and Databases" organized by Gallaire, Minker, and Nicolas in 1977. The next period of explosive growth started with the setting up of MCC (the Microelectronics and Computer Technology Corporation), which was a reaction to the Japanese Fifth Generation Project. Several experimental deductive database systems have been developed, and a few have been commercially deployed. In this section we briefly review three different implementations of the ideas presented so far: LDL, NAIL!, and CORAL.
25.6.1 The LDL System
Background, Motivation, and Overview
The LDL Data Model and Language
The Logic Data Language (LDL) project at Microelectronics and Computer Technology Corporation (MCC) was started in 1984 with two primary objectives:
• To develop a system that extends the relational model yet exploits some of the desirable features of an RDBMS (relational database management system)
• To enhance the functionality of a DBMS so that it works as a deductive DBMS and also supports the development of general-purpose applications
The resulting system is now a deductive DBMS made available as a product. In this section, we briefly survey the highlights of the technical approach taken by LDL and consider its important features.
Background, Motivation, and Overview
The design of the LDL language may be viewed as a rule-based extension to domain calculus-based languages (see Section 9.4). The LDL system has tried to combine the expressive capability of Prolog with the functionality and facility of a general-purpose DBMS. The main drawback experienced by earlier systems that coupled Prolog with an RDBMS is that Prolog is navigational (tuple-at-a-time), whereas in RDBMSs the user formulates a correct query and leaves the optimization of query execution to the system. The navigational nature of Prolog is manifested in the ordering of rules and goals to achieve optimal execution and termination. Two options are available:
• Make Prolog more "database-like" by adding navigational database management features. (For an example of a navigational query language, see the network model DML in Section C.4 of Appendix C.)
• Modify Prolog into a general-purpose declarative logic language
The latter option was chosen in LDL, yielding a language that is different from Prolog in its constructs and style of programming in the following ways:
• Rules are compiled in LDL
• There is a notion of a "schema" of the fact base in LDL at compile time. The fact base is freely updated at run time. Prolog, on the other hand, treats facts and rules identically, and it subjects facts to interpretation when they are changed.
• LDL does not follow the resolution and unification technique used in Prolog systems that are based on backward chaining.
• The LDL execution model is simpler, based on the operation of matching and the computation of "least fixed points." These operators, in turn, use simple extensions to the relational algebra.
The first LDL implementation, completed in 1987, was based on a language called FAD. A later implementation, completed in 1988, is called SALAD and underwent further changes as it was tested against the "real-life" applications described in Section 25.8. The current prototype is an efficient portable system for UNIX that assumes a single-tuple get-next interface between the compiled LDL program and an underlying fact manager.
The LDL Data Model and Language
With the design philosophy of LDL being to combine the declarative style of relational languages with the expressive power of Prolog, constructs in Prolog such as negation, set-of, updates, and cut have been dropped. Instead, the declarative semantics of Horn clauses was extended to support complex terms through the use of function symbols, called functors in Prolog.
A particular employee record can therefore be defined as follows:
Employee (Name (John Doe), Job(VP),
Education ({(High school, 1961),
(College (Fergusson, bs, physics), 1965),
(College (Michigan, phd, ie), 1976)}))
In the preceding record, VP is a simple term, whereas education is a complex term that consists of a term for high school and a nested relation containing the term for college and the year of graduation. LDL thus supports complex objects with an arbitrarily complex structure, including lists, set terms, trees, and nested relations. We can think of a compound term as a Prolog structure with the function symbol as the functor.
LDL allows updates in the bodies of rules. For instance, the rule
happy (Dept, Raise, Name) <-
emp (Name, Dept, Sal), Newsal = Sal+Raise,
-emp (Name, Dept, -), +emp(Name,Dept,Newsal)
removes the old emp fact for the employee (the - literal) and asserts a new one with the raised salary (the + literal).
Even though LDL's semantics is defined in a bottom-up fashion (for example, via stratification), the implementor can use any execution that is faithful to this declarative semantics. In particular, the execution can proceed bottom-up or top-down, or it may be a hybrid execution. These choices enable the compiler/optimizer to be selective in customizing the most appropriate modes of execution for the given program. The LDL compiler and optimizer can select from among several strategies: pipelined or lazy pipelined execution, and materialized or lazy materialized execution.
25.6.2 NAIL!
The NAIL! (Not Another Implementation of Logic!) project was started at Stanford University in 1985. The initial goal was to study the optimization of logic by using the database-oriented "all-solutions" model. The aim of the project was to support the optimal execution of Datalog goals over an RDBMS. Assuming that a single workable strategy was inappropriate for all logic programs in general, an extensible architecture was developed, which could be enhanced through progressive additions.
In collaboration with the MCC group, this project was responsible for the idea of magic sets and the first work on regular recursions. In addition, many important contributions to coping with negation and aggregation on logical rules were made by the project, including stratified negation, well-founded negation, and modularly stratified negation. The architecture of NAIL! is illustrated in Figure 25.10.
The preprocessor rewrites the source NAIL! program by isolating "negation" and "set" operators, and by replacing disjunction with several conjunctive rules. After preprocessing, the NAIL! program is represented through its predicates and rules. The strategy selection module takes as input the user's goal and produces as output the best execution strategies for solving the user's goal and all the other goals related to it, using the internal language ICODE.
The ICODE statements produced as a result of the strategy selection process are optimized and then executed through an interpreter, which translates ICODE retrieval statements to SQL when needed.
An initial prototype system was built but later abandoned because the purely declarative paradigm was found to be unworkable for many applications. The revised system uses a core language, called GLUE, which consists essentially of single logical rules, with the power of SQL statements, wrapped in conventional language constructs such as loops, procedures, and modules. The original NAIL! language becomes a view mechanism for GLUE; it permits fully declarative specifications in situations where declarativeness is appropriate.
25.6.3 The CORAL System
The CORAL system, which was developed at the University of Wisconsin at Madison, builds on experience gained from the LDL project. Like LDL, the system provides a declarative language based on Horn clauses with an open architecture. There are many important differences, however, in both the language and its implementation. The CORAL system can be seen as a database programming language that combines important features of SQL and Prolog.
From a language standpoint, CORAL adapts LDL's set-grouping construct to be closer to SQL's GROUP BY construct. For example, consider
budget(Dname,sum(<Sal>)) :- dept(Dname,Ename,Sal)
This rule computes one budget tuple for each department, and each salary value is added as often as there are people with that salary in the given department. In LDL, the grouping and the sum operation cannot be combined in one step; more importantly, the grouping is defined to produce a set of salaries for each department. Therefore, computing the budget is harder in LDL. A related point is that SQL supports a multiset semantics for queries when the DISTINCT clause is not specified. CORAL supports such a multiset semantics as well. Thus the following rule can be defined to compute either a set of tuples or a multiset of tuples in CORAL, as occurs in SQL:
budget2(Dname,Sal) :- dept(Dname,Ename,Sal)
This raises an important point: How can a user specify which semantics (set or multiset) is desired? In SQL, the keyword DISTINCT is used; similarly, an annotation is provided in CORAL. In fact, CORAL supports a number of annotations that can be used to choose a desired semantics or to provide optimization hints to the CORAL system. The added complexity of queries in a recursive language makes optimization difficult, and the use of annotations often makes a big difference in the quality of the optimized evaluation plan.
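To make the set-versus-multiset distinction concrete, the following small Python sketch (not CORAL code, with invented dept facts) computes the departmental budget both ways: under multiset semantics every salary occurrence contributes to the sum, while under set semantics a duplicate salary is counted only once.

# Invented dept facts: (department, employee, salary); note the duplicate 100.
dept = [("toys", "ann", 100), ("toys", "bob", 100), ("toys", "sue", 120),
        ("tools", "joe", 90)]

budget_multiset = {}
for dname, ename, sal in dept:               # every occurrence is added
    budget_multiset[dname] = budget_multiset.get(dname, 0) + sal

budget_set = {}
for dname in {d for d, _, _ in dept}:        # duplicates collapse into a set first
    budget_set[dname] = sum({sal for d, _, sal in dept if d == dname})

print(budget_multiset)   # {'toys': 320, 'tools': 90}
print(budget_set)        # {'toys': 220, 'tools': 90} -- the duplicate 100 counted once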
CORAL supports a class of programs with negation and grouping that is strictly larger than the class of stratified programs. The bill-of-materials problem, in which the cost of a composite part is defined as the sum of the costs of all its atomic parts, is an example of a problem that requires this added generality.
CORAL is closer to Prolog than to LDL in supporting nonground tuples; thus, the tuple equal(X,X) can be stored in the database and denotes that every binary tuple in which the first and the second field values are the same is in the relation called equal. From an evaluation standpoint, CORAL's main evaluation techniques are based on bottom-up evaluation, which is very different from Prolog's top-down evaluation. However, CORAL also provides a Prolog-like top-down evaluation mode.
From an implementation perspective, CORAL implements several optimizations to deal with nonground tuples efficiently, in addition to techniques such as magic templates for pushing selections into recursive queries, pushing projections, and special optimizations of different kinds of (left- and right-) linear programs. It also provides an efficient way to compute nonstratified queries. A "shallow-compilation" approach is used, whereby the run-time system interprets the compiled plan. CORAL uses the EXODUS storage manager to provide support for disk-resident relations. It also has a good interface with C++ and is extensible, enabling a user to customize the system for special applications by adding new data types or relation implementations. An interesting feature is an explanation package that allows a user to examine graphically how a fact is generated; this is useful for debugging as well as for providing explanations.
25.7 Deductive Object-Oriented Databases
25.7.1 Overview of DOODs
25.7.2 VALIDITY
The emergence of deductive database concepts is contemporaneous with initial work in logic programming. Deductive object-oriented databases (DOODs) came about through the integration of the OO paradigm and logic programming. The observation that OO and deductive database systems generally have complementary strengths and weaknesses gave rise to the integration of the two paradigms.
25.7.1 Overview of DOODs
Since the late 1980s, several DOOD prototypes have been developed in universities and research laboratories. VALIDITY, which was developed at Bull, is the first industrial product in the DOOD arena. The LDL and CORAL systems we reviewed offer some additional object-oriented features (e.g., in CORAL++) and may be considered as DOODs.
The following broad approaches have been adopted in the design of DOOD systems:
• Language extension: An existing deductive language model is extended with object-oriented features. For example, Datalog is extended to support identity, inheritance, and other OO features.
• Language integration: A deductive language is integrated with an imperative programming language in the context of an object model or type system. The resulting system supports a range of standard programs, while allowing different and complementary programming paradigms to be used for different tasks, or for different parts of the same task. This approach was pioneered by the Glue-Nail system.
• Language reconstruction: An object model is reconstructed, creating a new logic language that includes object-oriented features. In this strategy, the goal is to develop an object logic that captures the essentials of the object-oriented paradigm and that can also be used as a deductive programming language in DOODs. The rationale behind this approach is the argument that language extensions fail to combine object-orientation and logic successfully, by losing declarativeness or by failing to capture all aspects of the object-oriented model.
25.7.2 VALIDITY
DEL Data Model
VALIDITY combines deductive capabilities with the ability to manipulate complex objects (OIDs, inheritance, methods, etc.). The ability to declaratively specify knowledge as deduction and integrity rules brings knowledge independence. Moreover, the logic-based language of deductive databases enables advanced tools, such as those for checking the consistency of a set of rules, to be developed. When compared with systems extending SQL technology, deductive systems offer more expressive declarative languages and cleaner semantics. VALIDITY provides the following:
1 A DOOD data model and language, called DEL (Datalog Extended Language)
2 An engine working along a client-server model
3 A set of tools for schema and rule editing, validation, and querying
The DEL data model provides object-oriented capabilities, similar to those offered by the ODMG data model (see Chapter 12), and includes both declarative and imperative features. The declarative features include deductive and integrity rules, with full recursion, stratified negation, disjunction, grouping, and quantification. The imperative features allow functions and methods to be written. The engine of VALIDITY integrates the traditional functions of a database (persistency, concurrency control, crash recovery, etc.) with advanced deductive capabilities for deriving information and verifying semantic integrity. The lowest-level component of the engine is a fact manager that integrates storage, concurrency control, and recovery functions. The fact manager supports fact identity and complex data items. In addition to locking, the concurrency control protocol integrates read-consistency technology, used in particular when verifying constraints. The higher-level component supports the DEL language and performs optimization, compilation, and execution of statements and queries. The engine also supports an SQL interface permitting SQL queries and updates to be run on VALIDITY data.
VALIDITY also has a deductive wrapper for SQL systems, called DELite. This supports a subset of DEL functionality (no constraints, no recursion, limited object capabilities, etc.) on top of commercial SQL systems.
DEL Data Model
The DEL data model integrates a rich type system with primitives to define persistent and derived data. The DEL type system consists of built-in types, which can be used to implement user-defined and composite types. Composite types are defined using four type constructors: (1) bag, (2) set, (3) list, and (4) tuple.
The basic unit of information in VALIDITY is called a fact. Facts are instances of predicates, which are logical constructs characterized by a name and a set of typed attributes. A fact specifies values for the attributes of the predicate of which it is an instance. There are four kinds of predicates and facts in VALIDITY:
1 Basis facts: Are persistent units of information stored in the database; they are instances of
basis predicates, which have attributes and methods and are organized into inheritance
hierarchies
2 Derived facts: Are deduced from basis facts stored in the database or other derived facts; they are instances of derived predicates
3 Computed predicates and facts: These are similar to derived predicates and facts, but they are computed by means of imperative code instead of derivation. The distance between two points is a typical example.
4 Built-in predicates and facts: These are special computed predicates and facts whose associated function is provided by VALIDITY. Comparison operators are an example.
Basis facts have an identity that is analogous to the notion of object identifier in OO databases. Further, external mappings can be defined for a predicate; they enable the retrieval of facts (through their fact-IDs) based on the value of some of their unique attributes. Basis predicates may also have methods in the OO sense—that is, functions can be invoked in the context of a specific fact.
25.8 Applications of Commercial Deductive Database Systems
25.8.1 LDL Applications
The LDL system has been applied to the following application domains:
• Enterprise modeling: This domain involves modeling the structure, processes, and constraints within an enterprise. Data related to an enterprise may result in an extended ER model containing hundreds of entities and relationships and thousands of attributes. A number of applications useful to designers of new applications (as well as to management) can be developed based on this "metadatabase," which contains dictionary-like information about the whole enterprise.
• Hypothesis testing or data dredging: This domain involves formulating a hypothesis, translating it into an LDL rule set and a query, and then executing the query against the given data to test the hypothesis. The process is repeated by reformulating the rules and the query. This has been applied to genome data analysis in the field of microbiology, where data dredging consists of identifying the DNA sequences from low-level digitized autoradiographs from experiments performed on E. coli bacteria.
• Software reuse: The bulk of the software for an application is developed in standard
procedural code, and a small fraction is rule-based and encoded in LDL. The rules give rise to
a knowledge base that contains the following elements:
A definition of each C module used in the system
A set of rules that defines ways in which modules can export/import functions, constraints, and so on
The "knowledge base" can be used to make decisions that pertain to the reuse of software subsets Modules can be recombined to satisfy specific tasks, as long as the relevant rules are satisfied This is being experimented with in banking software
25.8.2 VALIDITY Applications
Knowledge independence is a term used by VALIDITY developers to refer to a technical version of business rule independence. From a database standpoint, it is a step beyond data independence that brings about the integration of data and rules. The goal is to achieve streamlining of application development (multiple applications share rules managed by the database), application maintenance (changes in definitions and in regulations are more easily made), and ease of use (interactions are done through high-level tools enabled by the logic foundation). For instance, it simplifies the task of the application programmer, who does not need to include tests in the application to guarantee the soundness of transactions. VALIDITY claims to be able to express, manage, and apply the business rules governing the interactions among various processes within a company.
VALIDITY is an appropriate tool for applying software engineering principles to application development. It allows the formal specification of an application in the DEL language, which can then be directly compiled. This eliminates the error-prone step that most methodologies based on entity-relationship conceptual designs and relational implementations require between specification and compilation. The following are some application areas of the VALIDITY system:
• Electronic commerce: In electronic commerce, complex customer profiles have to be matched
against target descriptions. The profiles are built from various data sources. In a current application, demographic data and viewing history compose the viewer's profiles. The matching process is also described by rules, and computed predicates deal with numeric computations. The declarative nature of DEL makes the formulation of the matching algorithm easy.
• Rules-governed processes: In a rules-governed process, well-defined rules define the actions to be performed. An application prototype has been developed whose goal is to handle the management of dangerous gases placed in containers, an activity governed by a large number of frequently changing regulations. The classes of dangerous materials are modeled as DEL classes. The possible locations for the containers are constrained by rules, which reflect the regulations. In the case of an incident, deduction rules identify potential accidents. The main advantage of VALIDITY is the ease with which new regulations are taken into account.
• Knowledge discovery: The goal of knowledge discovery is to find new data relationships by analyzing existing data (see Section 26.2). An application prototype developed by the University of Illinois utilizes already existing minority student data that has been enhanced with rules in DEL.
• Concurrent engineering: A concurrent engineering application deals with large amounts of centralized data, shared by several participants. An application prototype has been developed in the area of civil engineering. The design data is modeled using the object-orientation power of the DEL language. When an inconsistency is detected, a new rule models the identified problem. Once a solution has been identified, it is turned into a constraint. DEL is able to handle the transformation of rules into constraints, and it can also handle any closed formula as an integrity constraint.
25.9 Summary
In this chapter we introduced deductive database systems, a relatively new branch of database management. This field has been influenced by logic programming languages, particularly by Prolog. A subset of Prolog called Datalog, which contains function-free Horn clauses, is primarily used as the basis of current deductive database work. Concepts of Datalog were introduced here. We discussed the standard backward-chaining inferencing mechanism of Prolog and a forward-chaining bottom-up strategy. The latter has been adapted to evaluate queries dealing with relations (extensional databases) by using standard relational operations together with Datalog. Procedures for nonrecursive and recursive query processing were discussed, and algorithms were presented for naive and seminaive evaluation of recursive queries. Negation is particularly difficult to deal with in such deductive databases; a popular concept called stratified negation was introduced in this regard.
We surveyed a commercial deductive database system called LDL, originally developed at MCC, and other experimental systems called CORAL and NAIL!. The latest deductive database implementations are called DOODs. They combine the power of object orientation with deductive capabilities. The most recent entry on the commercial DOOD scene is VALIDITY, which we discussed here briefly. The deductive database area is still in an experimental stage. Its adoption by industry will give a boost to its development. Toward this end, we mentioned practical applications in which LDL and VALIDITY are proving to be very valuable.
Exercises
25.1 Add the following facts to the example database in Figure 25.03:
supervise(ahmad,bob), supervise(franklin,gwen)
First modify the supervisory tree in Figure 25.01(b) to reflect this change. Then modify the diagram in Figure 25.04 showing the top-down evaluation of the query superior(james, Y).
25.2 Consider the following set of facts for the relation parent(X, Y), where Y is the parent of X:
parent(a,aa), parent(a,ab), parent(aa,aaa), parent(aa,aab), parent(aaa,aaaa), parent(aaa,aaab)
Consider the rules
1: ancestor(X,Y) :- parent(X,Y)
2: ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y)
which define the ancestor Y of X as above.
a Show how to solve the Datalog query
ancestor(aa,X)?
using the naive strategy Show your work at each step
b Show the same query by computing only the changes in the ancestor relation and using that in rule 2 each time
[This question is derived from Bancilhon and Ramakrishnan (1986).]
25.3 Consider a deductive database with the following rules:
ancestor(X,Y) :- father(X,Y)
ancestor(X,Y) :- father(X,Z), ancestor(Z,Y)
Notice that "father(X, Y)" means that Y is the father of X; "ancestor(X, Y)" means that
Y is the ancestor of X Consider the fact base
father(Harry,Issac), father(Issac,John), father(John,Kurt)
a Construct a model-theoretic interpretation of the above rules using the given facts.
b Consider that a database contains the above relation father(X, Y), another relation brother(X, Y), and a third relation birth(X, B), where B is the birthdate of person X. State a rule that computes the first cousins of the following variety: their fathers must be brothers.
c Show a complete Datalog program with fact-based and rule-based literals that computes the following relation: a list of pairs of cousins, where the first person is born after 1960 and the second after 1970. You may use "greater than" as a built-in predicate. (Note: Sample facts for brother, birth, and person must also be shown.)
25.4 Consider the following rules:
reachable(X,Y) :- flight(X,Y)
reachable(X,Y) :- flight(X,Z), reachable(Z,Y)
where reachable(X, Y) means that city Y can be reached from city X, and flight(X, Y) means that there is a flight to city Y from city X
a Construct fact predicates that describe the following:
i Los Angeles, New York, Chicago, Atlanta, Frankfurt, Paris, Singapore, Sydney are cities
ii The following flights exist: LA to NY, NY to Atlanta, Atlanta to Frankfurt, Frankfurt to Atlanta, Frankfurt to Singapore, and Singapore to Sydney
(Note: No flight in reverse direction can be automatically assumed.)
b Is the given data cyclic? If so, in what sense?
c Construct a model theoretic interpretation (that is, an interpretation similar to the one shown in Figure 25.03) of the above facts and rules
d Consider the query
reachable(Atlanta,Sydney)?
How will this query be executed using naive and seminaive evaluation? List the series
of steps it will go through
e Consider the following rule-defined predicates:
round-trip-reachable(X,Y) :- reachable(X,Y), reachable(Y,X)
duration(X,Y,Z)
Draw a predicate dependency graph for the above predicates. (Note: duration(X, Y, Z) means that you can take a flight from X to Y in Z hours.)
f Consider the following query: What cities are reachable in 12 hours from Atlanta? Show how to express it in Datalog. Assume built-in predicates like greater-than(X, Y). Can this be converted into a relational algebra statement in a straightforward way? Why or why not?
g Consider the predicate population(X, Y), where Y is the population of city X. Consider the following query: List all possible bindings of the predicate pair (X, Y), where Y is a city that can be reached in two flights from city X, which has over 1 million people. Show this query in Datalog. Draw a corresponding query tree in relational algebraic terms.
25.5 Consider the following rules:
sgc(X,Y) :- eq(X,Y)
sgc(X,Y) :- par(X,X1), sgc(X1,Y1), par(Y,Y1)
and the EDB PAR = {(d, g), (e, g), (b, d), (a, d), (a, h), (c, e)}. What is the result of the query
sgc(a,Y)?
Solve using the naive and seminaive methods
25.6 The following rules have been given:
path(X,Y) :- arc(X,Y)
path(X,Y) :- path(X,Z), path(Z,Y)
Suppose that the nodes in a graph are {a, b, c, d} and there are no arcs. Let the set of paths be P = {(a, b), (c, d)}. Show that this model is not a fixed point.
25.7 Consider the frequent flyer Skymiles program database at an airline. It maintains the following relations:
99status(X,Y), 98status(X,Y), 98Miles(X,Y)
The status data refers to passenger X having status Y for the year, where Y can be regular, silver, gold, or platinum. Let the requirements for achieving gold status be expressed by:
99status(X,’gold’) :- 98status(X,’gold’) AND 98Miles(X,Y) AND Y>45000
99status(X,’gold’) :- 98status(X,’platinum’) AND 98Miles(X,Y) AND Y>40000
99status(X,’gold’) :- 98status(X,’regular’) AND 98Miles(X,Y) AND Y>50000
98Miles(X, Y) gives the miles Y flown by passenger X in 1998. Assume that similar rules exist for reaching other statuses.
a Make up a set of other reasonable rules for achieving platinum status
b Is the above programmable in DATALOG? Why or why not?
c Write a Prolog program with the above rules, populate the predicates with sample data, and show how a query like 99status(‘John Smith’, Y) is computed in Prolog.
25.8 Consider a tennis tournament database with predicates rank(X, Y): X holds rank Y, beats(X1, X2): X1 beats X2, and superior(X1, X2): X1 is a superior player to X2. Assume that if a player beats another player, he is superior to that player, and that if player 1 beats player 2 and player 2 is superior to player 3, then player 1 is superior to player 3.
Construct a set of recursive rules using the above predicates (Note: We shall hypothetically
assume that there are no "upsets"—that the above rule is always met.)
a Construct a set of recursive rules
b Populate data for beats relation with 10 players playing 3 matches each
c Show a computation of the superior table using this data
d Does the superior relation have a fixpoint? Why or why not? Explain.
For the population of players in the database, assuming John is one of the players, how do you compute "superior(john, X)?" using the naive and seminaive algorithms?
Selected Bibliography
The early developments of the logic and database approach are surveyed by Gallaire et al. (1984). Reiter (1984) provides a reconstruction of relational database theory, while Levesque (1984) provides a discussion of incomplete knowledge in light of logic. Gallaire and Minker (1978) provide an early book on this topic. A detailed treatment of logic and databases appears in Ullman (1989, vol. 2), and there is a related chapter in Volume 1 (1988). Ceri, Gottlob, and Tanca (1990) present a comprehensive yet concise treatment of logic and databases. Das (1992) is a comprehensive book on deductive databases and logic programming. The early history of Datalog is covered in Maier and Warren (1988). Clocksin and Mellish (1994) is an excellent reference on the Prolog language.
Aho and Ullman (1979) provide an early algorithm for dealing with recursive queries, using the least fixed-point operator. Bancilhon and Ramakrishnan (1986) give an excellent and detailed description of the approaches to recursive query processing, with detailed examples of the naive and seminaive approaches. Excellent survey articles on deductive databases and recursive query processing include Warren (1992) and Ramakrishnan and Ullman (1993). A complete description of the seminaive approach based on relational algebra is given in Bancilhon (1985). Other approaches to recursive query processing include the recursive query/subquery strategy of Vieille (1986), which is a top-down interpreted strategy, and the Henschen-Naqvi (1984) top-down compiled iterative strategy. Balbin and Rao (1987) discuss an extension of the seminaive differential approach for multiple predicates.
The original paper on magic sets is by Bancilhon et al. (1986). Beeri and Ramakrishnan (1987) extends it. Mumick et al. (1990) show the applicability of magic sets to nonrecursive nested SQL queries. Other approaches to optimizing rules without rewriting them appear in Vieille (1986, 1987). Kifer and Lozinskii (1986) propose a different technique. Bry (1990) discusses how the top-down and bottom-up approaches can be reconciled. Whang and Navathe (1992) describe an extended disjunctive normal form technique to deal with recursion in relational algebra expressions for providing an expert system interface over a relational DBMS.
Chang (1981) describes an early system for combining deductive rules with relational databases. The LDL system prototype is described in Chimenti et al. (1990). Krishnamurthy and Naqvi (1989) introduce the "choice" notion in LDL. Zaniolo (1988) discusses the language issues for the LDL system. A language overview of CORAL is provided in Ramakrishnan et al. (1992), and the implementation is described in Ramakrishnan et al. (1993). An extension to support object-oriented features, called CORAL++, is described in Srivastava et al. (1993). Ullman (1985) provides the basis for the NAIL! system, which is described in Morris et al. (1987). Phipps et al. (1991) describe the GLUE-NAIL! deductive database system.
Zaniolo (1990) reviews the theoretical background and the practical importance of deductive databases. Nicolas (1997) gives an excellent history of the developments leading up to DOODs. Falcone et al. (1997) survey the DOOD landscape. References on the VALIDITY system include Friesen et al. (1995), Vieille (1997), and Dietrich et al. (1999).
The most commonly chosen domain is finite and is called the Herbrand Universe.
Note 5
Notice that, in our example, the order of search is quite similar for both forward and backward chaining. However, this is not generally the case.
Note 6
For a detailed discussion of fixed points, consult Ullman (1988).
Chapter 26: Data Warehousing And Data Mining
Organizations increasingly need to supply their management with information at the correct level of detail to support decision making. Data warehousing, on-line analytical processing (OLAP), and data mining provide this functionality. In this chapter we give a broad overview of each of these technologies.
The market for such support has been growing rapidly since the mid-1990s. As managers become increasingly aware of the sophisticated analytic capabilities of these data-based systems, they look for more sophisticated support for their key organizational decisions.
26.1 Data Warehousing
26.1.1 Terminology and Definitions
26.1.2 Characteristics of Data Warehouses
26.1.3 Data Modeling for Data Warehouses
26.1.4 Building a Data Warehouse
26.1.5 Typical Functionality of Data Warehouses
26.1.6 Difficulties of Implementing Data Warehouses
26.1.7 Open Issues in Data Warehousing
Because data warehouses have been developed in numerous organizations to meet particular needs, there is no single, canonical definition of the term data warehouse (Note 1). Professional magazine articles and books in the popular press have elaborated on the meaning in a variety of ways. Vendors have capitalized on the popularity of the term to help market a variety of related products, and consultants have provided a large variety of services, all under the data warehousing banner. However, data warehouses are quite distinct from traditional databases in their structure, functioning, performance, and purpose.
26.1.1 Terminology and Definitions
W. H. Inmon (Note 2) characterized a data warehouse as "a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions." Data warehouses provide access to data for complex analysis, knowledge discovery, and decision making.
They support high-performance demands on an organization's data and information. Several types of applications—OLAP, DSS, and data mining applications—are supported. OLAP (on-line analytical processing) is a term used to describe the analysis of complex data from the data warehouse. In the hands of skilled knowledge workers, OLAP tools use distributed computing capabilities for analyses that require more storage and processing power than can be economically and efficiently located on an individual desktop. DSS (decision-support systems), also known as EIS (executive information systems)—not to be confused with enterprise integration systems—support an organization's leading decision makers with higher-level data for complex and important decisions. Data mining (which we will discuss in detail in Section 26.2) is used for knowledge discovery, the process of searching data for unanticipated new knowledge.
Traditional databases support on-line transaction processing (OLTP), which includes insertions, updates, and deletions, while also supporting information query requirements. Traditional relational databases are optimized to process queries that may touch a small part of the database and transactions that deal with insertions or updates of a few tuples per relation. Thus, they cannot be optimized for OLAP, DSS, or data mining. By contrast, data warehouses are designed precisely to support efficient extraction, processing, and presentation for analytic and decision-making purposes. In comparison to traditional databases, data warehouses generally contain very large amounts of data from multiple sources that may include databases from different data models and sometimes files acquired from independent systems and platforms.
26.1.2 Characteristics of Data Warehouses
To discuss data warehouses and distinguish them from transactional databases calls for an appropriate data model. The multidimensional data model (explained in more detail below) is a good fit for OLAP and decision-support technologies. In contrast to multidatabases, which provide access to disjoint and usually heterogeneous databases, a data warehouse is frequently a store of integrated data from multiple sources, processed for storage in a multidimensional model. Unlike most transactional databases, data warehouses typically support time-series and trend analysis, both of which require more historical data than are generally maintained in transactional databases. Compared with transactional databases, data warehouses are nonvolatile: information in the data warehouse changes far less often and may be regarded as non-real-time with periodic updating. In transactional systems, transactions are the unit and the agent of change to the database; by contrast, data warehouse information is much more coarse-grained and is refreshed according to a careful choice of refresh policy, usually incremental. Warehouse updates are handled by the warehouse's acquisition component, which provides all required preprocessing.
We can also describe data warehousing more generally as "a collection of decision support
technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions" (Note 3) Figure 26.01 gives an overview of the conceptual structure of a data warehouse It shows the entire data warehousing process This process includes possible cleaning and reformatting of data before its warehousing At the back end of the process, OLAP, data mining, and DSS may generate new relevant information such as rules; this information is shown in the figure going back into the warehouse The figure also shows that data sources may include files
Data warehouses have the following distinctive characteristics (Note 4):
• multidimensional conceptual view
• generic dimensionality
• unlimited dimensions and aggregation levels
• unrestricted cross-dimensional operations
• dynamic sparse matrix handling
• client-server architecture
• multi-user support
• accessibility
• transparency
• intuitive data manipulation
• consistent reporting performance
Data marts generally are targeted to a subset of the organization, such as a department, and are more tightly focused than an enterprise-wide data warehouse.
26.1.3 Data Modeling for Data Warehouses
Multidimensional models take advantage of inherent relationships in data to populate data in multidimensional matrices called data cubes. (These may be called hypercubes if they have more than three dimensions.) For data that lend themselves to dimensional formatting, query performance in multidimensional matrices can be much better than in the relational data model. Three examples of dimensions in a corporate data warehouse would be the corporation's fiscal periods, products, and regions.
A standard spreadsheet is a two-dimensional matrix. One example would be a spreadsheet of regional sales by product for a particular time period. Products could be shown as rows, with sales revenues for each region comprising the columns. (Figure 26.02 shows this two-dimensional organization.) Adding a time dimension, such as an organization's fiscal quarters, would produce a three-dimensional matrix, which could be represented using a data cube.
Figure 26.03 shows a three-dimensional data cube that organizes product sales data by fiscal quarters and sales regions. Each cell could contain data for a specific product, specific fiscal quarter, and specific region. By including additional dimensions, a data hypercube could be produced, although more than three dimensions cannot be easily visualized or presented graphically. The data can be queried directly in any combination of dimensions, bypassing complex database queries. Tools exist for viewing data according to the user's choice of dimensions. Changing from one dimensional hierarchy (orientation) to another is easily accomplished in a data cube by a technique called pivoting (also called rotation). In this technique the data cube can be thought of as rotating to show a different orientation of the axes. For example, you might pivot the data cube to show regional sales revenues as rows, the fiscal quarter revenue totals as columns, and the company's products in the third dimension (Figure 26.04). Hence, this technique is equivalent to having a separate regional sales table for each product, where each table shows quarterly sales for that product region by region.
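To make the cube and pivot operations concrete, the following minimal sketch (the products, regions, quarters, and figures are hypothetical, not those of the figures) represents a small three-dimensional sales cube as a NumPy array and pivots it by reordering its axes:

import numpy as np

# Axis labels for a tiny hypothetical sales cube: products x regions x quarters.
products = ["P1", "P2"]
regions = ["East", "West"]
quarters = ["Q1", "Q2", "Q3", "Q4"]

# cube[i, j, k] = sales of products[i] in regions[j] during quarters[k] (made-up numbers).
cube = np.arange(16).reshape(len(products), len(regions), len(quarters))

# Pivoting (rotation): reorder the axes so that regions become rows,
# quarters become columns, and products form the third dimension.
pivoted = cube.transpose(1, 2, 0)   # shape: (regions, quarters, products)

# Each slice along the last axis is now a region-by-quarter table for one product.
for i, p in enumerate(products):
    print(f"Sales table for {p}:")
    print(pivoted[:, :, i])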
Multidimensional models lend themselves readily to hierarchical views in what is known as roll-up display and drill-down display. Roll-up display moves up the hierarchy, grouping into larger units along a dimension (e.g., summing weekly data by quarter, or by year). Figure 26.05 shows a roll-up display that moves from individual products to a coarser grain of product categories. Shown in Figure 26.06, a drill-down display provides the opposite capability, furnishing a finer-grained view, perhaps disaggregating country sales by region and then regional sales by subregion and also breaking up products by styles.
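As an illustrative sketch (the weekly figures and week-to-quarter mapping below are hypothetical), rolling up weekly sales to quarterly totals is simply a grouped aggregation along the time dimension:

from collections import defaultdict

# Hypothetical weekly sales figures: (week number, sales amount).
weekly_sales = [(1, 120.0), (2, 95.5), (14, 130.0), (15, 110.0), (27, 80.0), (40, 150.0)]

def quarter_of(week: int) -> str:
    """Map a week number (1-52) to a fiscal quarter label."""
    return f"Q{min((week - 1) // 13 + 1, 4)}"

# Roll-up: aggregate the finer-grained weekly data into coarser quarterly totals.
quarterly_totals = defaultdict(float)
for week, amount in weekly_sales:
    quarterly_totals[quarter_of(week)] += amount

print(dict(quarterly_totals))
# A drill-down display would go the other way, expanding a quarter back into
# the weekly rows that were aggregated to produce it.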
The multidimensional storage model involves two types of tables: dimension tables and fact tables. A dimension table consists of tuples of attributes of the dimension. A fact table can be thought of as having tuples, one per recorded fact. Each fact contains some measured or observed variable(s) and identifies it (them) with pointers to dimension tables. The fact table contains the data and the dimensions identify each tuple in that data. Figure 26.07 contains an example of a fact table that can be viewed from the perspective of multiple dimension tables.
Two common multidimensional schemas are the star schema and the snowflake schema. The star schema consists of a fact table with a single table for each dimension (Figure 26.07). The snowflake schema is a variation on the star schema in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them (Figure 26.08). Some installations normalize data warehouses up to third normal form so that they can access the data warehouse to the finest level of detail. A fact constellation is a set of fact tables that share some dimension tables. Figure 26.09 shows a fact constellation with two fact tables, business results and business forecast, which share the dimension table called product. Fact constellations limit the possible queries for the warehouse.
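A minimal sketch of a star schema in code follows (the table and column names are hypothetical, not those of Figure 26.07): each dimension table holds descriptive attributes keyed by an ID, and each fact tuple stores the measured values plus foreign keys into the dimensions.

# Dimension tables: attribute tuples keyed by a surrogate ID.
product_dim = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Electronics"},
}
region_dim = {
    10: {"region": "East"},
    20: {"region": "West"},
}
quarter_dim = {
    100: {"fiscal_quarter": "Q1-2024"},
}

# Fact table: one tuple per recorded fact, holding the measures
# (units_sold, revenue) and pointers (foreign keys) into the dimensions.
sales_fact = [
    {"product_id": 1, "region_id": 10, "quarter_id": 100, "units_sold": 500, "revenue": 12500.0},
    {"product_id": 2, "region_id": 20, "quarter_id": 100, "units_sold": 300, "revenue": 9000.0},
]

# A star join: resolve each fact's dimension keys to readable attributes.
for fact in sales_fact:
    print(product_dim[fact["product_id"]]["name"],
          region_dim[fact["region_id"]]["region"],
          quarter_dim[fact["quarter_id"]]["fiscal_quarter"],
          fact["revenue"])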
Data warehouse storage also utilizes indexing techniques to support high-performance access (see Chapter 6 for a discussion of indexing). A technique called bitmap indexing constructs a bit vector for each value in a domain (column) being indexed; it works very well for domains of low cardinality. There is a 1 bit placed in the jth position in the vector if the jth row contains the value being indexed. For example, imagine an inventory of 100,000 cars with a bitmap index on car size. If there are four car sizes—economy, compact, midsize, and fullsize—there will be four bit vectors, each containing 100,000 bits (12.5 Kbytes), for a total index size of 50 Kbytes. Bitmap indexing can provide considerable input/output and storage space advantages in low-cardinality domains. With bit vectors, a bitmap index can provide dramatic improvements in comparison, aggregation, and join performance. In a star schema, dimensional data can be indexed to tuples in the fact table by join indexing. Join indexes are traditional indexes that maintain relationships between primary key and foreign key values. They relate the values of a dimension of a star schema to rows in the fact table. For example, consider a sales fact table that has city and fiscal quarter as dimensions. If there is a join index on city, for each city the join index maintains the tuple IDs of tuples containing that city. Join indexes may involve multiple dimensions.
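The following sketch (hypothetical column values; Python's arbitrary-precision integers stand in for packed bit vectors) shows the idea behind a bitmap index and why low-cardinality columns suit it so well: one bit per row per distinct value, and selections become bitwise operations.

def build_bitmap_index(column_values):
    """Build one bit vector per distinct value: bit j is set if row j holds that value."""
    index = {}
    for row_id, value in enumerate(column_values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

# Hypothetical car-size column for a handful of rows.
car_sizes = ["economy", "compact", "midsize", "economy", "fullsize", "compact"]
index = build_bitmap_index(car_sizes)

# Selection: rows where size is 'economy' -> bits 0 and 3 are set.
economy_rows = [j for j in range(len(car_sizes)) if (index["economy"] >> j) & 1]
print(economy_rows)  # [0, 3]

# With four sizes and 100,000 rows there would be four vectors of 100,000 bits
# each (12.5 Kbytes per vector), i.e., about 50 Kbytes for the whole index.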
Data warehouse storage can facilitate access to summary data by taking further advantage of the nonvolatility of data warehouses and a degree of predictability of the analyses that will be performed using them. Two approaches have been used: (1) smaller tables including summary data such as quarterly sales or revenue by product line, and (2) encoding of level (e.g., weekly, quarterly, annual) into existing tables. By comparison, the overhead of creating and maintaining such aggregations would likely be excessive in a volatile, transaction-oriented database.
26.1.4 Building a Data Warehouse
In constructing a data warehouse, builders should take a broad view of the anticipated use of the warehouse. There is no way to anticipate all possible queries or analyses during the design phase. However, the design should specifically support ad-hoc querying, that is, accessing data with any meaningful combination of values for the attributes in the dimension or fact tables. For example, a marketing-intensive consumer-products company would require different ways of organizing the data warehouse than would a nonprofit charity focused on fund raising. An appropriate schema should be chosen that reflects anticipated usage.
Acquisition of data for the warehouse involves the following steps:
• The data must be extracted from multiple, heterogeneous sources, for example, databases or other data feeds such as those containing financial market data or environmental data.
• Data must be formatted for consistency within the warehouse. Names, meanings, and domains of data from unrelated sources must be reconciled. For instance, subsidiary companies of a large corporation may have different fiscal calendars with quarters ending on different dates, making it difficult to aggregate financial data by quarter. Various credit cards may report their transactions differently, making it difficult to compute all credit sales. These format inconsistencies must be resolved.
• The data must be cleaned to ensure validity. Data cleaning is an involved and complex process that has been identified as the most labor-intensive component of data warehouse construction. For input data, cleaning must occur before the data are loaded into the warehouse. There is nothing about cleaning data that is specific to data warehousing and that could not be applied to a host database. However, since input data must be examined and formatted consistently, data warehouse builders should take this opportunity to check for validity and quality. Recognizing erroneous and incomplete data is difficult to automate, and cleaning that requires automatic error correction can be even tougher. Some aspects, such as domain checking, are easily coded into data cleaning routines, but automatic recognition of other data problems can be more challenging. (For example, one might require that City = 'San Francisco' together with State = 'CT' be recognized as an incorrect combination; a small validation sketch follows this list.) After such problems have been taken care of, similar data from different sources must be coordinated for loading into the warehouse. As data managers in the organization discover that their data are being cleaned for input into the warehouse, they will likely want to upgrade their data with the cleaned data. The process of returning cleaned data to the source is called backflushing (see Figure 26.01).
• The data must be fitted into the data model of the warehouse. Data from the various sources must be installed in the data model of the warehouse. Data may have to be converted from relational, object-oriented, or legacy databases (network and/or hierarchical) to a multidimensional model.
• The data must be loaded into the warehouse. The sheer volume of data in the warehouse makes loading the data a significant task. Monitoring tools for loads, as well as methods to recover from incomplete or incorrect loads, are required. With the huge volume of data in the warehouse, incremental updating is usually the only feasible approach. The refresh policy will probably emerge as a compromise that takes into account the answers to the following questions:
• How up-to-date must the data be?
• Can the warehouse go off-line, and for how long?
• What are the data interdependencies?
• What is the storage availability?
• What are the distribution requirements (such as for replication and partitioning)?
• What is the loading time (including cleaning, formatting, copying, transmitting, and overhead such as index rebuilding)?
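As referenced in the cleaning step above, here is a minimal sketch of a domain- and consistency-checking routine; the validation rules and the city-to-state lookup table are hypothetical examples, not rules taken from the text.

# Hypothetical reference data for consistency checks.
VALID_STATES = {"CA", "CT", "NY", "TX"}
CITY_STATE = {"San Francisco": "CA", "Hartford": "CT", "Austin": "TX"}

def clean_record(record):
    """Return a list of problems found in one source record (empty list = clean)."""
    problems = []
    # Domain checking: state code must come from a known domain.
    if record.get("state") not in VALID_STATES:
        problems.append(f"unknown state: {record.get('state')!r}")
    # Cross-field consistency: city and state must agree where the city is known.
    expected = CITY_STATE.get(record.get("city"))
    if expected and record.get("state") != expected:
        problems.append(f"city/state mismatch: {record['city']} is in {expected}")
    # Completeness: a zip code is required before loading.
    if not record.get("zip"):
        problems.append("missing zip code")
    return problems

print(clean_record({"city": "San Francisco", "state": "CT", "zip": "94103"}))
# ['city/state mismatch: San Francisco is in CA']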
As we have said, databases must strike a balance between efficiency in transaction processing and supporting query requirements (ad hoc user requests), but a data warehouse is typically optimized for access from a decision maker's needs. Data storage in a data warehouse reflects this specialization and involves the following processes:
• Storing the data according to the data model of the warehouse
• Creating and maintaining required data structures
• Creating and maintaining appropriate access paths
• Providing for time-variant data as new data are added
• Supporting the updating of warehouse data
• Refreshing the data
• Purging data
Although adequate time can be devoted initially to constructing the warehouse, the sheer volume of data in the warehouse generally makes it impossible to simply reload the warehouse in its entirety later on. Alternatives include selective (partial) refreshing of data and separate warehouse versions (requiring double storage capacity for the warehouse!). When the warehouse uses an incremental data refreshing mechanism, data may need to be periodically purged; for example, a warehouse that maintains data on the previous twelve business quarters may periodically purge its data each year.
Data warehouses must also be designed with full consideration of the environment in which they will reside Important design considerations include the following:
• Usage projections
• The fit of the data model
• Characteristics of available sources
• Design of the metadata component
• Modular component design
• Design for manageability and change
• Considerations of distributed and parallel architecture
We discuss each of these in turn. Warehouse design is initially driven by usage projections; that is, by expectations about who will use the warehouse and in what way. Choice of a data model to support this usage is a key initial decision. Usage projections and the characteristics of the warehouse's data sources are both taken into account. Modular design is a practical necessity to allow the warehouse to evolve with the organization and its information environment. In addition, a well-built data warehouse must be designed for maintainability, enabling the warehouse managers to effectively plan for and manage change while providing optimal support to users.
You may recall the term metadata from Chapter 2, where it was defined as the description of a database including its schema definition. The metadata repository is a key data warehouse component; it includes both technical and business metadata. The first, technical metadata, covers details of acquisition processing, storage structures, data descriptions, warehouse operations and maintenance, and access support functionality. The second, business metadata, includes the relevant business rules and organizational details supporting the warehouse.
The architecture of the organization's distributed computing environment is a major determining characteristic for the design of the warehouse. There are two basic distributed architectures: the distributed warehouse and the federated warehouse. For a distributed warehouse, all the issues of distributed databases are relevant, for example, replication, partitioning, communications, and consistency concerns. A distributed architecture can provide benefits particularly important to warehouse performance, such as improved load balancing, scalability of performance, and higher availability. A single replicated metadata repository would reside at each distribution site. The idea of the federated warehouse is like that of the federated database: a decentralized confederation of autonomous data warehouses, each with its own metadata repository. Given the magnitude of the challenge inherent to data warehouses, it is likely that such federations will consist of smaller-scale components, such as data marts. Large organizations may choose to federate data marts rather than build huge data warehouses.
26.1.5 Typical Functionality of Data Warehouses
Data Warehousing and Views
Data warehouses exist to facilitate complex, data-intensive, and frequent ad hoc queries. Accordingly, data warehouses must provide far greater and more efficient query support than is demanded of transactional databases. The data warehouse access component supports enhanced spreadsheet functionality, efficient query processing, structured queries, ad hoc queries, data mining, and materialized views. In particular, enhanced spreadsheet functionality includes support for state-of-the-art spreadsheet applications (e.g., MS Excel) as well as for OLAP application programs. These offer preprogrammed functionalities such as the following:
• Roll-up: Data is summarized with increasing generalization (e.g., weekly to quarterly to annually).
• Drill-down: Increasing levels of detail are revealed (the complement of roll-up).
• Pivot: Cross tabulation (also referred to as rotation) is performed.
• Slice and dice: Projection operations are performed on the dimensions.
• Sorting: Data is sorted by ordinal value.
• Selection: Data is available by value or range.
• Derived (computed) attributes: Attributes are computed by operations on stored and derived values.
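As a small illustrative sketch (the dimension values and numbers are again hypothetical), slicing fixes one dimension value while dicing selects subranges on several dimensions of the same cube:

import numpy as np

regions = ["East", "West", "North"]
quarters = ["Q1", "Q2", "Q3", "Q4"]
products = ["P1", "P2"]

# cube[r, q, p] = revenue for region r, quarter q, product p (made-up numbers).
cube = np.arange(24, dtype=float).reshape(len(regions), len(quarters), len(products))

# Slice: fix one dimension value (product = "P1") -> a 2-D region-by-quarter table.
slice_p1 = cube[:, :, products.index("P1")]

# Dice: keep a sub-cube (regions East/West, quarters Q1-Q2, all products).
dice = cube[0:2, 0:2, :]

print(slice_p1.shape, dice.shape)  # (3, 4) (2, 2, 2)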
Because data warehouses are free from the restrictions of the transactional environment, there is an increased efficiency in query processing. Among the tools and techniques used are query transformation, index intersection and union, special ROLAP (relational OLAP) and MOLAP (multidimensional OLAP) functions, SQL extensions, advanced join methods, and intelligent scanning (as in piggy-backing multiple queries).
Improved performance has also been attained with parallel processing. Parallel server architectures include symmetric multiprocessor (SMP), cluster, and massively parallel processing (MPP), and combinations of these.
Knowledge workers and decision makers use tools ranging from parametric queries to ad hoc queries to data mining. Thus, the access component of the data warehouse must provide support for structured queries (both parametric and ad hoc). These together make up a managed query environment. Data mining itself uses techniques from statistical analysis and artificial intelligence. Statistical analysis can be performed by advanced spreadsheets, by sophisticated statistical analysis software, or by custom-written programs. Techniques such as lagging, moving averages, and regression analysis are also commonly employed. Artificial intelligence techniques, which may include genetic algorithms and neural networks, are used for classification and are employed to discover knowledge from the data warehouse that may be unexpected or difficult to specify in queries. (We treat data mining in detail in Section 26.2.)
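As a brief hedged illustration of one of the statistical techniques just mentioned, a simple trailing moving average over a hypothetical monthly sales series can be computed as follows:

def moving_average(series, window=3):
    """Trailing moving average: each point averages the last `window` values."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

monthly_sales = [100.0, 120.0, 90.0, 130.0, 140.0, 110.0]  # made-up figures
print(moving_average(monthly_sales, window=3))
# [103.33..., 113.33..., 120.0, 126.66...]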
Data Warehousing and Views
Some people have considered data warehouses to be an extension of database views. Earlier we mentioned materialized views as one way of meeting requirements for improved access to data (see Chapter 8 for a discussion of views). Materialized views have been explored for their performance enhancement. Views, however, provide only a subset of the functions and capabilities of data warehouses. Views and data warehouses are alike in that they both are read-only extracts from databases and are subject-oriented. However, data warehouses are different from views in the following ways:
• Data warehouses exist as persistent storage instead of being materialized on demand.
• Data warehouses are not usually relational, but rather multidimensional. Views of a relational database are relational.
• Data warehouses can be indexed to optimize performance. Views cannot be indexed independently of the underlying databases.
• Data warehouses characteristically provide specific support of functionality; views cannot.
• Data warehouses provide large amounts of integrated and often temporal data, generally more than is contained in one database, whereas views are an extract of a database.
26.1.6 Difficulties of Implementing Data Warehouses
Some significant operational issues arise with data warehousing: construction, administration, and quality control. Project management—the design, construction, and implementation of the warehouse—is an important and challenging consideration that should not be underestimated. The building of an enterprise-wide warehouse in a large organization is a major undertaking, potentially taking years from conceptualization to implementation. Because of the difficulty and amount of lead time required for such an undertaking, the widespread development and deployment of data marts may provide an attractive alternative, especially to those organizations with urgent needs for OLAP, DSS, and/or data mining support.
The administration of a data warehouse is an intensive enterprise, proportional to the size and complexity of the warehouse. An organization that attempts to administer a data warehouse must realistically understand the complex nature of its administration. Although designed for read access, a data warehouse is no more a static structure than any of its information sources. Source databases can be expected to evolve. The warehouse's schema and acquisition component must be expected to be updated to handle these evolutions.
A significant issue in data warehousing is the quality control of data. Both quality and consistency of data are major concerns. Although the data passes through a cleaning function during acquisition, quality and consistency remain significant issues for the database administrator. Melding data from heterogeneous and disparate sources is a major challenge given differences in naming, domain definitions, identification numbers, and the like. Every time a source database changes, the data warehouse administrator must consider the possible interactions with other elements of the warehouse.
Usage projections should be estimated conservatively prior to construction of the data warehouse and should be revised continually to reflect current requirements. As utilization patterns become clear and change over time, storage and access paths can be tuned to remain optimized for support of the organization's use of its warehouse. This activity should continue throughout the life of the warehouse in order to remain ahead of demand. The warehouse should also be designed to accommodate addition and attrition of data sources without major redesign. Sources and source data will evolve, and the warehouse must accommodate such change. Fitting the available source data into the data model of the warehouse will be a continual challenge, a task that is as much art as science. Because there is continual rapid change in technologies, both the requirements and capabilities of the warehouse will change considerably over time. Additionally, data warehousing technology itself will continue to evolve for some time, so component structures and functionalities will continually be upgraded. This certain change is excellent motivation for a fully modular design of components.
Administration of a data warehouse will require far broader skills than are needed for traditional database administration. A team of highly skilled technical experts with overlapping areas of expertise will likely be needed, rather than a single individual. Like database administration, data warehouse administration is only partly technical; a large part of the responsibility requires working effectively with all the members of the organization with an interest in the data warehouse. However difficult that can be at times for database administrators, it is that much more challenging for data warehouse administrators, as the scope of their responsibilities is considerably broader.
Design of the management function and selection of the management team for a data warehouse are crucial. Managing the data warehouse in a large organization will surely be a major task. Many commercial tools are already available to support management functions. Effective data warehouse management will certainly be a team function, requiring a wide set of technical skills, careful coordination, and effective leadership. Just as we must prepare for the evolution of the warehouse, we must also recognize that the skills of the management team will, of necessity, evolve with it.
26.1.7 Open Issues in Data Warehousing
There has been much marketing hyperbole surrounding the term "data warehouse"; the exaggerated expectations will probably subside, but the concept of integrated data collections to support sophisticated analysis and decision support will undoubtedly endure.
Data warehousing as an active research area is likely to see increased research activity in the near future as warehouses and data marts proliferate. Old problems will receive new emphasis; for example, data cleaning, indexing, partitioning, and views could receive renewed attention.
Academic research into data warehousing technologies will likely focus on automating aspects of the warehouse that currently require significant manual intervention, such as data acquisition, data quality management, selection and construction of appropriate access paths and structures, self-maintainability, functionality, and performance optimization. Application of active database functionality (see Section 23.1) into the warehouse is likely also to receive considerable attention. Incorporating domain and business rules appropriately into the warehouse creation and maintenance process may make it more intelligent, relevant, and self-governing.
Commercial software for data warehousing is already available from a number of vendors, focusing principally on management of the data warehouse and OLAP/DSS applications. Other aspects of data warehousing, such as design and data acquisition (especially cleaning), are being addressed primarily by teams of in-house IT managers and consultants.
26.2 Data Mining
26.2.1 An Overview of Data Mining Technology
26.2.2 Association Rules
26.2.3 Approaches to Other Data Mining Problems
26.2.4 Applications of Data Mining
26.2.5 State-of-the-Art of Commercial Data Mining Tools
Over the last three decades, many organizations have generated a large amount of machine-readable data in the form of files and databases. To process this data, we have the database technology available to us that supports query languages like SQL. The problem with SQL is that it is a structured language that assumes the user is aware of the database schema. SQL supports operations of relational algebra that allow a user to select from tables (rows and columns of data) or join related information from tables based on common fields. In the last section we saw that data warehousing technology affords several types of functionality: consolidation, aggregation, and summarization of data. It lets us view the same information along multiple dimensions. In this section, we will focus our attention on yet another very popular area of interest known as data mining. As the term connotes, data mining refers to the mining or discovery of new information in terms of patterns or rules from vast amounts of data. To be practically useful, data mining must be carried out efficiently on large files and databases. To date, it is not well integrated with database management systems.
We will briefly review the state of the art of this rather extensive field of data mining, which uses techniques from such areas as machine learning, statistics, neural networks, and genetic algorithms. We will highlight the nature of the information that is discovered, the types of problems faced in databases, and potential applications. We also survey the state of the art of a large number of commercial tools available (see Section 26.2.5) and describe a number of research advances that are needed to make this area viable.
26.2.1 An Overview of Data Mining Technology
Data Mining and Data Warehousing
Data Mining as a Part of the Knowledge Discovery Process
Goals of Data Mining and Knowledge Discovery
Types of Knowledge Discovered during Data Mining
In reports such as the very popular Gartner Report (Note 5), data mining has been hailed as one of the top technologies for the near future. In this section we relate data mining to the broader area called knowledge discovery and contrast the two by means of an illustrative example. We also discuss a number of data mining techniques and algorithms in Section 26.2.3.
Data Mining and Data Warehousing
The goal of a data warehouse is to support decision making with data. Data mining can be used in conjunction with a data warehouse to help with certain types of decisions. Data mining can be applied to operational databases with individual transactions. To make data mining more efficient, the data warehouse should have an aggregated or summarized collection of data. Data mining helps in extracting meaningful new patterns that cannot necessarily be found by merely querying or processing data or metadata in the data warehouse. Data mining applications should therefore be strongly considered early, during the design of a data warehouse. Also, data mining tools should be designed to facilitate their use in conjunction with data warehouses. In fact, for very large databases running into terabytes of data, successful use of data mining applications will depend first on the construction of a data warehouse.
Data Mining as a Part of the Knowledge Discovery Process
Knowledge Discovery in Databases, frequently abbreviated as KDD, typically encompasses more than data mining. The knowledge discovery process comprises six phases (Note 6): data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.
As an example, consider a transaction database maintained by a specialty consumer goods retailer. Suppose the client data includes a customer name, zip code, phone number, date of purchase, item code, price, quantity, and total amount. A variety of new knowledge can be discovered by KDD processing on this client database. During data selection, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected. The data cleansing process then may correct invalid zip codes or eliminate records with incorrect phone prefixes. Enrichment typically enhances the data with additional sources of information; for example, given the client names and phone numbers, the store may purchase other data about age, income, and credit rating and append them to each record. Data transformation and encoding may be done to reduce the amount of data. For instance, item codes may be grouped in terms of product categories into audio, video, supplies, electronic gadgets, camera, accessories, and so on. Zip codes may be aggregated into geographic regions, incomes may be divided into ten ranges, and so on (a small encoding sketch appears after this example). Earlier, in Figure 26.01, we showed a step called cleaning as a precursor to the data warehouse creation. If data mining is based on an existing warehouse for this retail store chain, we would expect that the cleaning has already been applied. It is only after such preprocessing that data mining techniques are used to mine different rules and patterns. For example, the result of mining may be to discover the following:
• Association rules—e.g., whenever a customer buys video equipment, he or she also buys another electronic gadget.
• Sequential patterns—e.g., suppose a customer buys a camera, and within three months he or she buys photographic supplies, and within six months an accessory item. A customer who buys more than twice in the lean periods may be likely to buy at least once during the Christmas period.
• Classification trees—e.g., customers may be classified by frequency of visits, by types of financing used, by amount of purchase, or by affinity for types of items, and some revealing statistics may be generated for such classes.
We can see that many possibilities exist for discovering new knowledge about buying patterns, relating factors such as age, income group, and place of residence to what and how much the customers purchase. This information can then be utilized to plan additional store locations based on demographics, to run store promotions, to combine items in advertisements, or to plan seasonal marketing strategies. As this retail-store example shows, data mining must be preceded by significant data preparation before it can yield useful information that can directly influence business decisions.
The results of data mining may be reported in a variety of formats, such as listings, graphic outputs, summary tables, or visualizations.
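The encoding sketch mentioned above, with hypothetical category groupings, region mappings, and income boundaries (not taken from the text), might look like this:

# Hypothetical groupings used to encode (and thereby shrink) the raw data.
ITEM_CATEGORIES = {"A100": "audio", "V220": "video", "S310": "supplies"}
ZIP_REGIONS = {"9": "West", "1": "Northeast", "3": "Southeast"}  # leading digit -> region

def encode_record(record):
    """Replace fine-grained values with coarser codes before mining."""
    return {
        "category": ITEM_CATEGORIES.get(record["item_code"], "other"),
        "region": ZIP_REGIONS.get(record["zip"][0], "other"),
        # Bucket income into one of ten ranges of $20,000 each.
        "income_range": min(int(record["income"]) // 20000, 9),
    }

print(encode_record({"item_code": "V220", "zip": "94103", "income": 85000}))
# {'category': 'video', 'region': 'West', 'income_range': 4}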
Goals of Data Mining and Knowledge Discovery
Broadly speaking, the goals of data mining fall into the following classes: prediction, identification, classification, and optimization.
• Prediction—Data mining can show how certain attributes within the data will behave in the future. Examples of predictive data mining include the analysis of buying transactions to predict what consumers will buy under certain discounts, how much sales volume a store would generate in a given period, and whether deleting a product line would yield more profits. In such applications, business logic is used coupled with data mining. In a scientific context, certain seismic wave patterns may predict an earthquake with high probability.
• Identification—Data patterns can be used to identify the existence of an item, an event, or an activity. For example, intruders trying to break into a system may be identified by the programs executed, files accessed, and CPU time per session. In biological applications, the existence of a gene may be identified by certain sequences of nucleotide symbols in the DNA sequence. The area known as authentication is a form of identification; it ascertains whether a user is indeed a specific user or one from an authorized class, and it involves a comparison of parameters or images or signals against a database.
• Classification—Data mining can partition the data so that different classes or categories can be identified based on combinations of parameters. For example, customers in a supermarket can be categorized into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, and infrequent shoppers. This classification may be used in different analyses of customer buying transactions as a post-mining activity. Sometimes classification based on common domain knowledge is used as an input to decompose the mining problem and make it simpler. For instance, health foods, party foods, or school lunch foods are distinct categories in the supermarket business. It makes sense to analyze relationships within and across categories as separate problems. Such categorization may be used to encode the data appropriately before subjecting it to further data mining.
• Optimization—One eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales or profits under a given set of constraints. As such, this goal of data mining resembles the objective function used in operations research problems that deal with optimization under constraints.
The term data mining is currently used in a very broad sense. In some situations it includes statistical analysis and constrained optimization as well as machine learning. There is no sharp line separating data mining from these disciplines. It is beyond our scope, therefore, to discuss in detail the entire range of applications that make up this vast body of work.
Types of Knowledge Discovered during Data Mining
The term "knowledge" is very broadly interpreted as involving some degree of intelligence Knowledge
is often classified as inductive and deductive We discussed discovery of deductive knowledge in Chapter 25 Data mining addresses inductive knowledge Knowledge can be represented in many forms: in an unstructured sense, it can be represented by rules, or propositional logic In a structured form, it may be represented in decision trees, semantic networks, neural networks, or hierarchies of classes or frames The knowledge discovered during data mining can be described in five ways, as follows
1. Association rules—These rules correlate the presence of a set of items with another range of values for another set of variables. Examples: (1) When a female retail shopper buys a handbag, she is likely to buy shoes. (2) An X-ray image containing characteristics a and b is likely to also exhibit characteristic c.
2. Classification hierarchies—The goal is to work from an existing set of events or transactions to create a hierarchy of classes. Examples: (1) A population may be divided into five ranges of credit worthiness based on a history of previous credit transactions. (2) A model may be developed for the factors that determine the desirability of location of a store on a 1–10 scale. (3) Mutual funds may be classified based on performance data using characteristics such as growth, income, and stability.
3. Sequential patterns—A sequence of actions or events is sought. Example: If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months. Detection of sequential patterns is equivalent to detecting association among events with certain temporal relationships.
4. Patterns within time series—Similarities can be detected within positions of a time series. Three examples follow with stock market price data as a time series: (1) Stocks of a utility company, ABC Power, and a financial company, XYZ Securities, show the same pattern during 1998 in terms of closing stock price. (2) Two products show the same selling pattern in summer but a different one in winter. (3) A pattern in solar magnetic wind may be used to predict changes in earth atmospheric conditions.
5. Categorization and segmentation—A given population of events or items can be partitioned (segmented) into sets of "similar" elements. Examples: (1) An entire population of treatment data on a disease may be divided into groups based on the similarity of side effects produced. (2) The adult population in the United States may be categorized into five groups from "most likely to buy" to "least likely to buy" a new product. (3) The web accesses made by a collection of users against a set of documents (say, in a digital library) may be analyzed in terms of the keywords of documents to reveal clusters or categories of users.
For most applications, the desired knowledge is a combination of the above types. We expand on each of the above knowledge types in the following subsections.
26.2.2 Association Rules
Basic Algorithms for Finding Association Rules
Association Rules among Hierarchies
Negative Associations
Additional Considerations for Association Rules
One of the major technologies in data mining involves the discovery of association rules. The database is regarded as a collection of transactions, each involving a set of items. A common example is that of market-basket data. Here the market basket corresponds to what a consumer buys in a supermarket during one visit. Consider four such transactions in a random sample:
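A small illustrative sample, consistent with the support and confidence figures quoted below (the specific transaction IDs and items are hypothetical), would be:
Transaction 101: milk, bread, cookies, juice
Transaction 102: milk, juice
Transaction 103: milk, eggs
Transaction 104: bread, cookies, coffee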
An association rule is of the form X ⇒ Y, where X = {x1, x2, ..., xn} and Y = {y1, y2, ..., ym} are sets of items, with xi and yj being distinct items for all i and all j. This association states that if a customer buys X, he or she is also likely to buy Y. In general, any association rule has the form LHS ⇒ RHS, where LHS (left-hand side) and RHS (right-hand side) are sets of items. Association rules should supply both support and confidence.
The support for the rule LHS ⇒ RHS is the percentage of transactions that hold all of the items in the union, the set LHS ∪ RHS. If the support is low, it implies that there is no overwhelming evidence that items in LHS ∪ RHS occur together, because the union happens in only a small fraction of transactions. The rule Milk ⇒ Juice has 50% support, while Bread ⇒ Juice has only 25% support. Another term for support is prevalence of the rule.
To compute confidence we consider all transactions that include the items in LHS. The confidence for the association rule LHS ⇒ RHS is the percentage (fraction) of such transactions that also include RHS. Another term for confidence is strength of the rule.
For Milk ⇒ Juice, the confidence is 66.7% (meaning that, of the three transactions in which milk occurs, two contain juice), and Bread ⇒ Juice has 50% confidence (meaning that one of the two transactions containing bread also contains juice).
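As a minimal sketch, using an illustrative set of transactions consistent with the figures quoted in the text (the specific items are hypothetical), support and confidence can be computed directly from their definitions:

# Illustrative transactions (hypothetical, consistent with the figures in the text).
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Fraction of transactions containing LHS that also contain RHS."""
    return support(lhs | rhs) / support(lhs)

print(support({"milk", "juice"}))        # 0.5   -> Milk => Juice has 50% support
print(support({"bread", "juice"}))       # 0.25  -> Bread => Juice has 25% support
print(confidence({"milk"}, {"juice"}))   # 0.666... -> 66.7% confidence
print(confidence({"bread"}, {"juice"}))  # 0.5   -> 50% confidence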
As we can see, support and confidence do not necessarily go hand in hand. The goal of mining association rules, then, is to generate all possible rules that exceed some minimum user-specified support and confidence thresholds. The problem is thus decomposed into two subproblems: