Fundamentals of Database Systems, 3rd Edition (Part 9)



constants, or if a variable is repeated in the rule head, it can easily be rectified: a constant c is replaced by a variable X, and a predicate equal(X, c) is added to the rule body. Similarly, if a variable Y appears twice in a rule head, one of those occurrences is replaced by another variable Z, and a predicate equal(Y, Z) is added to the rule body.

The evaluation of a nonrecursive query can be expressed as a tree whose leaves are the base relations. What is needed is appropriate application of the relational operations of SELECT, PROJECT, and JOIN, together with the set operations of UNION and SET DIFFERENCE, until the predicate in the query gets evaluated. An outline of an inference algorithm GET_EXPR(Q) that generates a relational expression for computing the result of a Datalog query Q = p(arg1, arg2, ..., argn) can informally be stated as follows:

1. Locate all rules S whose head involves the predicate p. If there are no such rules, then p is a fact-defined predicate corresponding to some database relation Rp; in this case, one of the following expressions is returned and the algorithm terminates (we use the notation $i to refer to the name of the ith attribute of relation Rp):

a. If all arguments are distinct variables, the relational expression returned is Rp.

b. If some arguments are constants or if the same variable appears in more than one argument position, the expression returned is SELECT<condition>(Rp), where the selection <condition> is a conjunctive condition made up of a number of simple conditions connected by AND and constructed as follows:

i. If a constant c appears as argument i, include a simple condition ($i = c) in the conjunction.

ii. If the same variable appears in both argument positions j and k, include a condition ($j = $k) in the conjunction.

c. For an argument that is not present in any predicate, a unary relation containing values that satisfy all conditions is constructed. Since the rule is assumed to be safe, this unary relation must be finite.

2. At this point, one or more rules Si, i = 1, 2, ..., n, n > 0, exist with predicate p as their head. For each such rule Si, generate a relational expression as follows:

a. Apply selection operations on the predicates in the RHS for each such rule, as discussed in Step 1.

b. A natural join is constructed among the relations that correspond to the predicates in the body of the rule Si over the common variables. For arguments that gave rise to the unary relations in Step 1(c), the corresponding relations are brought as members into the natural join. Let the resulting relation from this join be Rs.

c. If any built-in predicate X θ Y was defined over the arguments X and Y, the result of the join is subjected to an additional selection:

SELECT<X θ Y>(Rs)

d. Repeat Step 2(c) until no more built-in predicates apply.

3. Take the UNION of the expressions generated in Step 2 (if more than one rule exists with predicate p as its head).
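As an illustration of Step 1, the following Python sketch builds the selection expression for a fact-defined predicate from its argument list. It is only a sketch of the idea; the function name, relation names, and string layout are invented for illustration, and constants are assumed to be written in lowercase while variables begin with an uppercase letter, following the usual Datalog convention.

def base_expression(rel_name, args):
    """Step 1 of GET_EXPR: return Rp or SELECT<condition>(Rp) for a
    fact-defined predicate p whose relation is rel_name.
    args is the query's argument list, e.g. ["james", "Y"]."""
    def is_variable(a):
        return a[:1].isupper()            # Datalog convention: variables start uppercase

    conditions = []
    first_position = {}                   # variable -> first argument position (1-based)
    for i, a in enumerate(args, start=1):
        if not is_variable(a):            # case 1(b)i: a constant in position i
            conditions.append(f"${i} = '{a}'")
        elif a in first_position:         # case 1(b)ii: a repeated variable
            conditions.append(f"${first_position[a]} = ${i}")
        else:
            first_position[a] = i

    if not conditions:                    # case 1(a): all arguments are distinct variables
        return rel_name
    return f"SELECT[{' AND '.join(conditions)}]({rel_name})"

# e.g. base_expression("SUPERVISE", ["james", "Y"]) -> "SELECT[$1 = 'james'](SUPERVISE)"
# and  base_expression("EQUAL", ["X", "X"])         -> "SELECT[$1 = $2](EQUAL)"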

25.5.4 Concepts for Recursive Query Processing in Datalog


Naive Strategy

Seminaive Strategy

The Magic Set Rule Rewriting Technique

Query processing can be separated into two approaches:

• Pure evaluation approach: creating a query evaluation plan that produces an answer to the query.

• Rule rewriting approach: optimizing the plan into a more efficient strategy.

Many approaches have been presented for both recursive and nonrecursive queries. We discussed an approach to nonrecursive query evaluation earlier. Here we first define some terminology for recursive queries, then discuss the naive and seminaive approaches to query evaluation—which generate simple plans—and then present the magic set approach—which is an optimization based on rule rewriting.

We have already seen examples involving recursive rules, where the same predicate occurs in the head and in the body of a rule. Another example is

ancestor(X,Y) :- ancestor(X,Z), parent(Z,Y)

which states that Y is an ancestor of X if Z is an ancestor of X and Y is a parent of Z. It is used in conjunction with the rule

ancestor(X,Y) :- parent(X,Y)

which states that if Y is a parent of X, then Y is an ancestor of X.

A rule is said to be linearly recursive if the recursive predicate appears once and only once in the RHS of the rule. For example,

sg(X,Y) :- parent(X,XP), parent(Y,YP), sg(XP,YP)

is a linear rule in which the predicate sg (same-generation cousins) is used only once in the RHS. The rule states that X and Y are same-generation cousins if their parents are same-generation cousins. The rule


ancestor(X,Y) :- ancestor(X,Z), parent(Z,Y)

is called left linearly recursive, while the rule

ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y)

is called right linearly recursive.

Notice that the rule

ancestor(X,Y) :- ancestor(X,Z), ancestor(Z,Y)

is not linearly recursive. It is believed that most "real-life" rules can be described as linear recursive rules; algorithms have been defined to execute linear sets of rules efficiently. The preceding definitions become more involved when a set of rules with predicates that occur on both the LHS and the RHS of rules is considered.

A predicate whose relation is stored in the database is called an extensional database (EDB) predicate, while a predicate for which the corresponding relation is defined by logical rules is called an intensional database (IDB) predicate. Given a Datalog program with relations corresponding to the predicates, the "if" symbol, :-, may be replaced by an equality to form Datalog equations, without any loss of meaning. The resulting set of Datalog equations could potentially have many solutions. Given a set of relations for the EDB predicates, say R1, R2, ..., Rn, a fixed point of the Datalog equations is a solution for the relations corresponding to the IDB predicates of those equations.

The fixed point with respect to the given EDB relations, along with those relations, forms a model of the rules from which the Datalog equations were derived. However, it is not true that every model of a set of Datalog rules is a fixed point of the corresponding Datalog equations, because the model may have "too many" facts. It turns out that each Datalog program has a unique minimal model containing any given EDB relations, and this also corresponds to the unique minimal fixed point with respect to those EDB relations.

Formally, given a family of solutions Si = P1(i), ..., Pm(i) to a given set of equations, the least fixed point of the set of equations is obtained by finding the solution whose corresponding relations are the smallest for all relations. For example, we say S1 ⊆ S2 if relation Pk(1) is a subset of relation Pk(2) for all k, 1 ≤ k ≤ m. Fixpoint theory was first developed in the field of recursion theory as a tool for explaining recursive functions. Since Datalog has the ability to express recursion, fixpoint theory is well suited for describing the semantics of recursive rules.

For example, if we represent a directed graph by the predicate edge(X,Y), such that edge(X,Y) is true if and only if there is an edge from node X to node Y in the graph, the paths in the graph may be expressed by the following rules:


path(X,Y) :- edge(X,Y)

path(X,Y) :- path(X,Z), path(Z,Y)

Notice that there are other ways of defining paths recursively. Let us assume that relations P and A correspond to the predicates path and edge in the preceding rules. The transitive closure of the edge relation A contains all possible pairs of nodes that have a path between them, and it corresponds to the least fixed-point solution of the equations that result from the preceding rules (Note 6). These rules can be turned into a single equation for the relation P corresponding to the predicate path:

P(X,Y) = A(X,Y) ∪ π<X,Y>(P(X,Z) ⋈ P(Z,Y))

Suppose that the nodes are 3, 4, 5 and A = {(3,4), (4,5)}. From the first and second rules we can infer that (3,4), (4,5), and (3,5) are in P. We need not look for any other paths, because P = {(3,4),(4,5),(3,5)} is a solution of the above equation:

{(3,4),(4,5),(3,5)} = {(3,4),(4,5)} ∪ π<X,Y>({(3,4),(4,5),(3,5)} ⋈ {(3,4),(4,5),(3,5)})

This solution constitutes a proof-theoretic meaning of the rules, as it was derived from the EDB relation A using just the rules. It is also the minimal model of the rules, or the least fixed point of the equation.
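To make the fixed-point claim concrete, here is a small Python check of the equation for this three-node example, with relations represented as sets of pairs (an illustrative sketch only):

# Relations as sets of pairs; nodes 3, 4, 5 are plain integers.
A = {(3, 4), (4, 5)}             # the edge relation
P = {(3, 4), (4, 5), (3, 5)}     # the claimed least fixed point for path

# Right-hand side of the equation: A union project_{X,Y}(P(X,Z) join P(Z,Y))
rhs = A | {(x, y) for (x, z1) in P for (z2, y) in P if z1 == z2}

assert rhs == P                  # P satisfies P = A ∪ π(P ⋈ P), so it is a fixed point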

For evaluating a set of Datalog rules (equations) that may contain recursive rules, a large number of strategies have been proposed, details of which are beyond our scope. Here we illustrate three important techniques: the naive strategy, the seminaive strategy, and the use of magic sets.

Naive Strategy

The naive evaluation method is a pure-evaluation, bottom-up strategy that computes the least model of a Datalog program. It is an iterative strategy: at each iteration, all rules are applied to the set of tuples produced thus far to generate all implicit tuples. This iterative process continues until no more new tuples can be generated.

The naive evaluation process does not take into account query patterns. As a result, a considerable amount of redundant computation is done. We present two versions of the naive method, called the Jacobi and Gauss-Seidel solution methods; these methods get their names from well-known algorithms for the iterative solution of systems of equations in numerical analysis.

Assume the following system of relational equations, formed by replacing the :- symbol by an equality sign in a Datalog program:

Ri = Ei(R1, R2, ..., Rn)

The Jacobi method proceeds as follows. Initially, the variable relations Ri are set equal to the empty set. Then the computation Ri = Ei(R1, R2, ..., Rn), i = 1, ..., n, is iterated until none of the Ri changes between two consecutive iterations (i.e., until the Ri reach a fixpoint).

Algorithm 25.1 Jacobi naive strategy

Input: A system of algebraic equations and an EDB.

Output: The values of the variable relations R1, R2, ..., Rn.
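A minimal Python sketch of this Jacobi-style iteration for the ancestor example that follows is shown below, with each relation represented as a set of tuples. It is an illustration of the idea, not a verbatim transcription of Algorithm 25.1.

def jacobi_ancestor(parent):
    """Naive (Jacobi) evaluation of A = P ∪ π_{X,Y}(A ⋈ P) to a fixpoint."""
    A = set()                                      # A(0) = empty set
    while True:
        S1 = A                                     # remember the previous value
        A = set(parent) | {(x, y)                  # P ∪ project(A(X,Z) join P(Z,Y))
                           for (x, z1) in S1
                           for (z2, y) in parent if z1 == z2}
        if A == S1:                                # no change between iterations: fixpoint
            return A

P = {("bert", "alice"), ("bert", "george"), ("alice", "derek"),
     ("alice", "pat"), ("derek", "frank")}
# jacobi_ancestor(P) adds (bert,derek), (bert,pat), (alice,frank), (bert,frank) to P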


The convergence of the Jacobi method can be slightly improved if, at each step k, in order to compute the new value Ri(k), we substitute in Ei the values Rj(k) that have just been computed in the same iteration instead of the old values Rj(k-1). This variant of the Jacobi method is called the Gauss-Seidel method, which produces the same result as the Jacobi algorithm. Consider the following example, where ancestor(X, Y) means X is an ancestor of Y and parent(X, Y) means X is a parent of Y:

ancestor(X,Y) :- parent(X,Y)

ancestor(X,Y) :- ancestor(X,Z), parent(Z,Y)

If we define a relation A for the predicate ancestor and a relation P for the predicate parent, the Datalog equation for the above rules can be written in the form:

A(X,Y) = π<X,Y>(A(X,Z) ⋈ P(Z,Y)) ∪ P(X,Y)

Suppose the EDB is given as P = {(bert, alice), (bert, george), (alice, derek), (alice, pat), (derek, frank)}. Let us follow the Jacobi algorithm. The parent tree looks as in Figure 25.09.

Initially, we set A(0) = ∅, enter the repeat loop, and set condition = true. We then initialize S1 = A = ∅, then compute the first value of A. Since the first join involves an empty relation, we get

A(1) = P = {(bert, alice), (bert, george), (alice, derek), (alice, pat), (derek, frank)}

A(1) includes parents as ancestors. Since A(1) ≠ S1, condition = false. We therefore enter the second iteration with S1 set to A(1). Computing the value of A again, we get

A(2) = {(bert,alice), (bert,george), (alice,derek), (alice,pat), (derek,frank), (bert,derek), (bert,pat), (alice,frank)}


It can be seen that A(2) = A(1) ∪ {(bert,derek), (bert,pat), (alice,frank)}. Note that A(2) now includes grandparents as ancestors, besides parents. Since A(2) ≠ S1, we iterate again, setting S1 to A(2):

A(3) = {(bert,alice), (bert,george), (alice,derek), (alice,pat), (derek,frank), (bert,derek), (bert,pat), (alice,frank), (bert,frank)}

Now, A(3) = A(2) ∪ {(bert,frank)}. A(3) now has great-grandparents included among the ancestors. Since A(3) is different from S1, we enter the next iteration, setting S1 = A(3). We now get

A(4) = {(bert,alice), (bert,george), (alice,derek), (alice,pat), (derek,frank), (bert,derek), (bert,pat), (alice,frank), (bert,frank)}

Finally, A(4) = A(3) = S1, so the evaluation is finished. Intuitively, from the above parental hierarchy, it is obvious that all ancestors have been computed.

Seminaive Strategy

Seminaive evaluation is a bottom-up technique designed to eliminate redundancy in the evaluation of tuples at different iterations. This method does not use any information about the structure of the program. There are two possible settings of the seminaive algorithm: the (pure) seminaive and the pseudo rewriting seminaive.

Consider the Jacobi algorithm. Let Ri(k) be the temporary value of relation Ri at iteration step k. The differential of Ri at step k of the iteration is defined as

Di(k) = Ri(k) - Ri(k-1)

When the whole system is linear, Di can be substituted for Ri in the Jacobi or Gauss-Seidel algorithms: the result is obtained by the union of the newly obtained term and the old one.


Algorithm 25.2 Seminaive strategy

Input: A system of algebraic equations and an EDB.

Output: The values of the variable relations R1, R2, ..., Rn.
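A minimal Python sketch of the seminaive (differential) iteration for the same ancestor example follows; relations are sets of tuples, and only the tuples that are new at the previous step are joined with parent at each iteration. This is an illustration of the idea rather than a verbatim rendering of Algorithm 25.2.

def seminaive_ancestor(parent):
    A = set(parent)            # A(1) = D(1) = P, from the nonrecursive rule
    delta = set(parent)        # the differential: tuples new at the last step
    while delta:
        # join ONLY the new tuples with parent, then discard those already known
        derived = {(x, y) for (x, z1) in delta for (z2, y) in parent if z1 == z2}
        delta = derived - A    # D(k) = newly derived tuples minus A(k-1)
        A |= delta             # A(k) = A(k-1) ∪ D(k)
    return A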

D(0) = ∅, A(0) = ∅

D(1) = P = {(bert,alice), (bert,george), (alice,derek), (alice,pat), (derek,frank)}

Hence, A(1) = A(0) ∪ D(1) = P, and the next differential is

D(2) = π<X,Y>(D(1)(X,Z) ⋈ P(Z,Y)) - A(1)
     = {(bert,derek), (bert,pat), (alice,frank)} - A(1)
     = {(bert,derek), (bert,pat), (alice,frank)}

The Magic Set Rule Rewriting Technique

The problem addressed by the magic sets rule rewriting technique is that frequently a query asks not for the entire relation corresponding to an intensional predicate but for a small subset of this relation. Consider the following program:


sg(X,Y) :- flat(X,Y)

sg(X,Y) :- up(X,U), sg(U,V), down(V,Y)

Here, sg is a predicate ("same-generation cousin"), and the head of each of the two rules is the atomic formula sg(X, Y). The other predicates found in the rules are flat, up, and down. These are presumably stored extensionally as facts, while the relation for sg is intensional—that is, defined only by the rules. For a query like sg(john, Z)—that is, "who are the same-generation cousins of John?"—the answer need examine only the part of the database that is relevant—namely, the part that involves individuals somehow connected to John.

A top-down, or backward-chaining, search would start from the query as a goal and use the rules from head to body to create more goals; none of these goals would be irrelevant to the query, although some might cause us to explore paths that happen to "dead-end." On the other hand, a bottom-up, or forward-chaining, search, working from the bodies of the rules to the heads, would cause us to infer sg facts that would never even be considered in the top-down search. Yet bottom-up evaluation is desirable because it avoids the problems of looping and repeated computation that are inherent in the top-down approach, and it allows us to use set-at-a-time operations, such as relational joins.

Magic sets rule rewriting is a technique that allows us to rewrite the rules as a function of the query form only—that is, it considers which arguments of the predicate are bound to constants and which are variable—so that the advantages of the top-down and bottom-up methods are combined. The technique focuses on the goal inherent in top-down evaluation but combines this with the looping freedom, easy termination testing, and efficient evaluation of bottom-up evaluation. Instead of giving the method, of which many variations are known and used in practice, we explain the idea with an example.

Given the previously stated rules and the query sg(john, Z), a typical magic sets transformation of the rules would be

sg(X,Y) :- magic-sg(X), flat(X,Y)

sg(X,Y) :- magic-sg(X), up(X,U), sg(U,V), down(V,Y)

magic-sg(U) :- magic-sg(X), up(X,U)
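A Python sketch of how the rewritten program restricts a bottom-up evaluation is given below. It assumes, as the standard transformation does (the seed is not shown in the rules above), that the query binding contributes a seed fact magic-sg(john); the relation arguments and the driver function name are invented for illustration.

def eval_magic_sg(flat, up, down, query_constant):
    """Bottom-up evaluation of the magic-rewritten sg program for sg(c, Z)."""
    # Saturate magic-sg(U) :- magic-sg(X), up(X,U), seeded with the query constant.
    magic = {query_constant}
    while True:
        new = {u for (x, u) in up if x in magic} - magic
        if not new:
            break
        magic |= new

    # Evaluate sg bottom-up, but only for first arguments that are in the magic set.
    sg = set()
    while True:
        derived = {(x, y) for (x, y) in flat if x in magic}            # first sg rule
        derived |= {(x, y)                                             # second sg rule
                    for (x, u) in up if x in magic
                    for (u2, v) in sg if u2 == u
                    for (v2, y) in down if v2 == v}
        new = derived - sg
        if not new:
            break
        sg |= new
    return {y for (x, y) in sg if x == query_constant}

Facts about individuals not reachable from john through up never enter the magic set, so the rules never fire on them; this is the combination of top-down goal-directedness with bottom-up evaluation described above.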


While the magic sets technique was originally developed to deal with recursive queries, it is applicable to nonrecursive queries as well. Indeed, it has been adapted to deal with SQL queries (which contain features such as grouping, aggregation, arithmetic conditions, and multiset relations that are not present in pure logic queries), and it has been found to be useful for evaluating nonrecursive "nested" SQL queries.

25.5.5 Stratified Negation

A deductive database query language can be enhanced by permitting negated literals in the bodies of rules in programs. However, the important property of rules, called the minimal model, which we discussed earlier, does not hold. In the presence of negated literals, a program may not have a minimal or least model. For example, the program

p(a) :- not p(b)

has two minimal models: {p(a)} and {p(b)}.

A detailed analysis of the concept of negation is beyond our scope. But for practical purposes, we next discuss stratified negation, an important notion used in deductive system implementations.

The meaning of a program with negation is usually given by some "intended" model. The challenge is to develop algorithms for choosing an intended model that does the following:

1. Makes sense to the user of the rules.

2. Allows us to answer queries about the model efficiently.

In particular, it is desirable that the model work well with the magic sets transformation, in the sense that we can modify the rules by some suitable generalization of magic sets, and the resulting rules allow (only) the relevant portion of the selected model to be computed efficiently. (Alternatively, other efficient evaluation techniques must be developed.)

One important class of negation that has been extensively studied is stratified negation. A program is stratified if there is no recursion through negation. Programs in this class have a very intuitive semantics and can be efficiently evaluated. The example that follows describes a stratified program. Consider the following program P2:

r1: ancestor(X,Y) :- parent(X,Y)

r2: ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y)

r3: nocyc(X,Y) :- ancestor(X,Y), not(ancestor(Y,X))


Notice that the third rule has a negative literal in its body. This program is stratified because the definition of the predicate nocyc depends (negatively) on the definition of ancestor, but the definition of ancestor does not depend on the definition of nocyc. We are not equipped to give a more formal definition without additional notation and definitions. A bottom-up evaluation of P2 would first compute a fixed point of rules r1 and r2 (the rules defining ancestor). Rule r3 is applied only when all the ancestor facts are known.
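A sketch of this stratum-by-stratum evaluation in Python, with parent given as a set of pairs (an illustration under the same conventions as the earlier sketches):

def eval_p2(parent):
    # Stratum 1: rules r1 and r2 -- compute the ancestor fixpoint; no negation involved.
    ancestor = set(parent)
    while True:
        new = {(x, y) for (x, z1) in parent for (z2, y) in ancestor if z1 == z2} - ancestor
        if not new:
            break
        ancestor |= new
    # Stratum 2: rule r3 may now safely negate the fully computed ancestor relation.
    nocyc = {(x, y) for (x, y) in ancestor if (y, x) not in ancestor}
    return ancestor, nocyc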

A natural extension of stratified programs is the class of locally stratified programs. Intuitively, a program P is locally stratified for a given database if, when we substitute constants for variables in all possible ways, the resulting instantiated rules do not have any recursion through negation.

25.6 Deductive Database Systems

25.6.1 The LDL System

25.6.2 NAIL!

25.6.3 The CORAL System

The founding event of the deductive database field can be considered to be the Toulouse workshop on "Logic and Databases" organized by Gallaire, Minker, and Nicolas in 1977. The next period of explosive growth started with the setting up of MCC (the Microelectronics and Computer Technology Corporation), which was a reaction to the Japanese Fifth Generation Project. Several experimental deductive database systems have been developed, and a few have been commercially deployed. In this section we briefly review three different implementations of the ideas presented so far: LDL, NAIL!, and CORAL.

25.6.1 The LDL System

Background, Motivation, and Overview

The LDL Data Model and Language

The Logic Data Language (LDL) project at Microelectronics and Computer Technology Corporation (MCC) was started in 1984 with two primary objectives:

• To develop a system that extends the relational model yet exploits some of the desirable features of an RDBMS (relational database management system)

• To enhance the functionality of a DBMS so that it works as a deductive DBMS and also supports the development of general-purpose applications

The resulting system is now a deductive DBMS made available as a product. In this section, we briefly survey the highlights of the technical approach taken by LDL and consider its important features.

Background, Motivation, and Overview

The design of the LDL language may be viewed as a rule-based extension to domain calculus-based languages (see Section 9.4). The LDL system has tried to combine the expressive capability of Prolog with the functionality and facility of a general-purpose DBMS. The main drawback experienced by earlier systems that coupled Prolog with an RDBMS is that Prolog is navigational (tuple-at-a-time), whereas in RDBMSs the user formulates a correct query and leaves the optimization of query execution to the system. The navigational nature of Prolog is manifested in the ordering of rules and goals to achieve an optimal execution and termination. Two options are available:

• Make Prolog more "database-like" by adding navigational database management features. (For an example of a navigational query language, see the network model DML in Section C.4 of Appendix C.)

• Modify Prolog into a general-purpose declarative logic language

The latter option was chosen in LDL, yielding a language that is different from Prolog in its constructs and style of programming in the following ways:

• Rules are compiled in LDL

• There is a notion of a "schema" of the fact base in LDL at compile time. The fact base is freely updated at run time. Prolog, on the other hand, treats facts and rules identically, and it subjects facts to interpretation when they are changed.

• LDL does not follow the resolution and unification technique used in Prolog systems that are based on backward chaining.

• The LDL execution model is simpler, based on the operation of matching and the computation of "least fixed points." These operators, in turn, use simple extensions to the relational algebra.

The first LDL implementation, completed in 1987, was based on a language called FAD. A later implementation, completed in 1988, is called SALAD and underwent further changes as it was tested against the "real-life" applications described in Section 25.8. The current prototype is an efficient portable system for UNIX that assumes a single-tuple get-next interface between the compiled LDL program and an underlying fact manager.

The LDL Data Model and Language

With the design philosophy of LDL being to combine the declarative style of relational languages with the expressive power of Prolog, constructs in Prolog such as negation, set-of, updates, and cut have been dropped. Instead, the declarative semantics of Horn clauses was extended to support complex terms through the use of function symbols, called functors in Prolog.

A particular employee record can therefore be defined as follows:

Employee (Name (John Doe), Job(VP),

Education ({(High school, 1961),

(College (Fergusson, bs, physics), 1965),

(College (Michigan, phd, ie), 1976)}))

In the preceding record, VP is a simple term, whereas education is a complex term that consists of a term for high school and a nested relation containing the term for college and the year of graduation. LDL thus supports complex objects with an arbitrarily complex structure, including lists, set terms, trees, and nested relations. We can think of a compound term as a Prolog structure with the function symbol as the functor.

LDL allows updates in the bodies of rules. For instance, the following rule derives happy(Dept, Raise, Name) and, as a side effect, deletes the old emp fact and inserts one with the new salary Newsal:

happy(Dept, Raise, Name) <-
    emp(Name, Dept, Sal), Newsal = Sal + Raise,
    -emp(Name, Dept, -), +emp(Name, Dept, Newsal)

Even though LDL’s semantics is defined in a bottom-up fashion (for example, via stratification), the implementor can use any execution that is faithful to this declarative semantics. In particular, the execution can proceed bottom-up or top-down, or it may be a hybrid execution. These choices enable the compiler/optimizer to be selective in customizing the most appropriate modes of execution for the given program. The LDL compiler and optimizer can select from among several strategies: pipelined or lazy pipelined execution, and materialized or lazy materialized execution.

25.6.2 NAIL!

The NAIL! (Not Another Implementation of Logic!) project was started at Stanford University in 1985. The initial goal was to study the optimization of logic by using the database-oriented "all-solutions" model. The aim of the project was to support the optimal execution of Datalog goals over an RDBMS. Assuming that a single workable strategy was inappropriate for all logic programs in general, an extensible architecture was developed, which could be enhanced through progressive additions.


In collaboration with the MCC group, this project was responsible for the idea of magic sets and the first work on regular recursions. In addition, many important contributions to coping with negation and aggregation on logical rules were made by the project, including stratified negation, well-founded negation, and modularly stratified negation. The architecture of NAIL! is illustrated in Figure 25.10.

The preprocessor rewrites the source NAIL! program by isolating "negation" and "set" operators and by replacing disjunction with several conjunctive rules. After preprocessing, the NAIL! program is represented through its predicates and rules. The strategy selection module takes as input the user’s goal and produces as output the best execution strategies for solving the user’s goal and all the other goals related to it, using the internal language ICODE.

The ICODE statements produced as a result of the strategy selection process are optimized and then executed through an interpreter, which translates ICODE retrieval statements to SQL when needed.

An initial prototype system was built but later abandoned because the purely declarative paradigm was found to be unworkable for many applications. The revised system uses a core language, called GLUE, which is essentially single logical rules, with the power of SQL statements, wrapped in conventional language constructs such as loops, procedures, and modules. The original NAIL! language becomes a view mechanism for GLUE; it permits fully declarative specifications in situations where declarativeness is appropriate.

25.6.3 The CORAL System

The CORAL system, which was developed at the University of Wisconsin at Madison, builds on experience gained from the LDL project. Like LDL, the system provides a declarative language based on Horn clauses with an open architecture. There are many important differences, however, in both the language and its implementation. The CORAL system can be seen as a database programming language that combines important features of SQL and Prolog.

From a language standpoint, CORAL adapts LDL’s set-grouping construct to be closer to SQL’s GROUP BY construct. For example, consider

budget(Dname,sum(<Sal>)) :- dept(Dname,Ename,Sal)

This rule computes one budget tuple for each department, and each salary value is added as often as there are people with that salary in the given department. In LDL, the grouping and the sum operation cannot be combined in one step; more importantly, the grouping is defined to produce a set of salaries for each department. Therefore, computing the budget is harder in LDL. A related point is that SQL supports a multiset semantics for queries when the DISTINCT clause is not specified. CORAL supports such a multiset semantics as well. Thus the following rule can be defined to compute either a set of tuples or a multiset of tuples in CORAL, as occurs in SQL:


budget2(Dname,Sal) :- dept(Dname,Ename,Sal)

This raises an important point: how can a user specify which semantics (set or multiset) is desired? In SQL, the keyword DISTINCT is used; similarly, an annotation is provided in CORAL. In fact, CORAL supports a number of annotations that can be used to choose a desired semantics or to provide optimization hints to the CORAL system. The added complexity of queries in a recursive language makes optimization difficult, and the use of annotations often makes a big difference in the quality of the optimized evaluation plan.
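The set-versus-multiset distinction behind the budget rules can be seen in a few lines of Python over made-up dept tuples (the data, names, and figures here are purely illustrative, not taken from the CORAL or LDL systems):

from collections import defaultdict

# dept(Dname, Ename, Sal) as a list of tuples; two employees share the same salary.
dept = [("toy", "anne", 40), ("toy", "bob", 40), ("toy", "carl", 50)]

multiset_budget = defaultdict(int)   # multiset-style grouping: every tuple's salary is added
set_of_salaries = defaultdict(set)   # set-style grouping: duplicate salary values collapse
for dname, ename, sal in dept:
    multiset_budget[dname] += sal
    set_of_salaries[dname].add(sal)

print(dict(multiset_budget))                              # {'toy': 130}
print({d: sum(s) for d, s in set_of_salaries.items()})    # {'toy': 90}  duplicates lost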

CORAL supports a class of programs with negation and grouping that is strictly larger than the class of stratified programs. The bill-of-materials problem, in which the cost of a composite part is defined as the sum of the costs of all its atomic parts, is an example of a problem that requires this added generality.

CORAL is closer to Prolog than to LDL in supporting nonground tuples; thus, the tuple equal(X,X) can be stored in the database, and it denotes that every binary tuple in which the first and second field values are the same is in the relation called equal. From an evaluation standpoint, CORAL’s main evaluation techniques are based on bottom-up evaluation, which is very different from Prolog’s top-down evaluation. However, CORAL also provides a Prolog-like top-down evaluation mode.

From an implementation perspective, CORAL implements several optimizations to deal with nonground tuples efficiently, in addition to techniques such as magic templates for pushing selections into recursive queries, pushing projections, and special optimizations of different kinds of (left- and right-) linear programs. It also provides an efficient way to compute nonstratified queries. A "shallow-compilation" approach is used, whereby the run-time system interprets the compiled plan. CORAL uses the EXODUS storage manager to provide support for disk-resident relations. It also has a good interface with C++ and is extensible, enabling a user to customize the system for special applications by adding new data types or relation implementations. An interesting feature is an explanation package that allows a user to examine graphically how a fact is generated; this is useful for debugging as well as for providing explanations.

25.7 Deductive Object-Oriented Databases

25.7.1 Overview of DOODs

25.7.2 VALIDITY

The emergence of deductive database concepts is contemporaneous with initial work in logic programming. Deductive object-oriented databases (DOODs) came about through the integration of the OO paradigm and logic programming. The observation that OO and deductive database systems generally have complementary strengths and weaknesses gave rise to the integration of the two paradigms.

25.7.1 Overview of DOODs


Since the late 1980s, several DOOD prototypes have been developed in universities and research laboratories. VALIDITY, which was developed at Bull, is the first industrial product in the DOOD arena. The LDL and CORAL systems we reviewed offer some additional object-oriented features—e.g., in CORAL++—and may be considered DOODs.

The following broad approaches have been adopted in the design of DOOD systems:

• Language extension: An existing deductive language model is extended with object-oriented features. For example, Datalog is extended to support identity, inheritance, and other OO features.

• Language integration: A deductive language is integrated with an imperative programming language in the context of an object model or type system. The resulting system supports a range of standard programs, while allowing different and complementary programming paradigms to be used for different tasks, or for different parts of the same task. This approach was pioneered by the Glue-Nail system.

• Language reconstruction: An object model is reconstructed, creating a new logic language that includes object-oriented features. In this strategy, the goal is to develop an object logic that captures the essentials of the object-oriented paradigm and that can also be used as a deductive programming language in DOODs. The rationale behind this approach is the argument that language extensions fail to combine object orientation and logic successfully, by losing declarativeness or by failing to capture all aspects of the object-oriented model.

25.7.2 VALIDITY

DEL Data Model

VALIDITY combines deductive capabilities with the ability to manipulate complex objects (OIDs, inheritance, methods, etc.). The ability to declaratively specify knowledge as deduction and integrity rules brings knowledge independence. Moreover, the logic-based language of deductive databases enables advanced tools, such as those for checking the consistency of a set of rules, to be developed. When compared with systems extending SQL technology, deductive systems offer more expressive declarative languages and cleaner semantics. VALIDITY provides the following:

1. A DOOD data model and language, called DEL (Datalog Extended Language).

2. An engine working along a client-server model.

3. A set of tools for schema and rule editing, validation, and querying.

The DEL data model provides object-oriented capabilities, similar to those offered by the ODMG data model (see Chapter 12), and includes both declarative and imperative features. The declarative features include deductive and integrity rules, with full recursion, stratified negation, disjunction, grouping, and quantification. The imperative features allow functions and methods to be written. The engine of VALIDITY integrates the traditional functions of a database (persistency, concurrency control, crash recovery, etc.) with advanced deductive capabilities for deriving information and verifying semantic integrity. The lowest-level component of the engine is a fact manager that integrates storage, concurrency control, and recovery functions. The fact manager supports fact identity and complex data items. In addition to locking, the concurrency control protocol integrates read-consistency technology, used in particular when verifying constraints. The higher-level component supports the DEL language and performs optimization, compilation, and execution of statements and queries. The engine also supports an SQL interface permitting SQL queries and updates to be run on VALIDITY data.

VALIDITY also has a deductive wrapper for SQL systems, called DELite. This supports a subset of DEL functionality (no constraints, no recursion, limited object capabilities, etc.) on top of commercial SQL systems.


DEL Data Model

The DEL data model integrates a rich type system with primitives to define persistent and derived data. The DEL type system consists of built-in types, which can be used to implement user-defined and composite types. Composite types are defined using four type constructors: (1) bag, (2) set, (3) list, and (4) tuple.

The basic unit of information in VALIDITY is called a fact. Facts are instances of predicates, which are logical constructs characterized by a name and a set of typed attributes. A fact specifies values for the attributes of the predicate of which it is an instance. There are four kinds of predicates and facts in VALIDITY:

1. Basis facts: These are persistent units of information stored in the database; they are instances of basis predicates, which have attributes and methods and are organized into inheritance hierarchies.

2. Derived facts: These are deduced from basis facts stored in the database or from other derived facts; they are instances of derived predicates.

3. Computed predicates and facts: These are similar to derived predicates and facts, but they are computed by means of imperative code instead of derivation. The distance between two points is a typical example.

4. Built-in predicates and facts: These are special computed predicates and facts whose associated function is provided by VALIDITY. Comparison operators are an example.

Basis facts have an identity that is analogous to the notion of object identifier in OO databases. Further, external mappings can be defined for a predicate; they enable the retrieval of facts (through their fact-IDs) based on the value of some of their unique attributes. Basis predicates may also have methods in the OO sense—that is, functions that can be invoked in the context of a specific fact.

25.8 Applications of Commercial Deductive Database Systems

25.8.1 LDL Applications

The LDL system has been applied to the following application domains:

• Enterprise modeling: This domain involves modeling the structure, processes, and constraints within an enterprise. Data related to an enterprise may result in an extended ER model containing hundreds of entities and relationships and thousands of attributes. A number of applications useful to designers of new applications (as well as to management) can be developed based on this "metadatabase," which contains dictionary-like information about the whole enterprise.

• Hypothesis testing or data dredging: This domain involves formulating a hypothesis, translating it into an LDL rule set and a query, and then executing the query against given data to test the hypothesis. The process is repeated by reformulating the rules and the query. This has been applied to genome data analysis in the field of microbiology, where data dredging consists of identifying the DNA sequences from low-level digitized autoradiographs from experiments performed on E. coli bacteria.

• Software reuse: The bulk of the software for an application is developed in standard procedural code, and a small fraction is rule-based and encoded in LDL. The rules give rise to a knowledge base that contains the following elements:

A definition of each C module used in the system.

A set of rules that defines ways in which modules can export/import functions, constraints, and so on.

The "knowledge base" can be used to make decisions that pertain to the reuse of software subsets. Modules can be recombined to satisfy specific tasks, as long as the relevant rules are satisfied. This is being experimented with in banking software.

25.8.2 VALIDITY Applications

Knowledge independence is a term used by VALIDITY developers to refer to a technical version of business rule independence. From a database standpoint, it is a step beyond data independence that brings about the integration of data and rules. The goal is to achieve streamlining of application development (multiple applications share rules managed by the database), application maintenance (changes in definitions and in regulations are more easily made), and ease of use (interactions are done through high-level tools enabled by the logic foundation). For instance, it simplifies the task of the application programmer, who does not need to include tests in his application to guarantee the soundness of his transactions. VALIDITY claims to be able to express, manage, and apply the business rules governing the interactions among various processes within a company.

VALIDITY is an appropriate tool for applying software engineering principles to application development. It allows the formal specification of an application in the DEL language, which can then be directly compiled. This eliminates the error-prone step that most methodologies based on entity-relationship conceptual designs and relational implementations require between specification and compilation. The following are some application areas of the VALIDITY system:

• Electronic commerce: In electronic commerce, complex customer profiles have to be matched against target descriptions. The profiles are built from various data sources. In a current application, demographic data and viewing history compose the viewer’s profiles. The matching process is also described by rules, and computed predicates deal with numeric computations. The declarative nature of DEL makes the formulation of the matching algorithm easy.

• Rules-governed processes: In a rules-governed process, well-defined rules define the actions to be performed. An application prototype has been developed whose goal is to handle the management of dangerous gases placed in containers, a task coordinated by a large number of frequently changing regulations. The classes of dangerous materials are modeled as DEL classes. The possible locations for the containers are constrained by rules, which reflect the regulations. In the case of an incident, deduction rules identify potential accidents. The main advantage of VALIDITY is the ease with which new regulations are taken into account.

• Knowledge discovery: The goal of knowledge discovery is to find new data relationships by analyzing existing data (see Section 26.2). An application prototype developed by the University of Illinois utilizes already existing minority student data that has been enhanced with rules in DEL.


• Concurrent engineering: A concurrent engineering application deals with large amounts of centralized data, shared by several participants. An application prototype has been developed in the area of civil engineering. The design data is modeled using the object-orientation power of the DEL language. When an inconsistency is detected, a new rule models the identified problem. Once a solution has been identified, it is turned into a constraint. DEL is able to handle the transformation of rules into constraints, and it can also handle any closed formula as an integrity constraint.

25.9 Summary

In this chapter we introduced deductive database systems, a relatively new branch of database management. This field has been influenced by logic programming languages, particularly by Prolog. A subset of Prolog called Datalog, which contains function-free Horn clauses, is primarily used as the basis of current deductive database work. Concepts of Datalog were introduced here. We discussed the standard backward-chaining inferencing mechanism of Prolog and a forward-chaining bottom-up strategy. The latter has been adapted to evaluate queries dealing with relations (extensional databases) by using standard relational operations together with Datalog. Procedures for nonrecursive and recursive query processing were discussed, and algorithms were presented for naive and seminaive evaluation of recursive queries. Negation is particularly difficult to deal with in such deductive databases; a popular concept called stratified negation was introduced in this regard.

We surveyed a commercial deductive database system called LDL, originally developed at MCC, and other experimental systems called CORAL and NAIL!. The latest deductive database implementations are called DOODs. They combine the power of object orientation with deductive capabilities. The most recent entry on the commercial DOOD scene is VALIDITY, which we discussed here briefly. The deductive database area is still in an experimental stage. Its adoption by industry will give a boost to its development. Toward this end, we mentioned practical applications in which LDL and VALIDITY are proving to be very valuable.

Exercises

25.1 Add the following facts to the example database in Figure 25.03:

supervise (ahmad,bob), supervise (franklin,gwen)

First modify the supervisory tree in Figure 25.01(b) to reflect this change. Then modify the diagram in Figure 25.04 showing the top-down evaluation of the query superior(james, Y).

25.2 Consider the following set of facts for the relation parent(X, Y), where Y is the parent of X:


parent(a,aa), parent(a,ab), parent(aa,aaa), parent(aa,aab), parent(aaa,aaaa), parent(aaa,aaab)

Consider the rules

1: ancestor(X,Y) :- parent(X,Y)

2: ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y)

which define the ancestor Y of X as above.

a. Show how to solve the Datalog query

ancestor(aa,X)?

using the naive strategy. Show your work at each step.

b. Answer the same query by computing only the changes in the ancestor relation and using them in rule 2 each time.

[This question is derived from Bancilhon and Ramakrishnan (1986).]

25.3 Consider a deductive database with the following rules:

ancestor(X,Y) :- father(X,Y)

ancestor(X,Y) :- father(X,Z), ancestor(Z,Y)

Notice that "father(X, Y)" means that Y is the father of X; "ancestor(X, Y)" means that

Y is the ancestor of X Consider the fact base

father(Harry,Issac), father(Issac,John), father(John,Kurt)

a. Construct a model-theoretic interpretation of the above rules using the given facts.

b. Consider that the database contains the above relation father(X, Y), another relation brother(X, Y), and a third relation birth(X, B), where B is the birthdate of person X. State a rule that computes the first cousins of the following variety: their fathers must be brothers.

c. Show a complete Datalog program with fact-based and rule-based literals that computes the following relation: the list of pairs of cousins where the first person is born after 1960 and the second after 1970. You may use "greater than" as a built-in predicate. (Note: Sample facts for brother, birth, and person must also be shown.)

25.4 Consider the following rules:

reachable(X,Y) :- flight(X,Y)

reachable(X,Y) :- flight(X,Z), reachable(Z,Y)

where reachable(X, Y) means that city Y can be reached from city X, and flight(X, Y) means that there is a flight to city Y from city X.

a. Construct fact predicates that describe the following:

i. Los Angeles, New York, Chicago, Atlanta, Frankfurt, Paris, Singapore, and Sydney are cities.

ii. The following flights exist: LA to NY, NY to Atlanta, Atlanta to Frankfurt, Frankfurt to Atlanta, Frankfurt to Singapore, and Singapore to Sydney. (Note: No flight in the reverse direction can be automatically assumed.)

b. Is the given data cyclic? If so, in what sense?

c. Construct a model-theoretic interpretation (that is, an interpretation similar to the one shown in Figure 25.03) of the above facts and rules.

d. Consider the query

reachable(Atlanta,Sydney)?

How will this query be executed using naive and seminaive evaluation? List the series of steps it will go through.

e. Consider the following rule-defined predicates:

round-trip-reachable(X,Y) :- reachable(X,Y), reachable(Y,X)

duration(X,Y,Z)

Draw a predicate dependency graph for the above predicates. (Note: duration(X, Y, Z) means that you can take a flight from X to Y in Z hours.)

f. Consider the following query: What cities are reachable in 12 hours from Atlanta? Show how to express it in Datalog. Assume built-in predicates like greater-than(X, Y). Can this be converted into a relational algebra statement in a straightforward way? Why or why not?

g. Consider the predicate population(X, Y), where Y is the population of city X. Consider the following query: List all possible bindings of the predicate pair (X, Y), where Y is a city that can be reached in two flights from city X, which has over 1 million people. Show this query in Datalog. Draw a corresponding query tree in relational algebraic terms.

25.5 Consider the following rules:


sgc(X,Y) :- eq(X,Y)

sgc(X,Y) :- par(X,X1), sgc(X1,Y1), par(Y,Y1)

and the EDB PAR = {(d, g), (e, g), (b, d), (a, d), (a, h), (c, e)}. What is the result of the query

sgc(a,Y)?

Solve it using the naive and seminaive methods.

25.6 The following rules have been given:

path(X,Y) :- arc(X,Y)

path(X,Y) :- path(X,Z), path(Z,Y)

Suppose that the nodes in a graph are {a, b, c, d} and there are no arcs. Let the set of paths be P = {(a, b), (c, d)}. Show that this model is not a fixed point.

25.7 Consider the frequent flyer Skymiles program database at an airline. It maintains the following relations:

99status(X,Y), 98status(X,Y), 98Miles(X,Y)

The status data refers to passenger X having status Y for the year, where Y can be regular, silver, gold, or platinum. Let the requirements for achieving gold status be expressed by:

99status(X,’gold’) :- 98status(X,’gold’) AND 98Miles(X,Y) AND Y>45000


99status(X,’gold’) :- 98status(X,’platinum’) AND 98Miles(X,Y) AND Y>40000

99status(X,’gold’) :- 98status(X,’regular’) AND 98Miles(X,Y) AND Y>50000

98Miles(X, Y) gives the miles Y flown by passenger X in 1998. Assume that similar rules exist for reaching other statuses.

a. Make up a set of other reasonable rules for achieving platinum status.

b. Is the above programmable in Datalog? Why or why not?

c. Write a Prolog program with the above rules, populate the predicates with sample data, and show how a query like 99status(‘John Smith’, Y) is computed in Prolog.

25.8 Consider a tennis tournament database with predicates rank(X, Y): X holds rank Y; beats(X1, X2): X1 beats X2; and superior(X1, X2): X1 is a superior player to X2. Assume that if a player beats another player, he is superior to that player, and that if player 1 beats player 2 and player 2 is superior to player 3, then player 1 is superior to player 3.

Construct a set of recursive rules using the above predicates. (Note: We shall hypothetically assume that there are no "upsets"—that the above rule is always met.)

a. Construct a set of recursive rules.

b. Populate data for the beats relation with 10 players playing 3 matches each.

c. Show a computation of the superior table using this data.

d. Does superior have a fixpoint? Why or why not? Explain.

For the population of players in the database, assuming John is one of the players, how do you compute "superior(john, X)?" using the naive and seminaive algorithms?

Selected Bibliography

The early developments of the logic and database approach are surveyed by Gallaire et al. (1984). Reiter (1984) provides a reconstruction of relational database theory, while Levesque (1984) provides a discussion of incomplete knowledge in light of logic. Gallaire and Minker (1978) provide an early book on this topic. A detailed treatment of logic and databases appears in Ullman (1989, vol. 2), and there is a related chapter in Volume 1 (1988). Ceri, Gottlob, and Tanca (1990) present a comprehensive yet concise treatment of logic and databases. Das (1992) is a comprehensive book on deductive databases and logic programming. The early history of Datalog is covered in Maier and Warren (1988). Clocksin and Mellish (1994) is an excellent reference on the Prolog language.

Aho and Ullman (1979) provide an early algorithm for dealing with recursive queries, using the least fixed-point operator. Bancilhon and Ramakrishnan (1986) give an excellent and detailed description of the approaches to recursive query processing, with detailed examples of the naive and seminaive approaches. Excellent survey articles on deductive databases and recursive query processing include Warren (1992) and Ramakrishnan and Ullman (1993). A complete description of the seminaive approach based on relational algebra is given in Bancilhon (1985). Other approaches to recursive query processing include the recursive query/subquery strategy of Vieille (1986), which is a top-down interpreted strategy, and the Henschen and Naqvi (1984) top-down compiled iterative strategy. Balbin and Rao (1987) discuss an extension of the seminaive differential approach for multiple predicates.


The original paper on magic sets is by Bancilhon et al. (1986); Beeri and Ramakrishnan (1987) extends it. Mumick et al. (1990) show the applicability of magic sets to nonrecursive nested SQL queries. Other approaches to optimizing rules without rewriting them appear in Vieille (1986, 1987). Kifer and Lozinskii (1986) propose a different technique. Bry (1990) discusses how the top-down and bottom-up approaches can be reconciled. Whang and Navathe (1992) describe an extended disjunctive normal form technique to deal with recursion in relational algebra expressions for providing an expert system interface over a relational DBMS.

Chang (1981) describes an early system for combining deductive rules with relational databases. The LDL system prototype is described in Chimenti et al. (1990). Krishnamurthy and Naqvi (1989) introduce the "choice" notion in LDL. Zaniolo (1988) discusses the language issues for the LDL system. A language overview of CORAL is provided in Ramakrishnan et al. (1992), and the implementation is described in Ramakrishnan et al. (1993). An extension to support object-oriented features, called CORAL++, is described in Srivastava et al. (1993). Ullman (1985) provides the basis for the NAIL! system, which is described in Morris et al. (1987). Phipps et al. (1991) describe the GLUE-NAIL! deductive database system.

Zaniolo (1990) reviews the theoretical background and the practical importance of deductive databases. Nicolas (1997) gives an excellent history of the developments leading up to DOODs. Falcone et al. (1997) survey the DOOD landscape. References on the VALIDITY system include Friesen et al. (1995), Vieille (1997), and Dietrich et al. (1999).


The most commonly chosen domain is finite and is called the Herbrand Universe.

Note 5

Notice that, in our example, the order of search is quite similar for both forward and backward chaining. However, this is not generally the case.

Note 6

For a detailed discussion of fixed points, consult Ullman (1988).

Chapter 26: Data Warehousing And Data Mining

management upward with information at the correct level of detail to support decision making. Data warehousing, on-line analytical processing (OLAP), and data mining provide this functionality. In this chapter we give a broad overview of each of these technologies.

The market for such support has been growing rapidly since the mid-1990s. As managers become increasingly aware of the growing sophistication of the analytic capabilities of these data-based systems, they look increasingly for more sophisticated support for their key organizational decisions.

26.1 Data Warehousing

26.1.1 Terminology and Definitions

26.1.2 Characteristics of Data Warehouses

26.1.3 Data Modeling for Data Warehouses


26.1.4 Building a Data Warehouse

26.1.5 Typical Functionality of Data Warehouses

26.1.6 Difficulties of Implementing Data Warehouses

26.1.7 Open Issues in Data Warehousing

Because data warehouses have been developed in numerous organizations to meet particular needs, there is no single, canonical definition of the term data warehouse (Note 1). Professional magazine articles and books in the popular press have elaborated on the meaning in a variety of ways. Vendors have capitalized on the popularity of the term to help market a variety of related products, and consultants have provided a large variety of services, all under the data warehousing banner. However, data warehouses are quite distinct from traditional databases in their structure, functioning, performance, and purpose.

26.1.1 Terminology and Definitions

W. H. Inmon (Note 2) characterized a data warehouse as "a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions." Data warehouses provide access to data for complex analysis, knowledge discovery, and decision making.

They support high-performance demands on an organization's data and information. Several types of applications—OLAP, DSS, and data mining applications—are supported. OLAP (on-line analytical processing) is a term used to describe the analysis of complex data from the data warehouse. In the hands of skilled knowledge workers, OLAP tools use distributed computing capabilities for analyses that require more storage and processing power than can be economically and efficiently located on an individual desktop. DSS (decision-support systems), also known as EIS (executive information systems), not to be confused with enterprise integration systems, support an organization's leading decision makers with higher-level data for complex and important decisions. Data mining (which we will discuss in detail in Section 26.2) is used for knowledge discovery, the process of searching data for unanticipated new knowledge.

Traditional databases support on-line transaction processing (OLTP), which includes insertions, updates, and deletions, while also supporting information query requirements. Traditional relational databases are optimized to process queries that may touch a small part of the database and transactions that deal with insertions or updates of a few tuples per relation. Thus, they cannot be optimized for OLAP, DSS, or data mining. By contrast, data warehouses are designed precisely to support efficient extraction, processing, and presentation for analytic and decision-making purposes. In comparison to traditional databases, data warehouses generally contain very large amounts of data from multiple sources, which may include databases from different data models and sometimes files acquired from independent systems and platforms.

26.1.2 Characteristics of Data Warehouses

To discuss data warehouses and distinguish them from transactional databases calls for an appropriate data model. The multidimensional data model (explained in more detail below) is a good fit for OLAP and decision-support technologies. In contrast to multidatabases, which provide access to disjoint and usually heterogeneous databases, a data warehouse is frequently a store of integrated data from multiple sources, processed for storage in a multidimensional model. Unlike most transactional databases, data warehouses typically support time-series and trend analysis, both of which require more historical data than are generally maintained in transactional databases. Compared with transactional databases, data warehouses are nonvolatile. That means that information in the data warehouse changes far less often and may be regarded as non-real-time with periodic updating. In transactional systems, transactions are the unit and agent of change to the database; by contrast, data warehouse


information is much more coarse grained and is refreshed according to a careful choice of refresh policy, usually incremental. Warehouse updates are handled by the warehouse's acquisition component, which provides all required preprocessing.

We can also describe data warehousing more generally as "a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions" (Note 3). Figure 26.01 gives an overview of the conceptual structure of a data warehouse. It shows the entire data warehousing process. This process includes possible cleaning and reformatting of data before its warehousing. At the back end of the process, OLAP, data mining, and DSS may generate new relevant information such as rules; this information is shown in the figure going back into the warehouse. The figure also shows that data sources may include files.

Data warehouses have the following distinctive characteristics (Note 4):

• multidimensional conceptual view

• generic dimensionality

• unlimited dimensions and aggregation levels

• unrestricted cross-dimensional operations

• dynamic sparse matrix handling

• client-server architecture

• multi-user support

• accessibility

• transparency

• intuitive data manipulation

• consistent reporting performance

Data marts, in contrast, generally are targeted to a subset of the organization, such as a department, and are more tightly focused.

26.1.3 Data Modeling for Data Warehouses

Multidimensional models take advantage of inherent relationships in data to populate data in multidimensional matrices called data cubes. (These may be called hypercubes if they have more than three dimensions.) For data that lend themselves to dimensional formatting, query performance in multidimensional matrices can be much better than in the relational data model. Three examples of dimensions in a corporate data warehouse would be the corporation's fiscal periods, products, and regions.


A standard spreadsheet is a two-dimensional matrix. One example would be a spreadsheet of regional sales by product for a particular time period. Products could be shown as rows, with sales revenues for each region comprising the columns. (Figure 26.02 shows this two-dimensional organization.) Adding a time dimension, such as an organization's fiscal quarters, would produce a three-dimensional matrix, which could be represented using a data cube.

In Figure 26.03 there is a three-dimensional data cube that organizes product sales data by fiscal quarters and sales regions. Each cell could contain data for a specific product, specific fiscal quarter, and specific region. By including additional dimensions, a data hypercube could be produced, although more than three dimensions cannot be easily visualized or presented graphically. The data can be queried directly in any combination of dimensions, bypassing complex database queries. Tools exist for viewing data according to the user's choice of dimensions. Changing from one dimensional hierarchy (orientation) to another is easily accomplished in a data cube by a technique called pivoting (also called rotation). In this technique the data cube can be thought of as rotating to show a different orientation of the axes. For example, you might pivot the data cube to show regional sales revenues as rows, the fiscal quarter revenue totals as columns, and the company's products in the third dimension (Figure 26.04). Hence, this technique is equivalent to having a regional sales table for each product separately, where each table shows quarterly sales for that product, region by region.

Multidimensional models lend themselves readily to hierarchical views in what is known as roll-up display and drill-down display. Roll-up display moves up the hierarchy, grouping into larger units along a dimension (e.g., summing weekly data by quarter, or by year). Figure 26.05 shows a roll-up display that moves from individual products to a coarser grain of product categories. Shown in Figure 26.06, a drill-down display provides the opposite capability, furnishing a finer-grained view, perhaps disaggregating country sales by region, then regional sales by subregion, and also breaking up products by styles.
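As a minimal sketch of roll-up along the time dimension (the 13-weeks-per-quarter mapping and the sales figures below are assumptions for illustration), weekly data can be grouped into coarser quarterly totals:

# Weekly sales keyed by (year, week); the values are illustrative assumptions.
weekly_sales = {(1998, 1): 500, (1998, 2): 450, (1998, 14): 600, (1998, 15): 700}

def week_to_quarter(week):
    # Assumed mapping: 13 weeks per fiscal quarter.
    return (week - 1) // 13 + 1

def roll_up_to_quarters(weekly):
    """Roll weekly figures up to (year, quarter) totals."""
    quarterly = {}
    for (year, week), amount in weekly.items():
        key = (year, week_to_quarter(week))
        quarterly[key] = quarterly.get(key, 0) + amount
    return quarterly

print(roll_up_to_quarters(weekly_sales))   # {(1998, 1): 950, (1998, 2): 1300}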


The multidimensional storage model involves two types of tables: dimension tables and fact tables. A dimension table consists of tuples of attributes of the dimension. A fact table can be thought of as having tuples, one per recorded fact. This fact contains some measured or observed variable(s) and identifies it (them) with pointers to dimension tables. The fact table contains the data and the dimensions identify each tuple in that data. Figure 26.07 contains an example of a fact table that can be viewed from the perspective of multiple dimension tables.

Two common multidimensional schemas are the star schema and the snowflake schema. The star schema consists of a fact table with a single table for each dimension (Figure 26.07). The snowflake schema is a variation on the star schema in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them (Figure 26.08). Some installations are normalizing data warehouses up to the third normal form so that they can access the data warehouse to the finest level of detail. A fact constellation is a set of fact tables that share some dimension tables. Figure 26.09 shows a fact constellation with two fact tables, business results and business forecast. These share the dimension table called product. Fact constellations limit the possible queries for the warehouse.
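A minimal sketch of how a star schema's fact and dimension tables fit together is shown below; the table and attribute names are our own assumptions, not those of Figure 26.07. Each fact tuple carries foreign keys that point into the dimension tables, plus the measured values.

# Dimension tables: one row per dimension member, keyed by a surrogate key.
product_dim = {1: {"name": "P123", "category": "audio"},
               2: {"name": "P124", "category": "video"}}
region_dim  = {10: {"name": "East"}, 20: {"name": "West"}}
quarter_dim = {100: {"label": "Qtr1", "year": 1998}}

# Fact table: one tuple per recorded fact, with pointers (foreign keys)
# into the dimensions and the measured variables (units sold, revenue).
sales_fact = [
    {"product_key": 1, "region_key": 10, "quarter_key": 100, "units": 50, "revenue": 1000},
    {"product_key": 2, "region_key": 20, "quarter_key": 100, "units": 40, "revenue": 950},
]

# A star join: resolve the pointers to answer "revenue by product category and region".
totals = {}
for fact in sales_fact:
    key = (product_dim[fact["product_key"]]["category"],
           region_dim[fact["region_key"]]["name"])
    totals[key] = totals.get(key, 0) + fact["revenue"]
print(totals)   # {('audio', 'East'): 1000, ('video', 'West'): 950}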

Data warehouse storage also utilizes indexing techniques to support high-performance access (see Chapter 6 for a discussion of indexing). A technique called bitmap indexing constructs a bit vector for each value in a domain (column) being indexed. It works very well for domains of low cardinality. There is a 1 bit placed in the jth position in the vector if the jth row contains the value being indexed. For example, imagine an inventory of 100,000 cars with a bitmap index on car size. If there are four car sizes—economy, compact, midsize, and fullsize—there will be four bit vectors, each containing 100,000 bits (12.5 K) for a total index size of 50K. Bitmap indexing can provide considerable input/output and storage space advantages in low-cardinality domains. With bit vectors, a bitmap index can provide dramatic improvements in comparison, aggregation, and join performance. In a star schema, dimensional data can be indexed to tuples in the fact table by join indexing. Join indexes are traditional indexes to maintain relationships between primary key and foreign key values. They relate the values of a dimension of a star schema to rows in the fact table. For example, consider a sales fact table that has city and fiscal quarter as dimensions. If there is a join index on city, for each city the join index maintains the tuple IDs of tuples containing that city. Join indexes may involve multiple dimensions.
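The following sketch (with assumed data; a real warehouse would store the vectors in packed form) builds bitmap vectors for the car-size example, shows how the 12.5 K-per-vector figure arises, and shows how vectors combine with bitwise operations:

# Build one bit vector per distinct value of the low-cardinality column "size".
cars = ["economy", "compact", "midsize", "fullsize", "economy", "compact"]  # imagine 100,000 rows

def build_bitmap_index(column):
    index = {}
    for row_id, value in enumerate(column):
        index.setdefault(value, 0)
        index[value] |= 1 << row_id          # set the bit at this row's position
    return index

size_index = build_bitmap_index(cars)

# Each vector holds one bit per row: 100,000 rows -> 100,000 bits = 12.5 K per value,
# and four car sizes give roughly 4 x 12.5 K = 50 K for the whole index.

# Bitwise OR/AND of vectors answers multi-value predicates without scanning the table,
# e.g. rows whose size is economy OR compact:
small_cars = size_index["economy"] | size_index["compact"]
matching_rows = [row_id for row_id in range(len(cars)) if small_cars >> row_id & 1]
print(matching_rows)   # [0, 1, 4, 5]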

Data warehouse storage can facilitate access to summary data by taking further advantage of the nonvolatility of data warehouses and a degree of predictability of the analyses that will be performed using them. Two approaches have been used: (1) smaller tables including summary data such as quarterly sales or revenue by product line, and (2) encoding of level (e.g., weekly, quarterly, annual) into existing tables. By comparison, the overhead of creating and maintaining such aggregations would likely be excessive in a volatile, transaction-oriented database.


26.1.4 Building a Data Warehouse

In constructing a data warehouse, builders should take a broad view of the anticipated use of the warehouse. There is no way to anticipate all possible queries or analyses during the design phase. However, the design should specifically support ad-hoc querying, that is, accessing data with any meaningful combination of values for the attributes in the dimension or fact tables. For example, a marketing-intensive consumer-products company would require different ways of organizing the data warehouse than would a nonprofit charity focused on fund raising. An appropriate schema should be chosen that reflects anticipated usage.

Acquisition of data for the warehouse involves the following steps:

• The data must be extracted from multiple, heterogeneous sources, for example, databases or other data feeds such as those containing financial market data or environmental data

• Data must be formatted for consistency within the warehouse. Names, meanings, and domains of data from unrelated sources must be reconciled. For instance, subsidiary companies of a large corporation may have different fiscal calendars with quarters ending on different dates, making it difficult to aggregate financial data by quarter. Various credit cards may report their transactions differently, making it difficult to compute all credit sales. These format inconsistencies must be resolved.

• The data must be cleaned to ensure validity. Data cleaning is an involved and complex process that has been identified as the largest labor-demanding component of data warehouse construction. For input data, cleaning must occur before the data are loaded into the warehouse. There is nothing about cleaning data that is specific to data warehousing and that could not be applied to a host database. However, since input data must be examined and formatted consistently, data warehouse builders should take this opportunity to check for validity and quality. Recognizing erroneous and incomplete data is difficult to automate, and cleaning that requires automatic error correction can be even tougher. Some aspects, such as domain checking, are easily coded into data cleaning routines, but automatic recognition of other data problems can be more challenging. (For example, one might require that City = 'San Francisco' together with State = 'CT' be recognized as an incorrect combination; a minimal domain-check sketch appears after this list.) After such problems have been taken care of, similar data from different sources must be coordinated for loading into the warehouse. As data managers in the organization discover that their data are being cleaned for input into the warehouse, they will likely want to upgrade their data with the cleaned data. The process of returning cleaned data to the source is called backflushing (see Figure 26.01).

• The data must be fitted into the data model of the warehouse. Data from the various sources must be installed in the data model of the warehouse. Data may have to be converted from relational, object-oriented, or legacy databases (network and/or hierarchical) to a multidimensional model.

• The data must be loaded into the warehouse. The sheer volume of data in the warehouse makes loading the data a significant task. Monitoring tools for loads as well as methods to recover from incomplete or incorrect loads are required. With the huge volume of data in the warehouse, incremental updating is usually the only feasible approach. The refresh policy will probably emerge as a compromise that takes into account the answers to the following questions:

• How up-to-date must the data be?

• Can the warehouse go off-line, and for how long?

• What are the data interdependencies?

• What is the storage availability?

• What are the distribution requirements (such as for replication and partitioning)?

• What is the loading time (including cleaning, formatting, copying, transmitting, and overhead such as index rebuilding)?
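Below is the minimal domain-check sketch referred to in the cleaning step above. The valid city/state pairs and the input records are assumptions for illustration; a production routine would draw its reference domain from an authoritative source.

# Reference domain of valid (city, state) combinations -- an assumed lookup table.
valid_city_state = {("San Francisco", "CA"), ("Hartford", "CT"), ("Austin", "TX")}

records = [
    {"customer": "C1", "city": "San Francisco", "state": "CA"},
    {"customer": "C2", "city": "San Francisco", "state": "CT"},   # invalid combination
]

def domain_check(record):
    """Flag records whose city/state pair falls outside the reference domain."""
    return (record["city"], record["state"]) in valid_city_state

clean, rejected = [], []
for r in records:
    (clean if domain_check(r) else rejected).append(r)

print(len(clean), "accepted;", len(rejected), "sent back for correction (backflushing)")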


As we have said, databases must strike a balance between efficiency in transaction processing and supporting query requirements (ad hoc user requests), but a data warehouse is typically optimized for access from a decision maker's needs. Data storage in a data warehouse reflects this specialization and involves the following processes:

• Storing the data according to the data model of the warehouse

• Creating and maintaining required data structures

• Creating and maintaining appropriate access paths

• Providing for time-variant data as new data are added

• Supporting the updating of warehouse data

• Refreshing the data

• Purging data

Although adequate time can be devoted initially to constructing the warehouse, the sheer volume of data in the warehouse generally makes it impossible to simply reload the warehouse in its entirety later on. Alternatives include selective (partial) refreshing of data and separate warehouse versions (requiring double storage capacity for the warehouse!). When the warehouse uses an incremental data refreshing mechanism, data may need to be periodically purged; for example, a warehouse that maintains data on the previous twelve business quarters may purge its data each year.

Data warehouses must also be designed with full consideration of the environment in which they will reside. Important design considerations include the following:

• Usage projections

• The fit of the data model

• Characteristics of available sources

• Design of the metadata component

• Modular component design

• Design for manageability and change

• Considerations of distributed and parallel architecture

We discuss each of these in turn. Warehouse design is initially driven by usage projections; that is, by expectations about who will use the warehouse and in what way. Choice of a data model to support this usage is a key initial decision. Usage projections and the characteristics of the warehouse's data sources are both taken into account. Modular design is a practical necessity to allow the warehouse to evolve with the organization and its information environment. In addition, a well-built data warehouse must be designed for maintainability, enabling the warehouse managers to effectively plan for and manage change while providing optimal support to users.

You may recall the term metadata from Chapter 2; metadata was defined as the description of a database, including its schema definition. The metadata repository is a key data warehouse component. The metadata repository includes both technical and business metadata. The first, technical metadata, covers details of acquisition processing, storage structures, data descriptions, warehouse operations and maintenance, and access support functionality. The second, business metadata, includes the relevant business rules and organizational details supporting the warehouse.

The architecture of the organization's distributed computing environment is a major determining characteristic for the design of the warehouse. There are two basic distributed architectures: the distributed warehouse and the federated warehouse. For a distributed warehouse, all the issues of distributed databases are relevant, for example, replication, partitioning, communications, and consistency concerns. A distributed architecture can provide benefits particularly important to warehouse performance, such as improved load balancing, scalability of performance, and higher availability. A single replicated metadata repository would reside at each distribution site. The idea of the federated warehouse is like that of the federated database: a decentralized confederation of autonomous data warehouses, each with its own metadata repository. Given the magnitude of the challenge inherent to data warehouses, it is likely that such federations will consist of smaller-scale components, such as data marts. Large organizations may choose to federate data marts rather than build huge data warehouses.


26.1.5 Typical Functionality of Data Warehouses

Data Warehousing and Views

Data warehouses exist to facilitate complex, data-intensive, and frequent ad hoc queries. Accordingly, data warehouses must provide far greater and more efficient query support than is demanded of transactional databases. The data warehouse access component supports enhanced spreadsheet functionality, efficient query processing, structured queries, ad hoc queries, data mining, and materialized views. In particular, enhanced spreadsheet functionality includes support for state-of-the-art spreadsheet applications (e.g., MS Excel) as well as for OLAP application programs. These offer preprogrammed functionalities such as the following:

• Roll-up: Data is summarized with increasing generalization (e.g., weekly to quarterly to annually)

• Drill-down: Increasing levels of detail are revealed (the complement of roll-up)

• Pivot: Cross tabulation (also referred to as rotation) is performed

• Slice and dice: Performing projection operations on the dimensions

• Sorting: Data is sorted by ordinal value

• Selection: Data is available by value or range

• Derived (computed) attributes: Attributes are computed by operations on stored and derived values
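As a rough illustration of slice and dice (fixing one dimension to a chosen value and projecting the remaining dimensions), the sketch below reuses a dictionary-style cube like the one shown earlier; the dimension names and values are assumptions:

# Cells keyed by (product, region, quarter); the values are illustrative.
cube = {
    ("P123", "East", "Qtr1"): 1000, ("P123", "East", "Qtr2"): 1100,
    ("P123", "West", "Qtr1"): 1200, ("P124", "East", "Qtr1"): 800,
}
dims = ("product", "region", "quarter")

def slice_and_dice(cube, fixed, keep):
    """Fix some dimensions to chosen values (slice) and project the result
    onto the dimensions listed in `keep` (dice), summing the rest away."""
    result = {}
    for key, value in cube.items():
        cell = dict(zip(dims, key))
        if all(cell[d] == v for d, v in fixed.items()):
            out_key = tuple(cell[d] for d in keep)
            result[out_key] = result.get(out_key, 0) + value
    return result

# Slice the cube at region = East and view product totals by quarter.
print(slice_and_dice(cube, fixed={"region": "East"}, keep=("product", "quarter")))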

Because data warehouses are free from the restrictions of the transactional environment, there is an increased efficiency in query processing. Among the tools and techniques used are: query transformation, index intersection and union, special ROLAP (relational OLAP) and MOLAP (multidimensional OLAP) functions, SQL extensions, advanced join methods, and intelligent scanning (as in piggy-backing multiple queries).

Improved performance has also been attained with parallel processing. Parallel server architectures include symmetric multiprocessor (SMP), cluster, and massively parallel processing (MPP), and combinations of these.

Knowledge workers and decision makers use tools ranging from parametric queries to ad hoc queries to data mining. Thus, the access component of the data warehouse must provide support for structured queries (both parametric and ad hoc). These together make up a managed query environment. Data mining itself uses techniques from statistical analysis and artificial intelligence. Statistical analysis can be performed by advanced spreadsheets, by sophisticated statistical analysis software, or by custom-written programs. Techniques such as lagging, moving averages, and regression analysis are also commonly employed. Artificial intelligence techniques, which may include genetic algorithms and neural networks, are used for classification and are employed to discover knowledge from the data warehouse that may be unexpected or difficult to specify in queries. (We treat data mining in detail in Section 26.2.)
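As one example of the statistical techniques mentioned above, a simple moving average smooths a sales time series so that the underlying trend is easier to see; the window length and the series below are assumptions for illustration:

def moving_average(series, window=3):
    """Return the simple moving average of `series` with the given window."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

monthly_sales = [120, 135, 128, 150, 165, 158, 172]   # illustrative figures
print(moving_average(monthly_sales, window=3))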

Data Warehousing and Views

Some people have considered data warehouses to be an extension of database views. Earlier we mentioned materialized views as one way of meeting requirements for improved access to data (see Chapter 8 for a discussion of views). Materialized views have been explored for their performance enhancement. Views, however, provide only a subset of the functions and capabilities of data warehouses. Views and data warehouses are alike in that they both have read-only extracts from databases and subject-orientation. However, data warehouses are different from views in the following ways:


• Data warehouses exist as persistent storage instead of being materialized on demand

• Data warehouses are not usually relational, but rather multidimensional. Views of a relational database are relational.

• Data warehouses can be indexed to optimize performance. Views cannot be indexed independently of the underlying databases.

• Data warehouses characteristically provide specific support of functionality; views cannot.

• Data warehouses provide large amounts of integrated and often temporal data, generally more than is contained in one database, whereas views are an extract of a database.

26.1.6 Difficulties of Implementing Data Warehouses

Some significant operational issues arise with data warehousing: construction, administration, and quality control. Project management—the design, construction, and implementation of the warehouse—is an important and challenging consideration that should not be underestimated. The building of an enterprise-wide warehouse in a large organization is a major undertaking, potentially taking years from conceptualization to implementation. Because of the difficulty and amount of lead time required for such an undertaking, the widespread development and deployment of data marts may provide an attractive alternative, especially to those organizations with urgent needs for OLAP, DSS, and/or data mining support.

The administration of a data warehouse is an intensive enterprise, proportional to the size and complexity of the warehouse. An organization that attempts to administer a data warehouse must realistically understand the complex nature of its administration. Although designed for read access, a data warehouse is no more a static structure than any of its information sources. Source databases can be expected to evolve. The warehouse's schema and acquisition component must be expected to be updated to handle these evolutions.

A significant issue in data warehousing is the quality control of data. Both quality and consistency of data are major concerns. Although the data passes through a cleaning function during acquisition, quality and consistency remain significant issues for the database administrator. Melding data from heterogeneous and disparate sources is a major challenge given differences in naming, domain definitions, identification numbers, and the like. Every time a source database changes, the data warehouse administrator must consider the possible interactions with other elements of the warehouse.

Usage projections should be estimated conservatively prior to construction of the data warehouse and should be revised continually to reflect current requirements. As utilization patterns become clear and change over time, storage and access paths can be tuned to remain optimized for support of the organization's use of its warehouse. This activity should continue throughout the life of the warehouse in order to remain ahead of demand. The warehouse should also be designed to accommodate addition and attrition of data sources without major redesign. Sources and source data will evolve, and the warehouse must accommodate such change. Fitting the available source data into the data model of the warehouse will be a continual challenge, a task that is as much art as science. Because there is continual rapid change in technologies, both the requirements and capabilities of the warehouse will change considerably over time. Additionally, data warehousing technology itself will continue to evolve for some time, so that component structures and functionalities will continually be upgraded. This certain change is excellent motivation for having a fully modular design of components.

Administration of a data warehouse will require far broader skills than are needed for traditional database administration. A team of highly skilled technical experts with overlapping areas of expertise will likely be needed, rather than a single individual. Like database administration, data warehouse administration is only partly technical; a large part of the responsibility requires working effectively with all the members of the organization with an interest in the data warehouse. However difficult that can be at times for database administrators, it is that much more challenging for data warehouse administrators, as the scope of their responsibilities is considerably broader.


Design of the management function and selection of the management team for a data warehouse are crucial. Managing the data warehouse in a large organization will surely be a major task. Many commercial tools are already available to support management functions. Effective data warehouse management will certainly be a team function, requiring a wide set of technical skills, careful coordination, and effective leadership. Just as we must prepare for the evolution of the warehouse, we must also recognize that the skills of the management team will, of necessity, evolve with it.

26.1.7 Open Issues in Data Warehousing

There has been much marketing hyperbole surrounding the term "data warehouse"; the exaggerated expectations will probably subside, but the concept of integrated data collections to support

sophisticated analysis and decision support will undoubtedly endure

Data warehousing as an active research area is likely to see increased research activity in the near future as warehouses and data marts proliferate. Old problems will receive new emphasis; for example, data cleaning, indexing, partitioning, and views could receive renewed attention.

Academic research into data warehousing technologies will likely focus on automating aspects of the warehouse that currently require significant manual intervention, such as data acquisition, data quality management, selection and construction of appropriate access paths and structures, self-maintainability, functionality, and performance optimization. Application of active database functionality (see Section 23.1) to the warehouse is also likely to receive considerable attention. Incorporation of domain and business rules appropriately into the warehouse creation and maintenance process may make it more intelligent, relevant, and self-governing.

Commercial software for data warehousing is already available from a number of vendors, focusing principally on management of the data warehouse and OLAP/DSS applications. Other aspects of data warehousing, such as design and data acquisition (especially cleaning), are being addressed primarily by teams of in-house IT managers and consultants.

26.2 Data Mining

26.2.1 An Overview of Data Mining Technology

26.2.2 Association Rules

26.2.3 Approaches to Other Data Mining Problems

26.2.4 Applications of Data Mining

26.2.5 State-of-the-Art of Commercial Data Mining Tools

Over the last three decades, many organizations have generated a large amount of machine-readable data in the form of files and databases. To process this data, we have the database technology available to us that supports query languages like SQL. The problem with SQL is that it is a structured language that assumes the user is aware of the database schema. SQL supports operations of relational algebra that allow a user to select from tables (rows and columns of data) or join related information from tables based on common fields. In the last section we saw that data warehousing technology affords types of functionality such as consolidation, aggregation, and summarization of data. It lets us view the same information along multiple dimensions. In this section, we will focus our attention on yet another very popular area of interest known as data mining. As the term connotes, data mining refers to the mining or discovery of new information in terms of patterns or rules from vast amounts of data. To be practically useful, data mining must be carried out efficiently on large files and databases. To date, it is not well integrated with database management systems.


We will briefly review the state of the art of this rather extensive field of data mining, which uses techniques from such areas as machine learning, statistics, neural networks, and genetic algorithms. We will highlight the nature of the information that is discovered, the types of problems faced in databases, and potential applications. We also survey the state of the art of a large number of commercial tools available (see Section 26.2.5) and describe a number of research advances that are needed to make this area viable.

26.2.1 An Overview of Data Mining Technology

Data Mining and Data Warehousing

Data Mining as a Part of the Knowledge Discovery Process

Goals of Data Mining and Knowledge Discovery

Types of Knowledge Discovered during Data Mining

In reports such as the very popular Gartner Report (Note 5), data mining has been hailed as one of the top technologies for the near future. In this section we relate data mining to the broader area called knowledge discovery and contrast the two by means of an illustrative example. We also discuss a number of data mining techniques and algorithms in Section 26.2.3.

Data Mining and Data Warehousing

The goal of a data warehouse is to support decision making with data. Data mining can be used in conjunction with a data warehouse to help with certain types of decisions. Data mining can be applied to operational databases with individual transactions. To make data mining more efficient, the data warehouse should have an aggregated or summarized collection of data. Data mining helps in extracting meaningful new patterns that cannot necessarily be found by merely querying or processing data or metadata in the data warehouse. Data mining applications should therefore be strongly considered early, during the design of a data warehouse. Also, data mining tools should be designed to facilitate their use in conjunction with data warehouses. In fact, for very large databases running into terabytes of data, successful use of data mining applications will depend first on the construction of a data warehouse.

Data Mining as a Part of the Knowledge Discovery Process

Knowledge Discovery in Databases, frequently abbreviated as KDD, typically encompasses more than data mining. The knowledge discovery process comprises six phases (Note 6): data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.

As an example, consider a transaction database maintained by a specialty consumer goods retailer. Suppose the client data includes a customer name, zip code, phone number, date of purchase, item code, price, quantity, and total amount. A variety of new knowledge can be discovered by KDD processing on this client database. During data selection, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected. The data cleansing process then may correct invalid zip codes or eliminate records with incorrect phone prefixes. Enrichment typically enhances the data with additional sources of information. For example, given the client names and phone numbers, the store may purchase other data about age, income, and credit rating and append them to each record. Data transformation and encoding may be done to reduce the amount of data. For instance, item codes may be grouped in terms of product categories into audio,


video, supplies, electronic gadgets, camera, accessories, and so on. Zip codes may be aggregated into geographic regions, incomes may be divided into ten ranges, and so on. Earlier, in Figure 26.01, we showed a step called cleaning as a precursor to the data warehouse creation. If data mining is based on an existing warehouse for this retail store chain, we would expect that the cleaning has already been applied. It is only after such preprocessing that data mining techniques are used to mine different rules and patterns. For example, the result of mining may be to discover:

• Association rules—e.g., whenever a customer buys video equipment, he or she also buys another electronic gadget

• Sequential patterns—e.g., suppose a customer buys a camera, and within three months he or she buys photographic supplies, and within six months an accessory item. A customer who buys more than twice in the lean periods may be likely to buy at least once during the Christmas period

• Classification trees—e.g., customers may be classified by frequency of visits, by types of financing used, by amount of purchase, or by affinity for types of items, and some revealing statistics may be generated for such classes

We can see that many possibilities exist for discovering new knowledge about buying patterns, relating factors such as age, income group, and place of residence to what and how much the customers purchase. This information can then be utilized to plan additional store locations based on demographics, to run store promotions, to combine items in advertisements, or to plan seasonal marketing strategies. As this retail-store example shows, data mining must be preceded by significant data preparation before it can yield useful information that can directly influence business decisions.

The results of data mining may be reported in a variety of formats, such as listings, graphic outputs, summary tables, or visualizations.

Goals of Data Mining and Knowledge Discovery

Broadly speaking, the goals of data mining fall into the following classes: prediction, identification, classification, and optimization

• Prediction—Data mining can show how certain attributes within the data will behave in the future. Examples of predictive data mining include the analysis of buying transactions to predict what consumers will buy under certain discounts, how much sales volume a store would generate in a given period, and whether deleting a product line would yield more profits. In such applications, business logic is coupled with data mining. In a scientific context, certain seismic wave patterns may predict an earthquake with high probability.

• Identification—Data patterns can be used to identify the existence of an item, an event, or an activity. For example, intruders trying to break into a system may be identified by the programs executed, files accessed, and CPU time per session. In biological applications, existence of a gene may be identified by certain sequences of nucleotide symbols in the DNA sequence. The area known as authentication is a form of identification. It ascertains whether a user is indeed a specific user or one from an authorized class; it involves a comparison of parameters or images or signals against a database.

• Classification—Data mining can partition the data so that different classes or categories can be identified based on combinations of parameters. For example, customers in a supermarket can be categorized into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, and infrequent shoppers. This classification may be used in different analyses of customer buying transactions as a post-mining activity. Sometimes classification based on common domain knowledge is used as an input to decompose the mining problem and make it simpler. For instance, health foods, party foods, or school lunch foods are distinct categories in the supermarket business. It makes sense to analyze relationships within and across categories as separate problems. Such categorization may be used to encode the data appropriately before subjecting it to further data mining.


• Optimization—One eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales or profits under a given set of constraints. As such, this goal of data mining resembles the objective function used in operations research problems that deal with optimization under constraints.

The term data mining is currently used in a very broad sense. In some situations it includes statistical analysis and constrained optimization as well as machine learning. There is no sharp line separating data mining from these disciplines. It is beyond our scope, therefore, to discuss in detail the entire range of applications that make up this vast body of work.

Types of Knowledge Discovered during Data Mining

The term "knowledge" is very broadly interpreted as involving some degree of intelligence Knowledge

is often classified as inductive and deductive We discussed discovery of deductive knowledge in Chapter 25 Data mining addresses inductive knowledge Knowledge can be represented in many forms: in an unstructured sense, it can be represented by rules, or propositional logic In a structured form, it may be represented in decision trees, semantic networks, neural networks, or hierarchies of classes or frames The knowledge discovered during data mining can be described in five ways, as follows

1. Association rules—These rules correlate the presence of a set of items with another range of values for another set of variables. Examples: (1) When a female retail shopper buys a handbag, she is likely to buy shoes. (2) An X-ray image containing characteristics a and b is likely to also exhibit characteristic c.

2. Classification hierarchies—The goal is to work from an existing set of events or transactions to create a hierarchy of classes. Examples: (1) A population may be divided into five ranges of credit worthiness based on a history of previous credit transactions. (2) A model may be developed for the factors that determine the desirability of location of a store on a 1–10 scale. (3) Mutual funds may be classified based on performance data using characteristics such as growth, income, and stability.

3. Sequential patterns—A sequence of actions or events is sought. Example: If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months. Detection of sequential patterns is equivalent to detecting association among events with certain temporal relationships.

4. Patterns within time series—Similarities can be detected within positions of the time series. Three examples follow with the stock market price data as a time series: (1) Stocks of a utility company ABC Power and a financial company XYZ Securities show the same pattern during 1998 in terms of closing stock price. (2) Two products show the same selling pattern in summer but a different one in winter. (3) A pattern in solar magnetic wind may be used to predict changes in earth atmospheric conditions.

5. Categorization and segmentation—A given population of events or items can be partitioned (segmented) into sets of "similar" elements. Examples: (1) An entire population of treatment data on a disease may be divided into groups based on the similarity of side effects produced. (2) The adult population in the United States may be categorized into five groups from "most likely to buy" to "least likely to buy" a new product. (3) The web accesses made by a collection of users against a set of documents (say, in a digital library) may be analyzed in terms of the keywords of documents to reveal clusters or categories of users.

For most applications, the desired knowledge is a combination of the above types. We expand on each of the above knowledge types in the following subsections.


26.2.2 Association Rules

Basic Algorithms for Finding Association Rules

Association Rules among Hierarchies

Negative Associations

Additional Considerations for Association Rules

One of the major technologies in data mining involves the discovery of association rules. The database is regarded as a collection of transactions, each involving a set of items. A common example is that of market-basket data. Here the market basket corresponds to what a consumer buys in a supermarket during one visit. Consider four such transactions in a random sample:

An association rule is of the form

X ⇒ Y,

where X = {x1, x2, ..., xn} and Y = {y1, y2, ..., ym} are sets of items, with xi and yj being distinct items for all i and all j. This

association states that if a customer buys X, he or she is also likely to buy Y. In general, any association rule has the form LHS ⇒ RHS, where LHS (the left-hand side) and RHS (the right-hand side) are sets of items. Association rules should supply both support and confidence.

The support for the rule LHS ⇒ RHS is the percentage of transactions that hold all of the items in the union, the set LHS ∪ RHS. If the support is low, it implies that there is no overwhelming evidence that items in LHS ∪ RHS occur together, because the union happens in only a small fraction of transactions. The rule Milk ⇒ Juice has 50% support, while Bread ⇒ Juice has only 25% support. Another term for support is prevalence of the rule.

To compute confidence we consider all transactions that include items in LHS. The confidence for the association rule LHS ⇒ RHS is the percentage (fraction) of such transactions that also include RHS. Another term for confidence is strength of the rule.

For Milk ⇒ Juice, the confidence is 66.7% (meaning that, of the three transactions in which milk occurs, two contain juice), and Bread ⇒ Juice has 50% confidence (meaning that one of the two transactions containing bread also contains juice).
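A minimal sketch of the support and confidence computation follows. The four market-basket transactions below are an assumed sample, chosen only to be consistent with the figures quoted above (Milk ⇒ Juice at 50% support and 66.7% confidence, Bread ⇒ Juice at 25% support and 50% confidence).

# Four assumed market-basket transactions (each a set of items).
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies"},
]

def support(lhs, rhs, transactions):
    """Fraction of transactions containing every item in LHS union RHS."""
    items = set(lhs) | set(rhs)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Of the transactions containing LHS, the fraction that also contain RHS."""
    with_lhs = [t for t in transactions if set(lhs) <= t]
    return sum(set(rhs) <= t for t in with_lhs) / len(with_lhs)

print(support({"milk"}, {"juice"}, transactions))      # 0.5   (50% support)
print(confidence({"milk"}, {"juice"}, transactions))   # 0.666... (66.7% confidence)
print(support({"bread"}, {"juice"}, transactions))     # 0.25  (25% support)
print(confidence({"bread"}, {"juice"}, transactions))  # 0.5   (50% confidence)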

As we can see, support and confidence do not necessarily go hand in hand. The goal of mining association rules, then, is to generate all possible rules that exceed some minimum user-specified support and confidence thresholds. The problem is thus decomposed into two subproblems:
