Joe Celko's SQL for Smarties: Advanced SQL Programming


CHAPTER 27: SUBSETS

-- fourth step: distinct missing ifcs that are also missing clubs
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id), MIN(Memberships.club_name), Memberships.ifc
  FROM Memberships, MissingClubs, MissingIfcs
 WHERE Memberships.club_name = MissingClubs.club_name
   AND Memberships.ifc = MissingIfcs.ifc
 GROUP BY Memberships.ifc;

-- fifth step: remaining missing ifcs
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id), MIN(Memberships.club_name), Memberships.ifc
  FROM Memberships, MissingIfcs
 WHERE Memberships.ifc = MissingIfcs.ifc
 GROUP BY Memberships.ifc;

-- sixth step: remaining missing clubs
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id), Memberships.club_name, MIN(Memberships.ifc)
  FROM Memberships, MissingClubs
 WHERE Memberships.club_name = MissingClubs.club_name
 GROUP BY Memberships.club_name;

We can check the candidate rows for redundancy removal with the two views that were created earlier to be sure.


CHAPTER 28: Trees and Hierarchies in SQL

I HAVE A SEPARATE book (Joe Celko’s Trees and Hierarchies in SQL for Smarties, 2004) devoted to this topic in great detail, so this chapter will be a very quick discussion of the three major approaches to modeling trees and hierarchies in SQL.

A tree is a special kind of directed graph. Graphs are data structures that are made up of nodes (usually shown as boxes) connected by edges (usually shown as lines with arrowheads). Each edge represents a one-way relationship between the two nodes it connects. In an organizational chart, the nodes are positions that can be filled by employees, and each edge is the “reports to” relationship.

In a parts explosion (also called a bill of materials), the nodes are assembly units that eventually resolve down to individual parts from inventory, and each edge is the “is made of” relationship.

The top of the tree is called the root. In an organizational chart, it is the highest authority; in a parts explosion, it is the final assembly. The number of edges coming out of a node is its outdegree, and the number of edges entering it is its indegree. A binary tree is one in which a parent can have at most two children; more generally, an n-ary tree is one in which a node can have at most outdegree n.

The nodes of the tree that have no subtrees beneath them are called the leaf nodes. In a parts explosion, they are the individual parts, which cannot be broken down any further. The descendants, or children, of a node (the parent) are every node in the subtree that has the parent node as its root.

There are several ways to define a tree: it is a graph with no cycles; it is a graph where all nodes except the root have indegree one and the root has indegree zero. Another defining property is that a path can be found from the root to any other node in the tree by following the edges in their natural direction.

The tree structure and the nodes are very different things and therefore should be modeled in separate tables. But I am going to violate that design rule and use an abstract tree in this chapter (see Figure 28.1).

This little tree is small enough that you can remember what it looks like as you read the rest of this chapter. It will illustrate the various techniques discussed here. I will use the terms “child,” “parent,” and “node,” but you may see other terms used in various books on graphs.

28.1 Adjacency List Model

Most SQL databases use the adjacency list model for two reasons. The first reason is that Dr. Codd came up with it in the early days of the relational model, and nobody thought about it after that. The second reason is that the adjacency list is a way of “faking” pointer chains, the traditional programming method in procedural languages for handling trees. It is a recording of the edges in a “boxes and arrows” diagram, something like this simple table:

CREATE TABLE AdjTree
(child CHAR(2) NOT NULL,
 parent CHAR(2), -- null is root
 PRIMARY KEY (child, parent));

Figure 28.1 An Abstract Tree Model.

AdjTree
child  parent
=============
'A'    NULL
'B'    'A'
'C'    'A'
'D'    'C'
'E'    'C'
'F'    'C'

The queries for the leaf nodes and root are obvious. The root has a NULL parent, and the leaf nodes have no subordinates. Each row models two nodes that share an adjacent edge in a directed graph. The adjacency list model is both the most common and the worst possible tree model.

On the other hand, it is the best way to model any general graph.
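The root and leaf queries are described as obvious but are not written out. A minimal sketch follows, assuming SQLite driven from Python's sqlite3 module rather than a Standard SQL engine. The helper names (build_adj_tree, find_root, find_leaves) are illustrative and not from the text, and keying the table on child alone is an assumption made for this sketch.

```python
import sqlite3


def build_adj_tree():
    """Create an in-memory AdjTree table holding the chapter's sample tree."""
    conn = sqlite3.connect(":memory:")
    # Assumption: one row per node, so child alone serves as the key here.
    conn.execute(
        "CREATE TABLE AdjTree (child TEXT NOT NULL PRIMARY KEY, parent TEXT)"
    )
    conn.executemany(
        "INSERT INTO AdjTree (child, parent) VALUES (?, ?)",
        [("A", None), ("B", "A"), ("C", "A"),
         ("D", "C"), ("E", "C"), ("F", "C")],
    )
    return conn


def find_root(conn):
    """The root is the row whose parent is NULL."""
    sql = "SELECT child FROM AdjTree WHERE parent IS NULL ORDER BY child"
    return [row[0] for row in conn.execute(sql)]


def find_leaves(conn):
    """A leaf node never appears in the parent column."""
    sql = """
        SELECT child
          FROM AdjTree
         WHERE child NOT IN (SELECT parent
                               FROM AdjTree
                              WHERE parent IS NOT NULL)
         ORDER BY child
    """
    return [row[0] for row in conn.execute(sql)]


conn = build_adj_tree()
print(find_root(conn))    # ['A']
print(find_leaves(conn))  # ['B', 'D', 'E', 'F']
```

The IS NOT NULL filter inside the subquery matters: without it, the NULL parent of the root would make NOT IN evaluate to unknown for every row and the leaf query would return nothing.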

28.1.1 Complex Constraints

The first problem is that the adjacency list model requires complex constraints to maintain any data integrity. In practice, the usual solution is to ignore the problems and hope that nothing bad happens to the structure. But if you care about data integrity, you need to be sure that:

1. There is only one root node.

CREATE TABLE AdjTree
(child CHAR(2) NOT NULL,
 parent CHAR(2), -- null is root
 PRIMARY KEY (child, parent),
 CONSTRAINT one_root
 CHECK ((SELECT COUNT(*) FROM AdjTree WHERE parent IS NULL) = 1));

2. There are no cycles. Unfortunately, this cannot be done without a trigger. The trigger code must trace all the paths looking for a cycle. The most obvious constraint to prohibit a single-node cycle in the graph would be:

CHECK (child <> parent) -- cannot be your own father!

But that does not detect cycles of more than one node (n > 1). We know that a tree is a connected graph in which the number of edges is the number of nodes minus one. That constraint looks like this:

CHECK ((SELECT COUNT(*) FROM AdjTree) - 1 -- nodes minus one
       = (SELECT COUNT(parent) FROM AdjTree)) -- edges

The COUNT(parent) will drop the NULL in the root row. That gives us the effect of having a constraint to check for one NULL:

CHECK ((SELECT COUNT(*) FROM AdjTree WHERE parent IS NULL) = 1)

This is a necessary condition, but it is not a sufficient condition. Consider this data, in which ‘D’ and ‘E’ are both in a cycle, and that cycle is not in the tree structure.

Cycle
child  parent
=============
'A'    NULL
'B'    'A'
'C'    'A'
'D'    'E'
'E'    'D'

One approach would be to remove all the leaf nodes and repeat this procedure until the tree is reduced to an empty set. If the tree does not reduce to an empty set, then there is a disconnected cycle.

CREATE FUNCTION TreeTest() RETURNS CHAR(6)
LANGUAGE SQL
BEGIN ATOMIC
DECLARE row_count INTEGER;
SET row_count = (SELECT COUNT(DISTINCT parent) + 1 FROM AdjTree);
-- put a copy in a temporary table
INSERT INTO WorkTree
SELECT child, parent FROM AdjTree;
WHILE row_count > 0
DO DELETE FROM WorkTree -- prune leaf nodes
    WHERE WorkTree.child
          NOT IN (SELECT T2.parent
                    FROM WorkTree AS T2
                   WHERE T2.parent IS NOT NULL);
   SET row_count = row_count - 1;
END WHILE;
IF NOT EXISTS (SELECT * FROM WorkTree)
THEN RETURN ('Tree  '); -- pruned everything
ELSE RETURN ('Cycles'); -- cycles were left
END IF;
END;
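The pruning idea can be tried outside a stored function. The following is a rough sketch, assuming SQLite through Python's sqlite3 module; the function name tree_test and its list-of-pairs argument are hypothetical, and the loop stops when a pass deletes nothing instead of counting iterations as TreeTest() does.

```python
import sqlite3


def tree_test(edges):
    """Prune leaf rows repeatedly; report whether anything survives.

    edges is a list of (child, parent) pairs, with None as the root's parent.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE WorkTree (child TEXT NOT NULL, parent TEXT)")
    conn.executemany("INSERT INTO WorkTree VALUES (?, ?)", edges)
    while True:
        # Delete every row whose child is nobody's parent (a current leaf).
        cur = conn.execute(
            """
            DELETE FROM WorkTree
             WHERE child NOT IN (SELECT parent
                                   FROM WorkTree
                                  WHERE parent IS NOT NULL)
            """
        )
        if cur.rowcount == 0:  # nothing left to prune
            break
    remaining = conn.execute("SELECT COUNT(*) FROM WorkTree").fetchone()[0]
    return "Tree" if remaining == 0 else "Cycles"


good = [("A", None), ("B", "A"), ("C", "A"),
        ("D", "C"), ("E", "C"), ("F", "C")]
bad = [("A", None), ("B", "A"), ("C", "A"), ("D", "E"), ("E", "D")]
print(tree_test(good))  # Tree
print(tree_test(bad))   # Cycles
```

In the second data set, ‘D’ and ‘E’ each appear as the other's parent, so neither is ever a leaf and both rows survive the pruning.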

28.1.2 Procedural Traversal for Queries

The second problem is that the adjacency list model requires that you traverse from node to node to answer any interesting questions, such as “Does Mr. King have any authority over Mr. Jones?” or any other aggregations up and down the tree.

SELECT P1.child, ' parent to ', C1.child

FROM AdjTree AS P1, AdjTree AS C1

WHERE P1.child = C1.parent;

But something is missing here. This query gives only the immediate parent of the node. Your parent’s parent also has authority over you, and so forth, up the tree until we find someone who has no superior. To go two levels deep in the tree, we need to do a more complex self-JOIN, thus:

SELECT B1.child, ' parent to ', E2.child

FROM AdjTree AS B1, AdjTree AS E1, AdjTree AS E2

WHERE B1.child = E1.parent

AND E1.child = E2.parent;

Unfortunately, you have no idea just how deep the tree is, so you must keep extending this query until you get an empty set back as a result. The practical problem is that most SQL compilers will start having serious problems optimizing queries with a large number of tables. The other method is to declare a CURSOR and traverse the tree with procedural code. This is usually painfully slow, but it will work for any depth of tree. It also defeats the purpose of using a nonprocedural language like SQL. With Common Table Expressions in SQL-99, you can also write a query that recursively constructs the transitive closure of the table by hiding the traversal. This feature is not popular yet, and it is still slow compared to the nested sets model.
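A recursive Common Table Expression for the transitive closure might look like the sketch below. It assumes SQLite (which spells the construct WITH RECURSIVE) driven from Python's sqlite3 module; the names Closure, transitive_closure, and has_authority are illustrative and not from the text.

```python
import sqlite3

CLOSURE_SQL = """
WITH RECURSIVE Closure (ancestor, descendant) AS (
    -- seed: every direct parent-to-child edge
    SELECT parent, child
      FROM AdjTree
     WHERE parent IS NOT NULL
    UNION ALL
    -- step: extend each known path by one more edge
    SELECT C.ancestor, T.child
      FROM Closure AS C
      JOIN AdjTree AS T
        ON T.parent = C.descendant
)
SELECT ancestor, descendant
  FROM Closure
 ORDER BY ancestor, descendant
"""


def build_adj_tree():
    """Load the chapter's sample tree into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE AdjTree (child TEXT NOT NULL PRIMARY KEY, parent TEXT)"
    )
    conn.executemany(
        "INSERT INTO AdjTree VALUES (?, ?)",
        [("A", None), ("B", "A"), ("C", "A"),
         ("D", "C"), ("E", "C"), ("F", "C")],
    )
    return conn


def transitive_closure(conn):
    """Return every (ancestor, descendant) pair at any depth."""
    return conn.execute(CLOSURE_SQL).fetchall()


def has_authority(conn, boss, subordinate):
    """True when boss sits somewhere above subordinate in the tree."""
    return (boss, subordinate) in transitive_closure(conn)


conn = build_adj_tree()
print(len(transitive_closure(conn)))  # 8
print(has_authority(conn, "A", "F"))  # True
print(has_authority(conn, "B", "F"))  # False
```

Because the recursive step uses UNION ALL, this query would not terminate on data that contains a cycle, so it presumes the integrity checks from the previous section already hold.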

28.1.3 Altering the Table

Insertion of a new node is the only easy operation in the adjacency list model. You simply do an INSERT INTO statement and check to see that the parent already exists in the table.

Deleting an edge in the middle of a tree will cause the table to become a forest of separate trees. You need some rule for rearranging the structure. The two usual methods are to promote a subordinate to the vacancy (and cascade the vacancy downward) or to assign all the subordinates to their parent’s parent (the orphans go to live with grandparents).
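The second rule, sending the orphans to their grandparent, can be sketched as follows. This assumes SQLite through Python's sqlite3 module; the function name delete_and_promote is hypothetical, and the sketch does not handle deletion of the root.

```python
import sqlite3


def build_adj_tree():
    """Load the chapter's sample tree into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE AdjTree (child TEXT NOT NULL PRIMARY KEY, parent TEXT)"
    )
    conn.executemany(
        "INSERT INTO AdjTree VALUES (?, ?)",
        [("A", None), ("B", "A"), ("C", "A"),
         ("D", "C"), ("E", "C"), ("F", "C")],
    )
    return conn


def delete_and_promote(conn, node):
    """Remove node and reattach its children to node's own parent."""
    with conn:  # one transaction: both statements succeed or neither does
        row = conn.execute(
            "SELECT parent FROM AdjTree WHERE child = ?", (node,)
        ).fetchone()
        if row is None:
            raise ValueError(f"no such node: {node}")
        grandparent = row[0]
        # The orphans go to live with their grandparent.
        conn.execute(
            "UPDATE AdjTree SET parent = ? WHERE parent = ?",
            (grandparent, node),
        )
        conn.execute("DELETE FROM AdjTree WHERE child = ?", (node,))


conn = build_adj_tree()
delete_and_promote(conn, "C")
print(conn.execute("SELECT child, parent FROM AdjTree ORDER BY child").fetchall())
# [('A', None), ('B', 'A'), ('D', 'A'), ('E', 'A'), ('F', 'A')]
```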

Consider what has to happen when a middle-level node is changed. The change must occur in both the child and parent columns.

UPDATE AdjTree
   SET child = CASE WHEN child = 'C' THEN 'C1' ELSE child END,
       parent = CASE WHEN parent = 'C' THEN 'C1' ELSE parent END
 WHERE 'C' IN (parent, child);

28.2 The Path Enumeration Model

The next method for representing hierarchies in SQL was first discussed in detail by Stefan Gustafsson on an Internet site for SQL Server users. Later, Tom Moreau and Itzik Ben-Gan developed it in more detail in their book Advanced Transact-SQL for SQL Server 2000 (first edition, October 2000). This model stores the path from the root to each node as a string at that node.

Of course, we purists might object that this is a denormalized table, since the path is not a scalar value. The worst-case operation you can do in this representation is to alter the root of the tree. We then have to recalculate all the paths in the entire tree. But if the assumption is that structural modifications high in the tree are relatively uncommon, then this might not be a problem. The table for the simple tree we will use for this chapter looks like this:

CREATE TABLE PathTree

(node CHAR(2) NOT NULL PRIMARY KEY,

path VARCHAR (900) NOT NULL);

The example tree would get the following representation:

node path

===========

'A' 'a/'

'B' 'a/b/'

'C' 'a/c/'

'D' 'a/c/d/'

'E' 'a/c/e/'

'F' 'a/c/f/'

What we have done is concatenate the node names and separate them with a slash. All of the operations will depend on string manipulations, so we’d like to have short node identifiers to keep the paths short. We would prefer, but not require, identifiers of one length to make substrings easier.

You have probably recognized this because I used a slash separator; this is a version of the directory paths used in several operating systems such as the UNIX family and Windows.

28.2.1 Finding Subtrees and Nodes

The major trick in this model is the LIKE predicate. The subtree rooted at :my_node is found with this query:

SELECT node

FROM PathTree

WHERE path LIKE '%' || :my_node || '%';

Finding the root node is easy, since that is the substring of any node up to the first slash. However, the leaf nodes are harder.

SELECT T1.node
  FROM PathTree AS T1
 WHERE NOT EXISTS
       (SELECT *
          FROM PathTree AS T2
         WHERE T2.path LIKE T1.path || '_/');
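The subtree, root, and leaf queries can be exercised together. This is a minimal sketch, assuming SQLite through Python's sqlite3 module and one-character node identifiers; the helper names are illustrative. Because the sample paths are lowercase while the node names are uppercase, the sketch lowercases the parameter before matching, which is an assumption about how the two columns relate.

```python
import sqlite3


def build_path_tree():
    """Load the chapter's sample PathTree rows into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PathTree "
        "(node TEXT NOT NULL PRIMARY KEY, path TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO PathTree VALUES (?, ?)",
        [("A", "a/"), ("B", "a/b/"), ("C", "a/c/"),
         ("D", "a/c/d/"), ("E", "a/c/e/"), ("F", "a/c/f/")],
    )
    return conn


def subtree(conn, my_node):
    """Every path that passes through my_node contains '<my_node>/'."""
    sql = """
        SELECT node FROM PathTree
         WHERE path LIKE '%' || LOWER(?) || '/%'
         ORDER BY node
    """
    return [row[0] for row in conn.execute(sql, (my_node,))]


def root(conn):
    """The root's path ends at its first slash."""
    sql = "SELECT node FROM PathTree WHERE INSTR(path, '/') = LENGTH(path)"
    return [row[0] for row in conn.execute(sql)]


def leaves(conn):
    """A leaf has no path that extends its own by one identifier."""
    sql = """
        SELECT T1.node
          FROM PathTree AS T1
         WHERE NOT EXISTS (SELECT *
                             FROM PathTree AS T2
                            WHERE T2.path LIKE T1.path || '_/')
         ORDER BY T1.node
    """
    return [row[0] for row in conn.execute(sql)]


conn = build_path_tree()
print(subtree(conn, "C"))  # ['C', 'D', 'E', 'F']
print(root(conn))          # ['A']
print(leaves(conn))        # ['B', 'D', 'E', 'F']
```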

28.2.2 Finding Levels and Subordinates

The depth of a node is shown by the number of ‘/’ characters in the path string. If you have a REPLACE() that can remove the ‘/’ characters, the difference between the length of the path with and without those characters gives you the level.

CREATE VIEW DetailedTree (node, path, level)
AS SELECT node, path,
          CHAR_LENGTH (path) - CHAR_LENGTH (REPLACE (path, '/', ''))
     FROM PathTree;

The immediate descendants of a given node can be found with this query, if you know the length of the node identifiers. In this sample data, that length is one character:

SELECT :my_node, T2.node
  FROM PathTree AS T1, PathTree AS T2
 WHERE T1.node = :my_node
   AND T2.path LIKE T1.path || '_/';

This can be expanded with ORed LIKE predicates that cover the possible lengths of the node identifiers.
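Both calculations in this section can be checked against the sample data. The sketch below assumes SQLite through Python's sqlite3 module, where LENGTH() plays the role of CHAR_LENGTH(); the function names levels and children are illustrative, and one-character identifiers are assumed.

```python
import sqlite3


def build_path_tree():
    """Load the chapter's sample PathTree rows into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PathTree "
        "(node TEXT NOT NULL PRIMARY KEY, path TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO PathTree VALUES (?, ?)",
        [("A", "a/"), ("B", "a/b/"), ("C", "a/c/"),
         ("D", "a/c/d/"), ("E", "a/c/e/"), ("F", "a/c/f/")],
    )
    return conn


def levels(conn):
    """Level = number of slashes = length with minus length without them."""
    sql = """
        SELECT node, LENGTH(path) - LENGTH(REPLACE(path, '/', ''))
          FROM PathTree
         ORDER BY node
    """
    return dict(conn.execute(sql).fetchall())


def children(conn, my_node):
    """Immediate subordinates: parent's path plus exactly one identifier."""
    sql = """
        SELECT T2.node
          FROM PathTree AS T1, PathTree AS T2
         WHERE T1.node = ?
           AND T2.path LIKE T1.path || '_/'
         ORDER BY T2.node
    """
    return [row[0] for row in conn.execute(sql, (my_node,))]


conn = build_path_tree()
print(levels(conn))         # {'A': 1, 'B': 2, 'C': 2, 'D': 3, 'E': 3, 'F': 3}
print(children(conn, "C"))  # ['D', 'E', 'F']
```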

28.2.3 Deleting Nodes and Subtrees

This is a bit weird at first, because the removal of a node requires that you first update all the paths. Let us delete node ‘B’ in the sample tree:

BEGIN ATOMIC
UPDATE PathTree
   SET path = REPLACE (path, 'b/', '')
 WHERE POSITION ('b/' IN path) > 0;
DELETE FROM PathTree
 WHERE node = 'B';
END;

Deleting a subtree rooted at :my_node is actually simpler:

DELETE FROM PathTree
 WHERE path LIKE (SELECT path
                    FROM PathTree
                   WHERE node = :my_node) || '%';
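Deleting a middle-level node shows the path update more clearly than deleting the leaf ‘B’. The following sketch assumes SQLite through Python's sqlite3 module and one-character identifiers (with longer identifiers, a plain REPLACE could hit partial matches). The names delete_node and delete_subtree are hypothetical, and the subtree path is fetched in a separate statement, a choice made for this sketch only.

```python
import sqlite3


def build_path_tree():
    """Load the chapter's sample PathTree rows into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PathTree "
        "(node TEXT NOT NULL PRIMARY KEY, path TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO PathTree VALUES (?, ?)",
        [("A", "a/"), ("B", "a/b/"), ("C", "a/c/"),
         ("D", "a/c/d/"), ("E", "a/c/e/"), ("F", "a/c/f/")],
    )
    return conn


def delete_node(conn, node):
    """Cut one node out of every path, then remove its own row."""
    step = node.lower() + "/"  # sample paths are stored in lowercase
    with conn:  # both statements run in one transaction
        conn.execute(
            "UPDATE PathTree SET path = REPLACE(path, ?, '') "
            "WHERE INSTR(path, ?) > 0",
            (step, step),
        )
        conn.execute("DELETE FROM PathTree WHERE node = ?", (node,))


def delete_subtree(conn, my_node):
    """Remove my_node and everything whose path starts with its path."""
    row = conn.execute(
        "SELECT path FROM PathTree WHERE node = ?", (my_node,)
    ).fetchone()
    if row is None:
        return
    with conn:
        conn.execute("DELETE FROM PathTree WHERE path LIKE ? || '%'", (row[0],))


conn = build_path_tree()
delete_node(conn, "C")
print(conn.execute("SELECT node, path FROM PathTree ORDER BY node").fetchall())
# [('A', 'a/'), ('B', 'a/b/'), ('D', 'a/d/'), ('E', 'a/e/'), ('F', 'a/f/')]
```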

28.2.4 Integrity Constraints

If a path has the same node in it twice, then there is a cycle in the graph. We can use a VIEW with just the node names in it to some advantage here.

CHECK (NOT EXISTS
       (SELECT *
          FROM NodeList AS D1, PathTree AS P1
         WHERE CHAR_LENGTH (REPLACE (P1.path, D1.node, ''))
               < (CHAR_LENGTH (P1.path) - CHAR_LENGTH (D1.node))))

Unfortunately, a subquery in a constraint is not widely implemented yet.
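Where a subquery in a CHECK() is unavailable, the same test can be run as an ordinary query. This is a rough sketch, assuming SQLite through Python's sqlite3 module and one-character identifiers; it joins PathTree to itself in place of the NodeList view, matches each identifier together with its trailing slash, and the function name nodes_with_cyclic_paths is hypothetical.

```python
import sqlite3

# A path is suspect when removing one identifier (plus its slash) shortens
# it by more than a single occurrence would.
REPEAT_SQL = """
SELECT DISTINCT P1.node
  FROM PathTree AS P1, PathTree AS D1
 WHERE LENGTH(REPLACE(P1.path, LOWER(D1.node) || '/', ''))
       < LENGTH(P1.path) - LENGTH(D1.node) - 1
 ORDER BY P1.node
"""


def nodes_with_cyclic_paths(rows):
    """Return the nodes whose path mentions some node more than once.

    rows is a list of (node, path) pairs using lowercase paths.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PathTree "
        "(node TEXT NOT NULL PRIMARY KEY, path TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO PathTree VALUES (?, ?)", rows)
    return [row[0] for row in conn.execute(REPEAT_SQL)]


sample = [("A", "a/"), ("B", "a/b/"), ("C", "a/c/"),
          ("D", "a/c/d/"), ("E", "a/c/e/"), ("F", "a/c/f/")]
print(nodes_with_cyclic_paths(sample))                       # []
print(nodes_with_cyclic_paths(sample + [("G", "a/c/a/g/")]))  # ['G']
```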

28.3 Nested Set Model of Hierarchies

Since SQL is a set-oriented language, the nested set model is a better model for the approach discussed here. If you have used HTML, XML, or a language with a block structure, then you understand the basic idea of this model. The lft and rgt columns (their names are abbreviations for “left” and “right,” which are reserved words in Standard SQL) are the count of the “tags” in an XML representation of a tree.

Imagine circles inside circles without any of them overlapping, the way you would draw a markup language structure. This has some predictable results that we can use for building queries, as shown in Figures 28.2, 28.3, and 28.4.

If that mental model does not work for you, to convert the “boxes and arrows” graph into a nested set model, think of a little worm crawling along the tree. The worm starts at the top, the root, and makes a complete trip around the tree. When he comes to a node, he puts a number in the cell on the side that he is visiting and increments his counter. Each node will get two numbers, one for the right side and one for the left.
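The worm's trip is a depth-first walk with a running counter. A minimal sketch in Python follows; the function name nested_sets, the (child, parent) input format, and the alphabetical visiting order for siblings are all assumptions made for illustration.

```python
def nested_sets(edges, root):
    """Number each node as the worm passes its left and right sides.

    edges is a list of (child, parent) pairs, with None as the root's parent.
    Returns {node: (lft, rgt)}.
    """
    children = {}
    for child, parent in edges:
        if parent is not None:
            children.setdefault(parent, []).append(child)

    numbers = {}
    counter = 1

    def visit(node):
        nonlocal counter
        lft = counter              # the worm arrives on the left side
        counter += 1
        for kid in sorted(children.get(node, [])):
            visit(kid)             # crawl around each subtree in turn
        numbers[node] = (lft, counter)  # the worm leaves on the right side
        counter += 1

    visit(root)
    return numbers


edges = [("A", None), ("B", "A"), ("C", "A"),
         ("D", "C"), ("E", "C"), ("F", "C")]
print(nested_sets(edges, "A"))
# {'B': (2, 3), 'D': (5, 6), 'E': (7, 8), 'F': (9, 10), 'C': (4, 11), 'A': (1, 12)}
```

Six nodes consume twelve numbers, and a leaf always has rgt = lft + 1, which is one of the predictable results the model gives.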
