CHAPTER 27: SUBSETS
-- fourth step: distinct missing ifcs that are also missing clubs
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id), MIN(Memberships.club_name), Memberships.ifc
  FROM Memberships, MissingClubs, MissingIfcs
 WHERE Memberships.club_name = MissingClubs.club_name
   AND Memberships.ifc = MissingIfcs.ifc
 GROUP BY Memberships.ifc;
-- fifth step: remaining missing ifcs
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id), MIN(Memberships.club_name), Memberships.ifc
  FROM Memberships, MissingIfcs
 WHERE Memberships.ifc = MissingIfcs.ifc
 GROUP BY Memberships.ifc;
-- sixth step: remaining missing clubs
INSERT INTO Samples (member_id, club_name, ifc)
SELECT MIN(Memberships.member_id), Memberships.club_name, MIN(Memberships.ifc)
  FROM Memberships, MissingClubs
 WHERE Memberships.club_name = MissingClubs.club_name
 GROUP BY Memberships.club_name;
To be sure, we can check the candidate rows for redundancy removal with the two views that were created earlier.
CHAPTER 28
Trees and Hierarchies in SQL
I have a separate book (Joe Celko’s Trees and Hierarchies in SQL for Smarties, 2004) devoted to this topic in great detail, so this chapter will be a very quick discussion of the three major approaches to modeling trees and hierarchies in SQL.
A tree is a special kind of directed graph. Graphs are data structures that are made up of nodes (usually shown as boxes) connected by edges (usually shown as lines with arrowheads). Each edge represents a one-way relationship between the two nodes it connects. In an organizational chart, the nodes are positions that can be filled by employees, and each edge is the “reports to” relationship. In a parts explosion (also called a bill of materials), the nodes are assembly units that eventually resolve down to individual parts from inventory, and each edge is the “is made of” relationship.
The top of the tree is called the root. In an organizational chart, it is the highest authority; in a parts explosion, it is the final assembly. The number of edges coming out of a node is its outdegree, and the number of edges entering it is its indegree. A binary tree is one in which a parent can have at most two children; more generally, an n-ary tree is one in which a node can have at most outdegree n.
The nodes of the tree that have no subtrees beneath them are called the leaf nodes. In a parts explosion, they are the individual parts, which cannot be broken down any further. The descendants, or children, of a node (the parent) are every node in the subtree that has the parent node as its root.
There are several ways to define a tree: it is a graph with no cycles; it is a graph where all nodes except the root have indegree one and the root has indegree zero. Another defining property is that a path can be found from the root to any other node in the tree by following the edges in their natural direction.
The tree structure and the nodes are very different things and therefore should be modeled in separate tables. But I am going to violate that design rule in this chapter and use an abstract tree (see Figure 28.1).
This little tree is small enough that you can remember what it looks like as you read the rest of this chapter. It will illustrate the various techniques discussed here. I will use the terms “child,” “parent,” and “node,” but you may see other terms used in various books on graphs.
28.1 Adjacency List Model
Most SQL databases use the adjacency list model for two reasons. The first reason is that Dr. Codd came up with it in the early days of the relational model, and nobody thought about it after that. The second reason is that the adjacency list is a way of “faking” pointer chains, the traditional programming method in procedural languages for handling trees. It is a recording of the edges in a “boxes and arrows” diagram, something like this simple table:
CREATE TABLE AdjTree
(child CHAR(2) NOT NULL,
 parent CHAR(2), -- null is root
 PRIMARY KEY (child)); -- each node has exactly one parent
Figure 28.1 An Abstract Tree Model.
AdjTree
child parent
=============
'A'   NULL
'B'   'A'
'C'   'A'
'D'   'C'
'E'   'C'
'F'   'C'
The queries for the leaf nodes and root are obvious. The root has a NULL parent, and the leaf nodes have no subordinates. Each row models two nodes that share an adjacent edge in a directed graph. The adjacency list model is both the most common and the worst possible tree model. On the other hand, it is the best way to model any general graph.
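As a minimal sketch of those two "obvious" queries, assuming the AdjTree table and sample rows shown above, the root and the leaf nodes might be found like this. The leaf query uses NOT EXISTS rather than NOT IN so that the NULL in the root row cannot disturb the result.

```sql
-- Root: the one row whose parent is NULL.
SELECT child AS root_node
  FROM AdjTree
 WHERE parent IS NULL;

-- Leaf nodes: nodes that never appear as anybody's parent.
SELECT T1.child AS leaf_node
  FROM AdjTree AS T1
 WHERE NOT EXISTS
       (SELECT *
          FROM AdjTree AS T2
         WHERE T2.parent = T1.child);
```

On the sample data, the first query returns 'A' and the second returns 'B', 'D', 'E', and 'F'.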
28.1.1 Complex Constraints
The first problem is that the adjacency list model requires complex constraints to maintain any data integrity. In practice, the usual solution is to ignore the problems and hope that nothing bad happens to the structure. But if you care about data integrity, you need to be sure that:
1. There is only one root node.
CREATE TABLE AdjTree
(child CHAR(2) NOT NULL,
 parent CHAR(2), -- null is root
 PRIMARY KEY (child), -- each node has exactly one parent
 CONSTRAINT one_root
 CHECK ((SELECT COUNT(*) FROM AdjTree WHERE parent IS NULL) = 1));
2. There are no cycles. Unfortunately, this cannot be done without a trigger. The trigger code must trace all the paths looking for a cycle. The most obvious constraint to prohibit a single-node cycle in the graph would be:
CHECK (child <> parent) -- cannot be your own father!
But that does not detect cycles of two or more nodes. We know that the number of edges in a tree is the number of nodes minus one, so this is a connected graph. That constraint looks like this:
CHECK ((SELECT COUNT(*) FROM AdjTree) - 1 -- nodes minus one
 = (SELECT COUNT(parent) FROM AdjTree)) -- edges
The COUNT(parent) will drop the NULL in the root row. That gives us the effect of having a constraint to check for one NULL:
CHECK ((SELECT COUNT(*) FROM AdjTree WHERE parent IS NULL) = 1)
This is a necessary condition, but it is not a sufficient condition. Consider this data, in which ‘D’ and ‘E’ are both in a cycle, and that cycle is not in the tree structure.
Cycle
child parent
===========
'A'   NULL
'B'   'A'
'C'   'A'
'D'   'E'
'E'   'D'
One approach would be to remove all the leaf nodes and repeat this procedure until the tree is reduced to an empty set. If the tree does not reduce to an empty set, then there is a disconnected cycle.
CREATE FUNCTION TreeTest() RETURNS CHAR(6)
LANGUAGE SQL
BEGIN ATOMIC
DECLARE row_count INTEGER;
-- upper bound on the number of pruning passes
SET row_count = (SELECT COUNT(DISTINCT parent) + 1 FROM AdjTree);
-- put a copy in a temporary table
INSERT INTO WorkTree
SELECT child, parent FROM AdjTree;
WHILE row_count > 0
DO DELETE FROM WorkTree -- prune leaf nodes
    WHERE WorkTree.child
          NOT IN (SELECT T2.parent
                    FROM WorkTree AS T2
                   WHERE T2.parent IS NOT NULL);
   SET row_count = row_count - 1;
END WHILE;
IF NOT EXISTS (SELECT * FROM WorkTree)
THEN RETURN ('Tree  '); -- pruned everything
ELSE RETURN ('Cycles'); -- cycles were left
END IF;
END;
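Where recursive common table expressions are available, the same disconnected rows can be found without a loop. The following is only a sketch under stated assumptions: it uses the AdjTree table from above, the standard WITH RECURSIVE spelling (some products drop the RECURSIVE keyword), and a CTE name, Reachable, that I made up for the illustration. It walks down from the root and then lists whatever was never visited. UNION rather than UNION ALL is used so that the walk stops even if the data is dirty.

```sql
-- Nodes that cannot be reached from the root by following
-- parent-to-child edges. An empty result means every row
-- hangs off the root.
WITH RECURSIVE Reachable (node) AS
(SELECT child
   FROM AdjTree
  WHERE parent IS NULL
 UNION
 SELECT T.child
   FROM AdjTree AS T, Reachable AS R
  WHERE T.parent = R.node)
SELECT child
  FROM AdjTree
 WHERE child NOT IN (SELECT node FROM Reachable);
```

If AdjTree held the five rows of the Cycle example, this query would return 'D' and 'E', the two nodes caught in the disconnected cycle.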
28.1.2 Procedural Traversal for Queries
The second problem is that the adjacency list model requires that you traverse from node to node to answer any interesting questions, such as “Does Mr. King have any authority over Mr. Jones?” or any other aggregations up and down the tree.
SELECT P1.child, ' parent to ', C1.child
FROM AdjTree AS P1, AdjTree AS C1
WHERE P1.child = C1.parent;
But something is missing here. This query gives only the immediate parent of the node. Your parent’s parent also has authority over you, and so forth, up the tree until we find someone who has no parent. To go two levels deep in the tree, we need to do a more complex self-JOIN, thus:
SELECT B1.child, ' parent to ', E2.child
FROM AdjTree AS B1, AdjTree AS E1, AdjTree AS E2
WHERE B1.child = E1.parent
AND E1.child = E2.parent;
Unfortunately, you have no idea just how deep the tree is, so you must keep extending this query until you get an empty set back as a result. The practical problem is that most SQL compilers will start having serious problems optimizing queries with a large number of tables.
The other method is to declare a CURSOR and traverse the tree with procedural code. This is usually painfully slow, but it will work for any depth of tree. It also defeats the purpose of using a nonprocedural language like SQL. With Common Table Expressions in SQL-99, you can also write a query that recursively constructs the transitive closure of the table by hiding the traversal. This feature is not popular yet, and it is still slow compared to the nested sets model.
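To make the recursive option concrete, here is a minimal sketch of a transitive closure over AdjTree. The CTE name Closure and its column names are my own choices for the illustration, and the RECURSIVE keyword follows the standard spelling, which some products omit. The query assumes the table really is a tree; on data with a cycle, the UNION ALL form would not terminate.

```sql
-- Every (ancestor, descendant) pair, with the number of
-- edges between them.
WITH RECURSIVE Closure (ancestor, descendant, depth) AS
(SELECT parent, child, 1
   FROM AdjTree
  WHERE parent IS NOT NULL
 UNION ALL
 SELECT C.ancestor, T.child, C.depth + 1
   FROM Closure AS C, AdjTree AS T
  WHERE T.parent = C.descendant)
SELECT ancestor, descendant, depth
  FROM Closure
 ORDER BY ancestor, descendant;
```

On the sample tree this produces eight rows: 'A' paired with 'B' and 'C' at depth 1 and with 'D', 'E', and 'F' at depth 2, then 'C' paired with 'D', 'E', and 'F' at depth 1. The authority question then becomes a simple lookup for one (ancestor, descendant) pair.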
28.1.3 Altering the Table
Insertion of a new node is the only easy operation in the adjacency list model. You simply do an INSERT INTO statement and check to see that the parent already exists in the table.
Deleting an edge in the middle of a tree will cause the table to become a forest of separate trees. You need some rule for rearranging the structure. The two usual methods are to promote a subordinate to the vacancy (and cascade the vacancy downward) or to assign all the subordinates to their parent’s parent (the orphans go to live with grandparents).
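The second rule, where the orphans go to live with their grandparents, might look like the following sketch. It assumes the AdjTree table and sample rows from earlier in this section and removes node 'C'. The two statements should run inside one transaction. Note that some products refuse a subquery on the table being updated, and that removing the root this way would leave several rows with a NULL parent.

```sql
-- Step 1: hand the children of 'C' to the parent of 'C'.
UPDATE AdjTree
   SET parent = (SELECT T2.parent
                   FROM AdjTree AS T2
                  WHERE T2.child = 'C')
 WHERE parent = 'C';

-- Step 2: remove the vacated node.
DELETE FROM AdjTree
 WHERE child = 'C';
```

Afterward, 'D', 'E', and 'F' all report directly to 'A'.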
Consider what has to happen when a middle-level node is changed. The change must occur in both the child and parent columns.
UPDATE AdjTree
   SET child = CASE WHEN child = 'C' THEN 'C1' ELSE child END,
       parent = CASE WHEN parent = 'C' THEN 'C1' ELSE parent END
 WHERE 'C' IN (parent, child);
28.2 The Path Enumeration Model
The next method for representing hierarchies in SQL was first discussed in detail by Stefan Gustafsson on an Internet site for SQL Server users. Later, Tom Moreau and Itzik Ben-Gan developed it in more detail in their book Advanced Transact-SQL for SQL Server 2000 (first edition, October 2000). This model stores the path from the root to each node as a string at that node.
Of course, we purists might object that this is a denormalized table, since the path is not a scalar value. The worst-case operation you can do in this representation is to alter the root of the tree. We then have to recalculate all the paths in the entire tree. But if the assumption is that structural modifications high in the tree are relatively uncommon, then this might not be a problem. The table for the simple tree we will use for this chapter looks like this:
CREATE TABLE PathTree
(node CHAR(2) NOT NULL PRIMARY KEY,
path VARCHAR (900) NOT NULL);
The example tree would get the following representation:
node path
===========
'A' 'a/'
'B' 'a/b/'
'C' 'a/c/'
'D' 'a/c/d/'
'E' 'a/c/e/'
'F' 'a/c/f/'
What we have done is concatenate the node names and separate them with a slash. All of the operations will depend on string manipulations, so we’d like to have short node identifiers to keep the paths short. We would prefer, but not require, identifiers of a single fixed length to make substrings easier.
You have probably recognized this because I used a slash separator; this is a version of the directory paths used in several operating systems such as the UNIX family and Windows.
28.2.1 Finding Subtrees and Nodes
The major trick in this model is the LIKE predicate. The subtree rooted at :my_node is found with this query:
SELECT node
FROM PathTree
WHERE path LIKE '%' || :my_node || '%';
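A pattern with a wildcard on both sides matches the node name anywhere in the string, which can pick up false hits once identifiers are longer than one character. A variation, sketched here under the assumption that every path ends with a slash as in the sample data, is to look up the path of :my_node first and use it as a prefix:

```sql
-- Subtree rooted at :my_node, including the node itself.
-- Every descendant's path starts with the path of its ancestor.
SELECT T2.node
  FROM PathTree AS T1, PathTree AS T2
 WHERE T1.node = :my_node
   AND T2.path LIKE T1.path || '%';
```

With :my_node set to 'C' on the sample table, the prefix is 'a/c/' and the result is 'C', 'D', 'E', and 'F'. The sketch also assumes that node identifiers never contain the LIKE wildcard characters '%' or '_'.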
Finding the root node is easy, since that is the substring of any path up to the first slash. However, the leaf nodes are harder. A leaf is a node whose path is not a proper prefix of any other path:
SELECT T1.node
  FROM PathTree AS T1
 WHERE NOT EXISTS
       (SELECT *
          FROM PathTree AS T2
         WHERE T2.path LIKE T1.path || '_%');
28.2.2 Finding Levels and Subordinates
The depth of a node is shown by the number of ‘/’ characters in the path string. If you have a REPLACE() that can remove the ‘/’ characters, the difference between the length of the path with and without those characters gives you the level.
CREATE VIEW DetailedTree (node, path, level)
AS SELECT node, path,
          CHAR_LENGTH(path) - CHAR_LENGTH(REPLACE(path, '/', ''))
     FROM PathTree;
The immediate descendants of a given node can be found with this query, if you know the length of the node identifiers. In this sample data, that length is one character:
SELECT :my_node, T2.node
  FROM PathTree AS T1, PathTree AS T2
 WHERE T1.node = :my_node
   AND T2.path LIKE T1.path || '_/';
This can be expanded with ORed LIKE predicates that cover the possible lengths of the node identifiers.
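If the identifier lengths are not known in advance, one alternative to a string of ORed predicates is a pair of patterns. This is a sketch of my own rather than part of the original method: the first pattern accepts any path that extends the parent's path, and the second rejects paths that extend it by more than one step. It assumes that identifiers never contain a slash or a LIKE wildcard character.

```sql
-- Immediate children of :my_node, for identifiers of any length.
SELECT T1.node AS parent_node, T2.node AS child_node
  FROM PathTree AS T1, PathTree AS T2
 WHERE T1.node = :my_node
   AND T2.path LIKE T1.path || '_%/'          -- at least one step below
   AND T2.path NOT LIKE T1.path || '%/_%/';   -- but not two or more
```

With :my_node set to 'A' on the sample table, the result is 'B' and 'C'; the grandchildren 'D', 'E', and 'F' are rejected by the second pattern.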
28.2.3 Deleting Nodes and Subtrees
This is a bit weird at first, because the removal of a node requires that you first update all the paths. Let us delete node ‘B’ in the sample tree:
BEGIN ATOMIC
UPDATE PathTree
   SET path = REPLACE(path, 'b/', '')
 WHERE POSITION('b/' IN path) > 0;
DELETE FROM PathTree
 WHERE node = 'B';
END;
Deleting a subtree rooted at :my_node is actually simpler:
DELETE FROM PathTree
 WHERE path LIKE (SELECT path
                    FROM PathTree
                   WHERE node = :my_node) || '%';
28.2.4 Integrity Constraints
If a path has the same node in it twice, then there is a cycle in the graph. We can use a VIEW with just the node names in it to some advantage here. Removing every occurrence of a node name from a path shortens the path by more than one name length only when the name appears more than once:
CHECK (NOT EXISTS
       (SELECT *
          FROM NodeList AS D1, PathTree AS P1
         WHERE CHAR_LENGTH(REPLACE(P1.path, D1.node, ''))
               < (CHAR_LENGTH(P1.path) - CHAR_LENGTH(D1.node))))
Unfortunately, a subquery in a constraint is not widely implemented yet.
28.3 Nested Set Model of Hierarchies
Since SQL is a set-oriented language, the nested set model is a better model for the approach discussed here. If you have used HTML, XML, or a language with a block structure, then you understand the basic idea of this model. The lft and rgt columns (their names are abbreviations for “left” and “right,” which are reserved words in Standard SQL) are the count of the “tags” in an XML representation of a tree.
Imagine circles inside circles without any of them overlapping, the way you would draw a markup language structure. This has some predictable results that we can use for building queries, as shown in Figures 28.2, 28.3, and 28.4.
If that mental model does not work for you, to convert the “boxes and arrows” graph into a nested set model, think of a little worm crawling along the tree. The worm starts at the top, the root, and makes a complete trip around the tree. When he comes to a node, he puts a number in the cell on the side that he is visiting and increments his counter. Each node will get two numbers, one for the right side and one for the left.
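As a small worked sketch of the worm's trip, suppose he crawls around the abstract tree of Figure 28.1 and visits siblings in alphabetical order. The table name NestedTree is hypothetical, chosen only for this illustration, and the sibling order is an assumption. He writes 1 on the left of 'A', 2 and 3 on either side of the leaf 'B', 4 on the left of 'C', and so on until he closes 'A' with 12.

```sql
CREATE TABLE NestedTree   -- hypothetical name for this illustration
(node CHAR(2) NOT NULL PRIMARY KEY,
 lft INTEGER NOT NULL UNIQUE,
 rgt INTEGER NOT NULL UNIQUE,
 CHECK (lft < rgt));

INSERT INTO NestedTree (node, lft, rgt)
VALUES ('A', 1, 12), ('B', 2, 3), ('C', 4, 11),
       ('D', 5, 6), ('E', 7, 8), ('F', 9, 10);

-- Subtree rooted at 'C': every node whose lft number
-- falls inside the (lft, rgt) pair of 'C'.
SELECT T2.node
  FROM NestedTree AS T1, NestedTree AS T2
 WHERE T1.node = 'C'
   AND T2.lft BETWEEN T1.lft AND T1.rgt;
```

The query returns 'C', 'D', 'E', and 'F'. Leaf nodes are the rows where rgt = lft + 1, and the root is the row where lft = 1, so none of these questions needs a traversal.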