Joe Celko's SQL for Smarties: Advanced SQL Programming


CHAPTER 30: GRAPHS IN SQL

The most common way to model a graph in SQL is with an adjacency list model. Each edge of the graph is shown as a pair of nodes in which the ordering matters, and any values associated with that edge are shown in another column.

30.1 Basic Graph Characteristics

The following code is from John Gilson. This code uses an adjacency list model of the graph, with nodes in a separate table. This is the most common method for modeling graphs in SQL.

CREATE TABLE Nodes (node_id INTEGER NOT NULL PRIMARY KEY);

CREATE TABLE AdjacencyListGraph
(begin_node_id INTEGER NOT NULL REFERENCES Nodes (node_id),
 end_node_id INTEGER NOT NULL REFERENCES Nodes (node_id),
 PRIMARY KEY (begin_node_id, end_node_id),
 CHECK (begin_node_id <> end_node_id));
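One way to try this schema is to load it into an in-memory SQLite database from Python. This is only a sketch: SQLite, the four sample nodes, the sample edges, and the helper try_insert are my assumptions and are not part of Gilson's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces REFERENCES clauses when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Nodes (node_id INTEGER NOT NULL PRIMARY KEY);
CREATE TABLE AdjacencyListGraph
(begin_node_id INTEGER NOT NULL REFERENCES Nodes (node_id),
 end_node_id INTEGER NOT NULL REFERENCES Nodes (node_id),
 PRIMARY KEY (begin_node_id, end_node_id),
 CHECK (begin_node_id <> end_node_id));
""")

# Hypothetical sample graph: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4.
conn.executemany("INSERT INTO Nodes VALUES (?)", [(1,), (2,), (3,), (4,)])
conn.executemany("INSERT INTO AdjacencyListGraph VALUES (?, ?)",
                 [(1, 2), (1, 3), (2, 4), (3, 4)])

def try_insert(begin_node_id, end_node_id):
    """Return True if the edge is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO AdjacencyListGraph VALUES (?, ?)",
                     (begin_node_id, end_node_id))
        return True
    except sqlite3.IntegrityError:
        return False
```

With this setup, try_insert(2, 2) is rejected by the CHECK constraint, try_insert(1, 2) by the primary key, and try_insert(1, 9) by the foreign key. Note that try_insert(4, 1) is accepted: nothing in the declarations prevents a cycle.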

It is also possible to load an acyclic directed graph into a nested set model by splitting the nodes.

CREATE TABLE NestedSetsGraph
(node_id INTEGER NOT NULL REFERENCES Nodes (node_id),
 lft INTEGER NOT NULL CHECK (lft >= 1) PRIMARY KEY,
 rgt INTEGER NOT NULL UNIQUE,
 CHECK (rgt > lft),
 UNIQUE (node_id, lft));

To split nodes, start at the sink nodes and move up the tree. When you come to a node with an indegree greater than one, replace it with that many copies of the node under each of its superiors. Continue to do this until you get to the root. The acyclic graph will become a tree, but with duplicated node values. There are advantages to this model; we will discuss them in Section 30.3.
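The effect of node splitting can be sketched in a few lines of Python. Instead of working upward from the sinks, this sketch walks down from the root and numbers a node every time it is reached, which I believe produces the same duplicated tree for a single-rooted acyclic graph. The function name nested_sets and the sample edges are assumptions made for illustration only.

```python
def nested_sets(edges, root):
    """Return (node_id, lft, rgt) rows for an acyclic graph given as edge pairs.

    A node with indegree greater than one is visited once per superior,
    so it appears as several rows with different (lft, rgt) pairs.
    """
    children = {}
    for begin_node_id, end_node_id in edges:
        children.setdefault(begin_node_id, []).append(end_node_id)

    rows = []
    counter = 0

    def visit(node_id):
        nonlocal counter
        counter += 1
        lft = counter
        for child_id in sorted(children.get(node_id, [])):
            visit(child_id)
        counter += 1
        rows.append((node_id, lft, counter))

    visit(root)
    return sorted(rows, key=lambda row: row[1])


# Hypothetical sample graph: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4.
rows = nested_sets([(1, 2), (1, 3), (2, 4), (3, 4)], root=1)
# rows == [(1, 1, 10), (2, 2, 5), (4, 3, 4), (3, 6, 9), (4, 7, 8)]
```

Node 4 has an indegree of two, so it shows up twice: once under node 2 and once under node 3.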

30.1.1 All Nodes in the Graph

To view all nodes in the graph, use the following:


CREATE VIEW GraphNodes (node_id) AS

SELECT DISTINCT node_id FROM NestedSetsGraph;

30.1.2 Path Endpoints

A path through a graph is a traversal of consecutive nodes along a sequence of edges. Clearly, the node at the end of one edge in the sequence must also be the node at the beginning of the next edge in the sequence. The length of the path is the number of edges that are traversed along the path.

Path endpoints are the first and last nodes of each path in the graph. For a path of length zero, the path endpoints are the same node. If there is more than one path between two nodes, each path will be distinguished by its own distinct set of number pairs for the nested-set representation.

If there is only one path, P, between two nodes, but P is a subpath of more than one distinct path, then the endpoints of P will have number pairs for each of these greater paths. As a canonical form, the least-numbered pairs are returned for these endpoints.

CREATE VIEW PathEndpoints
(begin_node_id, end_node_id, begin_lft, begin_rgt, end_lft, end_rgt)
AS
SELECT G1.node_id, G2.node_id, G1.lft, G1.rgt, G2.lft, G2.rgt
  FROM (SELECT node_id, MIN(lft), MIN(rgt)
          FROM NestedSetsGraph
         GROUP BY node_id) AS G1 (node_id, lft, rgt)
       INNER JOIN
       NestedSetsGraph AS G2
       ON G2.lft >= G1.lft
          AND G2.lft < G1.rgt;

30.1.3 Reachable Nodes

If a node is reachable from another node, then a path exists from the one node to the other. It is assumed that every node is reachable from itself.

CREATE VIEW ReachableNodes (begin_node_id, end_node_id) AS

SELECT DISTINCT begin_node_id, end_node_id FROM PathEndpoints;
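To see what these two views return, here is a sketch that runs them in SQLite from Python on a small hand-built nested sets table. The sample rows (the graph 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4 after node 4 has been split) are my own assumption. SQLite does not accept a column list after a derived-table alias, so the MIN() columns are named with AS instead; otherwise the views are unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NestedSetsGraph
(node_id INTEGER NOT NULL,
 lft INTEGER NOT NULL PRIMARY KEY CHECK (lft >= 1),
 rgt INTEGER NOT NULL UNIQUE,
 CHECK (rgt > lft));

-- assumed sample data; node 4 appears twice because it was split
INSERT INTO NestedSetsGraph
VALUES (1, 1, 10), (2, 2, 5), (4, 3, 4), (3, 6, 9), (4, 7, 8);

CREATE VIEW PathEndpoints
(begin_node_id, end_node_id, begin_lft, begin_rgt, end_lft, end_rgt)
AS
SELECT G1.node_id, G2.node_id, G1.lft, G1.rgt, G2.lft, G2.rgt
  FROM (SELECT node_id, MIN(lft) AS lft, MIN(rgt) AS rgt
          FROM NestedSetsGraph
         GROUP BY node_id) AS G1
       INNER JOIN
       NestedSetsGraph AS G2
       ON G2.lft >= G1.lft
          AND G2.lft < G1.rgt;

CREATE VIEW ReachableNodes (begin_node_id, end_node_id)
AS
SELECT DISTINCT begin_node_id, end_node_id
  FROM PathEndpoints;
""")

endpoint_tally = conn.execute("SELECT COUNT(*) FROM PathEndpoints").fetchone()[0]
reachable = conn.execute(
    "SELECT begin_node_id, end_node_id FROM ReachableNodes ORDER BY 1, 2"
).fetchall()
```

PathEndpoints has ten rows here, because there are two distinct paths from node 1 to node 4. ReachableNodes collapses them into nine pairs, including the four (n, n) pairs for paths of length zero.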

30.1.4 Edges

Edges are pairs of adjacent connected nodes in the graph. If edge E is represented by the pair of nodes (n0, n1), then n1 is reachable from n0 in a single traversal.

CREATE VIEW Edges (begin_node_id, end_node_id)
AS
SELECT begin_node_id, end_node_id
  FROM PathEndpoints AS PE
 WHERE begin_node_id <> end_node_id
   AND NOT EXISTS
       (SELECT *
          FROM NestedSetsGraph AS G
         WHERE G.lft > PE.begin_lft
           AND G.lft < PE.end_lft
           AND G.rgt > PE.end_rgt);
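As a check on the Edges view, the following sketch rebuilds the original edge list from a nested sets table using SQLite from Python. The sample rows are assumed (the graph 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4 with node 4 split), and the derived table uses AS column names because SQLite lacks derived column lists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NestedSetsGraph
(node_id INTEGER NOT NULL,
 lft INTEGER NOT NULL PRIMARY KEY,
 rgt INTEGER NOT NULL UNIQUE,
 CHECK (rgt > lft));

-- assumed sample data
INSERT INTO NestedSetsGraph
VALUES (1, 1, 10), (2, 2, 5), (4, 3, 4), (3, 6, 9), (4, 7, 8);

CREATE VIEW PathEndpoints
(begin_node_id, end_node_id, begin_lft, begin_rgt, end_lft, end_rgt)
AS
SELECT G1.node_id, G2.node_id, G1.lft, G1.rgt, G2.lft, G2.rgt
  FROM (SELECT node_id, MIN(lft) AS lft, MIN(rgt) AS rgt
          FROM NestedSetsGraph
         GROUP BY node_id) AS G1
       INNER JOIN
       NestedSetsGraph AS G2
       ON G2.lft >= G1.lft
          AND G2.lft < G1.rgt;

CREATE VIEW Edges (begin_node_id, end_node_id)
AS
SELECT begin_node_id, end_node_id
  FROM PathEndpoints AS PE
 WHERE begin_node_id <> end_node_id
   AND NOT EXISTS  -- no node sits strictly between the two endpoints
       (SELECT *
          FROM NestedSetsGraph AS G
         WHERE G.lft > PE.begin_lft
           AND G.lft < PE.end_lft
           AND G.rgt > PE.end_rgt);
""")

edges = conn.execute(
    "SELECT begin_node_id, end_node_id FROM Edges ORDER BY 1, 2"
).fetchall()
# edges == [(1, 2), (1, 3), (2, 4), (3, 4)]
```

The two paths from node 1 to node 4 are filtered out because node 2 or node 3 lies between the endpoints, so only the four single-traversal pairs remain.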

30.1.5 Indegree and Outdegree

The indegree of a node, n, is the number of distinct edges ending at n. Because the view starts from GraphNodes and uses a LEFT OUTER JOIN, nodes that have an indegree of zero are returned with a count of zero. To determine the indegree of all nodes in the graph:

CREATE VIEW Indegree (node_id, node_indegree)
AS
SELECT N.node_id, COUNT(E.begin_node_id)
  FROM GraphNodes AS N
       LEFT OUTER JOIN
       Edges AS E
       ON N.node_id = E.end_node_id
 GROUP BY N.node_id;

The outdegree of a node, n, is the number of distinct edges beginning at n. As with the indegree, the LEFT OUTER JOIN means that nodes with an outdegree of zero are returned with a count of zero. To determine the outdegree of all nodes in the graph:

CREATE VIEW Outdegree (node_id, node_outdegree)

AS

SELECT N.node_id, COUNT(E.end_node_id)

FROM GraphNodes AS N

LEFT OUTER JOIN

Edges AS E

ON N.node_id = E.begin_node_id

GROUP BY N.node_id;
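The degree views can be exercised end to end with a sketch like the one below, which stacks GraphNodes, PathEndpoints, Edges, Indegree, and Outdegree in SQLite. The sample rows are an assumption (the graph 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4 after splitting node 4), and the derived table is aliased with AS for SQLite's benefit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NestedSetsGraph
(node_id INTEGER NOT NULL,
 lft INTEGER NOT NULL PRIMARY KEY,
 rgt INTEGER NOT NULL UNIQUE);

-- assumed sample data
INSERT INTO NestedSetsGraph
VALUES (1, 1, 10), (2, 2, 5), (4, 3, 4), (3, 6, 9), (4, 7, 8);

CREATE VIEW GraphNodes (node_id) AS
SELECT DISTINCT node_id FROM NestedSetsGraph;

CREATE VIEW PathEndpoints
(begin_node_id, end_node_id, begin_lft, begin_rgt, end_lft, end_rgt) AS
SELECT G1.node_id, G2.node_id, G1.lft, G1.rgt, G2.lft, G2.rgt
  FROM (SELECT node_id, MIN(lft) AS lft, MIN(rgt) AS rgt
          FROM NestedSetsGraph
         GROUP BY node_id) AS G1
       INNER JOIN NestedSetsGraph AS G2
       ON G2.lft >= G1.lft AND G2.lft < G1.rgt;

CREATE VIEW Edges (begin_node_id, end_node_id) AS
SELECT begin_node_id, end_node_id
  FROM PathEndpoints AS PE
 WHERE begin_node_id <> end_node_id
   AND NOT EXISTS
       (SELECT * FROM NestedSetsGraph AS G
         WHERE G.lft > PE.begin_lft
           AND G.lft < PE.end_lft
           AND G.rgt > PE.end_rgt);

CREATE VIEW Indegree (node_id, node_indegree) AS
SELECT N.node_id, COUNT(E.begin_node_id)
  FROM GraphNodes AS N
       LEFT OUTER JOIN Edges AS E
       ON N.node_id = E.end_node_id
 GROUP BY N.node_id;

CREATE VIEW Outdegree (node_id, node_outdegree) AS
SELECT N.node_id, COUNT(E.end_node_id)
  FROM GraphNodes AS N
       LEFT OUTER JOIN Edges AS E
       ON N.node_id = E.begin_node_id
 GROUP BY N.node_id;
""")

# Map each node_id to its degree.
indegree = dict(conn.execute("SELECT node_id, node_indegree FROM Indegree"))
outdegree = dict(conn.execute("SELECT node_id, node_outdegree FROM Outdegree"))
```

On this data the root (node 1) comes back with an indegree of zero and the sink (node 4) with an outdegree of zero, since COUNT() over the unmatched side of the outer join counts no rows.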

30.1.6 Source, Sink, Isolated, and Internal Nodes

A source node of a graph has a positive outdegree but an indegree of zero; that is, it has edges leading from, but not to, the node. This assumes there are no isolated nodes (nodes belonging to no edges).

CREATE VIEW SourceNodes (node_id, lft, rgt)

AS

SELECT node_id, lft, rgt

FROM NestedSetsGraph AS G1

WHERE NOT EXISTS

(SELECT *

FROM NestedSetsGraph AS G2

WHERE G1.lft > G2.lft

AND G1.lft < G2.rgt);

Likewise, a sink node of a graph has positive indegree but an outdegree of zero; that is, it has edges leading to, but not from, the node. This assumes there are no isolated nodes.

CREATE VIEW SinkNodes (node_id)
AS
SELECT node_id
  FROM NestedSetsGraph AS G1
 WHERE lft = rgt - 1
   AND NOT EXISTS
       (SELECT *
          FROM NestedSetsGraph AS G2
         WHERE G1.node_id = G2.node_id
           AND G2.lft < G1.lft);

An isolated node belongs to no edges; i.e., it has zero indegree and zero outdegree.

CREATE VIEW IsolatedNodes (node_id, lft, rgt)
AS
SELECT node_id, lft, rgt
  FROM NestedSetsGraph AS G1
 WHERE lft = rgt - 1
   AND NOT EXISTS
       (SELECT *
          FROM NestedSetsGraph AS G2
         WHERE G1.lft > G2.lft
           AND G1.lft < G2.rgt);

An internal node of a graph has an indegree greater than zero and an outdegree greater than zero; that is, it acts as both a source and a sink.

CREATE VIEW InternalNodes (node_id)
AS
SELECT node_id
  FROM (SELECT node_id, MIN(lft) AS lft, MIN(rgt) AS rgt
          FROM NestedSetsGraph
         WHERE lft < rgt - 1
         GROUP BY node_id) AS G1
 WHERE EXISTS
       (SELECT *
          FROM NestedSetsGraph AS G2
         WHERE G1.lft > G2.lft
           AND G1.lft < G2.rgt);
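The four node classes can be compared side by side with a sketch like this one, run in SQLite from Python. The sample rows are assumed: the split graph 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4, plus a node 5 at (11, 12) that belongs to no edge, added to show what the "no isolated nodes" assumption means in practice. The helper node_ids is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NestedSetsGraph
(node_id INTEGER NOT NULL,
 lft INTEGER NOT NULL PRIMARY KEY,
 rgt INTEGER NOT NULL UNIQUE);

-- assumed sample data; node 5 is isolated
INSERT INTO NestedSetsGraph
VALUES (1, 1, 10), (2, 2, 5), (4, 3, 4), (3, 6, 9), (4, 7, 8), (5, 11, 12);

CREATE VIEW SourceNodes (node_id, lft, rgt) AS
SELECT node_id, lft, rgt
  FROM NestedSetsGraph AS G1
 WHERE NOT EXISTS
       (SELECT * FROM NestedSetsGraph AS G2
         WHERE G1.lft > G2.lft AND G1.lft < G2.rgt);

CREATE VIEW SinkNodes (node_id) AS
SELECT node_id
  FROM NestedSetsGraph AS G1
 WHERE lft = rgt - 1
   AND NOT EXISTS
       (SELECT * FROM NestedSetsGraph AS G2
         WHERE G1.node_id = G2.node_id AND G2.lft < G1.lft);

CREATE VIEW IsolatedNodes (node_id, lft, rgt) AS
SELECT node_id, lft, rgt
  FROM NestedSetsGraph AS G1
 WHERE lft = rgt - 1
   AND NOT EXISTS
       (SELECT * FROM NestedSetsGraph AS G2
         WHERE G1.lft > G2.lft AND G1.lft < G2.rgt);

CREATE VIEW InternalNodes (node_id) AS
SELECT node_id
  FROM (SELECT node_id, MIN(lft) AS lft, MIN(rgt) AS rgt
          FROM NestedSetsGraph
         WHERE lft < rgt - 1
         GROUP BY node_id) AS G1
 WHERE EXISTS
       (SELECT * FROM NestedSetsGraph AS G2
         WHERE G1.lft > G2.lft AND G1.lft < G2.rgt);
""")

def node_ids(view_name):
    """List the node_id column of one of the views above, in order."""
    query = f"SELECT node_id FROM {view_name} ORDER BY node_id"
    return [row[0] for row in conn.execute(query)]

sources = node_ids("SourceNodes")      # [1, 5]
sinks = node_ids("SinkNodes")          # [4, 5]
isolated = node_ids("IsolatedNodes")   # [5]
internal = node_ids("InternalNodes")   # [2, 3]
```

The isolated node 5 is reported as both a source and a sink, which is why those two definitions carry the caveat about isolated nodes. Subtracting IsolatedNodes from each would be one way to tighten them.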

30.2 Paths in a Graph

Finding a path in a graph is the most important commercial application of graphs. Graphs model transportation networks, electrical and cable systems, process control flow, and thousands of other things.

A path, P, of length L from a node n0 to a node nk in the graph is defined as a traversal of (L + 1) contiguous nodes along a sequence of edges, where the first node is node number 0 and the last is node number k (so k = L).

CREATE VIEW Paths
(begin_node_id, end_node_id, this_node_id, seq_nbr,
 begin_lft, begin_rgt, end_lft, end_rgt,
 this_lft, this_rgt)
AS
SELECT PE.begin_node_id, PE.end_node_id, G1.node_id,
       (SELECT COUNT(*)
          FROM NestedSetsGraph AS G2
         WHERE G2.lft > PE.begin_lft
           AND G2.lft <= G1.lft
           AND G2.rgt >= G1.rgt),
       PE.begin_lft, PE.begin_rgt,
       PE.end_lft, PE.end_rgt,
       G1.lft, G1.rgt
  FROM PathEndpoints AS PE
       INNER JOIN
       NestedSetsGraph AS G1
       ON G1.lft BETWEEN PE.begin_lft
                     AND PE.end_lft
          AND G1.rgt >= PE.end_rgt;

30.2.1 Length of Paths

The length of a path is the number of edges that are traversed along the path. A path of n nodes has a length of (n - 1).

CREATE VIEW PathLengths

(begin_node_id, end_node_id,

path_length,

begin_lft, begin_rgt,

end_lft, end_rgt)

AS

SELECT begin_node_id, end_node_id, MAX(seq_nbr),

begin_lft, begin_rgt, end_lft, end_rgt

FROM Paths

GROUP BY begin_lft, end_lft, begin_rgt, end_rgt,

begin_node_id, end_node_id;

30.2.2 Shortest Path

The following code gives the shortest path length between all nodes, but it does not tell you what the actual path is. There are other queries that use the new CTE feature and recursion, which we will discuss in Section 30.3.

CREATE VIEW ShortestPathLengths
(begin_node_id, end_node_id, path_length,
 begin_lft, begin_rgt, end_lft, end_rgt)
AS
SELECT PL.begin_node_id, PL.end_node_id, PL.path_length,
       PL.begin_lft, PL.begin_rgt, PL.end_lft, PL.end_rgt
  FROM (SELECT begin_node_id, end_node_id,
               MIN(path_length) AS path_length
          FROM PathLengths
         GROUP BY begin_node_id, end_node_id) AS MPL
       INNER JOIN
       PathLengths AS PL
       ON MPL.begin_node_id = PL.begin_node_id
          AND MPL.end_node_id = PL.end_node_id
          AND MPL.path_length = PL.path_length;
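Here is a sketch that chains Paths, PathLengths, and ShortestPathLengths in SQLite from Python. The sample rows are my assumption: the graph 1 -> 2, 2 -> 3, 1 -> 3, where node 3 has been split so that one copy sits under node 2 and the other directly under node 1. The derived table in PathEndpoints uses AS names because SQLite lacks derived column lists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NestedSetsGraph
(node_id INTEGER NOT NULL,
 lft INTEGER NOT NULL PRIMARY KEY,
 rgt INTEGER NOT NULL UNIQUE);

-- assumed sample data: 1 -> 2 -> 3 and 1 -> 3
INSERT INTO NestedSetsGraph
VALUES (1, 1, 8), (2, 2, 5), (3, 3, 4), (3, 6, 7);

CREATE VIEW PathEndpoints
(begin_node_id, end_node_id, begin_lft, begin_rgt, end_lft, end_rgt) AS
SELECT G1.node_id, G2.node_id, G1.lft, G1.rgt, G2.lft, G2.rgt
  FROM (SELECT node_id, MIN(lft) AS lft, MIN(rgt) AS rgt
          FROM NestedSetsGraph GROUP BY node_id) AS G1
       INNER JOIN NestedSetsGraph AS G2
       ON G2.lft >= G1.lft AND G2.lft < G1.rgt;

CREATE VIEW Paths
(begin_node_id, end_node_id, this_node_id, seq_nbr,
 begin_lft, begin_rgt, end_lft, end_rgt, this_lft, this_rgt) AS
SELECT PE.begin_node_id, PE.end_node_id, G1.node_id,
       (SELECT COUNT(*)  -- position of this node along the path
          FROM NestedSetsGraph AS G2
         WHERE G2.lft > PE.begin_lft
           AND G2.lft <= G1.lft
           AND G2.rgt >= G1.rgt),
       PE.begin_lft, PE.begin_rgt, PE.end_lft, PE.end_rgt,
       G1.lft, G1.rgt
  FROM PathEndpoints AS PE
       INNER JOIN NestedSetsGraph AS G1
       ON G1.lft BETWEEN PE.begin_lft AND PE.end_lft
          AND G1.rgt >= PE.end_rgt;

CREATE VIEW PathLengths
(begin_node_id, end_node_id, path_length,
 begin_lft, begin_rgt, end_lft, end_rgt) AS
SELECT begin_node_id, end_node_id, MAX(seq_nbr),
       begin_lft, begin_rgt, end_lft, end_rgt
  FROM Paths
 GROUP BY begin_lft, end_lft, begin_rgt, end_rgt,
          begin_node_id, end_node_id;

CREATE VIEW ShortestPathLengths
(begin_node_id, end_node_id, path_length,
 begin_lft, begin_rgt, end_lft, end_rgt) AS
SELECT PL.begin_node_id, PL.end_node_id, PL.path_length,
       PL.begin_lft, PL.begin_rgt, PL.end_lft, PL.end_rgt
  FROM (SELECT begin_node_id, end_node_id,
               MIN(path_length) AS path_length
          FROM PathLengths
         GROUP BY begin_node_id, end_node_id) AS MPL
       INNER JOIN PathLengths AS PL
       ON MPL.begin_node_id = PL.begin_node_id
          AND MPL.end_node_id = PL.end_node_id
          AND MPL.path_length = PL.path_length;
""")

lengths_1_to_3 = sorted(row[0] for row in conn.execute(
    "SELECT path_length FROM PathLengths"
    " WHERE begin_node_id = 1 AND end_node_id = 3"))
shortest = conn.execute(
    "SELECT begin_node_id, end_node_id, path_length"
    "  FROM ShortestPathLengths ORDER BY 1, 2").fetchall()
```

PathLengths reports both routes from node 1 to node 3 (lengths 1 and 2), and ShortestPathLengths keeps only the direct edge. If two routes tie for shortest, the join returns a row for each of them.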

30.2.3 Paths by Iteration

First, let's build a graph that has a cost associated with each edge and put it into an adjacency list model.

INSERT INTO Edges (out_node, in_node, cost)
VALUES ('A', 'B', 50),
       ('A', 'C', 30),
       ('A', 'D', 100),
       ('A', 'E', 10),
       ('C', 'B', 5),
       ('D', 'B', 20),
       ('D', 'C', 50),
       ('E', 'D', 10);

To find the shortest paths from one node to the other nodes it can reach, we can write this recursive VIEW:

CREATE VIEW ShortestPaths (out_node, in_node, path_length) AS

WITH RECURSIVE Paths (out_node, in_node, path_length) AS

(SELECT out_node, in_node, 1 FROM Edges


UNION ALL

SELECT E1.out_node, P1.in_node, P1.path_length + 1

FROM Edges AS E1, Paths AS P1

WHERE E1.in_node = P1.out_node)

SELECT out_node, in_node, MIN(path_length)

FROM Paths

GROUP BY out_node, in_node;

out_node in_node path_length
============================
'A'      'B'     1
'A'      'C'     1
'A'      'D'     1
'A'      'E'     1
'C'      'B'     1
'D'      'B'     1
'D'      'C'     1
'E'      'B'     2
'E'      'C'     2
'E'      'D'     1
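The recursive query can be run as written in SQLite, which supports WITH RECURSIVE. In this sketch the Edges table declaration is my assumption (the text only shows the INSERT), and I run the query directly rather than wrapping it in a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- assumed declaration; only the INSERT appears in the text
CREATE TABLE Edges
(out_node CHAR(1) NOT NULL,
 in_node CHAR(1) NOT NULL,
 cost INTEGER NOT NULL,
 PRIMARY KEY (out_node, in_node));

INSERT INTO Edges (out_node, in_node, cost)
VALUES ('A', 'B', 50), ('A', 'C', 30), ('A', 'D', 100), ('A', 'E', 10),
       ('C', 'B', 5), ('D', 'B', 20), ('D', 'C', 50), ('E', 'D', 10);
""")

shortest_paths = conn.execute("""
WITH RECURSIVE Paths (out_node, in_node, path_length) AS
(SELECT out_node, in_node, 1
   FROM Edges
 UNION ALL
 -- put one more edge in front of a path found so far
 SELECT E1.out_node, P1.in_node, P1.path_length + 1
   FROM Edges AS E1, Paths AS P1
  WHERE E1.in_node = P1.out_node)
SELECT out_node, in_node, MIN(path_length)
  FROM Paths
 GROUP BY out_node, in_node
 ORDER BY out_node, in_node
""").fetchall()

for out_node, in_node, path_length in shortest_paths:
    print(out_node, in_node, path_length)
```

The recursion stops on its own here because the sample graph is acyclic. With UNION ALL and a cycle in Edges, the CTE would keep generating longer and longer paths, so a cyclic graph needs some bound on path_length.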

To find the shortest paths without recursion, stay in a loop and add one edge at a time to the set of paths defined so far.

CREATE PROCEDURE IteratePaths()
LANGUAGE SQL
MODIFIES SQL DATA
BEGIN
DECLARE old_path_tally INTEGER;
SET old_path_tally = 0;
DELETE FROM Paths; -- clean out working table
INSERT INTO Paths
SELECT out_node, in_node, 1
  FROM Edges; -- load the edges
-- add one edge to each path
WHILE old_path_tally < (SELECT COUNT(*) FROM Paths)
DO SET old_path_tally = (SELECT COUNT(*) FROM Paths);
   INSERT INTO Paths (out_node, in_node, lgth)
   SELECT E1.out_node, P1.in_node, (1 + P1.lgth)
     FROM Edges AS E1, Paths AS P1
    WHERE E1.in_node = P1.out_node
      AND NOT EXISTS -- path is not here already
          (SELECT *
             FROM Paths AS P2
            WHERE E1.out_node = P2.out_node
              AND P1.in_node = P2.in_node);
END WHILE;
END;
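Not every SQL product has SQL/PSM, so here is a sketch of the same loop driven from Python against SQLite. The table declarations and the helper path_tally are my assumptions; the INSERT statement inside the loop is the one from the procedure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- assumed declarations for the base and working tables
CREATE TABLE Edges
(out_node CHAR(1) NOT NULL,
 in_node CHAR(1) NOT NULL,
 cost INTEGER NOT NULL,
 PRIMARY KEY (out_node, in_node));

INSERT INTO Edges (out_node, in_node, cost)
VALUES ('A', 'B', 50), ('A', 'C', 30), ('A', 'D', 100), ('A', 'E', 10),
       ('C', 'B', 5), ('D', 'B', 20), ('D', 'C', 50), ('E', 'D', 10);

CREATE TABLE Paths
(out_node CHAR(1) NOT NULL,
 in_node CHAR(1) NOT NULL,
 lgth INTEGER NOT NULL);

-- load the edges as paths of length one
INSERT INTO Paths SELECT out_node, in_node, 1 FROM Edges;
""")

def path_tally():
    """Number of rows currently in the working table."""
    return conn.execute("SELECT COUNT(*) FROM Paths").fetchone()[0]

old_path_tally = 0
while old_path_tally < path_tally():  # stop when a pass adds nothing
    old_path_tally = path_tally()
    conn.execute("""
        INSERT INTO Paths (out_node, in_node, lgth)
        SELECT E1.out_node, P1.in_node, (1 + P1.lgth)
          FROM Edges AS E1, Paths AS P1
         WHERE E1.in_node = P1.out_node
           AND NOT EXISTS  -- path is not here already
               (SELECT *
                  FROM Paths AS P2
                 WHERE E1.out_node = P2.out_node
                   AND P1.in_node = P2.in_node)
    """)

path_lengths = conn.execute("""
    SELECT out_node, in_node, MIN(lgth)
      FROM Paths
     GROUP BY out_node, in_node
     ORDER BY out_node, in_node
""").fetchall()
```

On the sample data the first pass adds the two-edge paths from 'E' to 'B' and from 'E' to 'C', the second pass adds nothing, and the loop ends. The final MIN() is a precaution in case one pass finds the same pair of endpoints by more than one route.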

The least cost path is basically the same algorithm, but instead of a constant of one for the path length, we use the actual costs of the edges.

CREATE PROCEDURE IterateCheapPaths ()
LANGUAGE SQL
MODIFIES SQL DATA
BEGIN
DECLARE old_path_cost INTEGER;
SET old_path_cost = 0;
DELETE FROM Paths; -- clean out working table
INSERT INTO Paths
SELECT out_node, in_node, cost
  FROM Edges; -- load the edges
-- add one edge to each path
WHILE old_path_cost < (SELECT COUNT(*) FROM Paths)
DO SET old_path_cost = (SELECT COUNT(*) FROM Paths);
   INSERT INTO Paths (out_node, in_node, cost)
   SELECT E1.out_node, P1.in_node, (E1.cost + P1.cost)
     FROM Edges AS E1
          INNER JOIN
          (SELECT out_node, in_node, MIN(cost)
             FROM Paths
            GROUP BY out_node, in_node)
          AS P1 (out_node, in_node, cost)
          ON E1.in_node = P1.out_node
             AND NOT EXISTS
                 (SELECT *
                    FROM Paths AS P2
                   WHERE E1.out_node = P2.out_node
                     AND P1.in_node = P2.in_node
                     AND P2.cost <= E1.cost + P1.cost);
END WHILE;
END;
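The least cost loop can be tried the same way. This sketch drives the procedure's INSERT from Python against SQLite; the table declarations and the helper path_tally are assumed, and MIN(cost) is named with AS because SQLite does not take a column list on a derived table. The working table has no key, since it can hold several costs for the same pair of endpoints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- assumed declarations for the base and working tables
CREATE TABLE Edges
(out_node CHAR(1) NOT NULL,
 in_node CHAR(1) NOT NULL,
 cost INTEGER NOT NULL,
 PRIMARY KEY (out_node, in_node));

INSERT INTO Edges (out_node, in_node, cost)
VALUES ('A', 'B', 50), ('A', 'C', 30), ('A', 'D', 100), ('A', 'E', 10),
       ('C', 'B', 5), ('D', 'B', 20), ('D', 'C', 50), ('E', 'D', 10);

CREATE TABLE Paths
(out_node CHAR(1) NOT NULL,
 in_node CHAR(1) NOT NULL,
 cost INTEGER NOT NULL);

-- load the edges with their own costs
INSERT INTO Paths SELECT out_node, in_node, cost FROM Edges;
""")

def path_tally():
    """Number of rows currently in the working table."""
    return conn.execute("SELECT COUNT(*) FROM Paths").fetchone()[0]

old_path_tally = 0
while old_path_tally < path_tally():  # stop when no cheaper path is found
    old_path_tally = path_tally()
    conn.execute("""
        INSERT INTO Paths (out_node, in_node, cost)
        SELECT E1.out_node, P1.in_node, (E1.cost + P1.cost)
          FROM Edges AS E1
               INNER JOIN
               (SELECT out_node, in_node, MIN(cost) AS cost
                  FROM Paths
                 GROUP BY out_node, in_node) AS P1
               ON E1.in_node = P1.out_node
                  AND NOT EXISTS  -- nothing as cheap is stored yet
                      (SELECT *
                         FROM Paths AS P2
                        WHERE E1.out_node = P2.out_node
                          AND P1.in_node = P2.in_node
                          AND P2.cost <= E1.cost + P1.cost)
    """)

cheapest = conn.execute("""
    SELECT out_node, in_node, MIN(cost)
      FROM Paths
     GROUP BY out_node, in_node
     ORDER BY out_node, in_node
""").fetchall()
```

On the sample data the direct edge from 'A' to 'B' costs 50, but the loop finds the route through 'C' at 35, and the direct edge from 'A' to 'D' at 100 is beaten by the route through 'E' at 20. The superseded rows stay in Paths, so the final answer has to be read with MIN(cost).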


30.2.4 Listing the Paths

I took the data for this table from the book Introduction to Algorithms (Cormen, Leiserson, and Rivest 1990), page 518. This book was very popular in college courses in the United States. I made one decision that will be important later: I added self-traversal edges (i.e., the node is both the out_node and the in_node of an edge) with weights of zero.

INSERT INTO Edges VALUES ('s', 's', 0);

INSERT INTO Edges VALUES ('s', 'u', 3);

INSERT INTO Edges VALUES ('s', 'x', 5);

INSERT INTO Edges VALUES ('u', 'u', 0);

INSERT INTO Edges VALUES ('u', 'v', 6);

INSERT INTO Edges VALUES ('u', 'x', 2);

INSERT INTO Edges VALUES ('v', 'v', 0);

INSERT INTO Edges VALUES ('v', 'y', 2);

INSERT INTO Edges VALUES ('x', 'u', 1);

INSERT INTO Edges VALUES ('x', 'v', 4);

INSERT INTO Edges VALUES ('x', 'x', 0);

INSERT INTO Edges VALUES ('x', 'y', 6);

INSERT INTO Edges VALUES ('y', 's', 3);

INSERT INTO Edges VALUES ('y', 'v', 7);

INSERT INTO Edges VALUES ('y', 'y', 0);

I am not happy about this approach, because I have to decide the maximum number of edges in a path before I start looking for an answer. But this solution will work, and I know that a path will have no more than the total number of nodes in the graph. Let's create a table to hold the paths:

CREATE TABLE Paths
(step1 CHAR(2) NOT NULL,
 step2 CHAR(2) NOT NULL,
 step3 CHAR(2) NOT NULL,
 step4 CHAR(2) NOT NULL,
 step5 CHAR(2) NOT NULL,
 total_cost INTEGER NOT NULL,
 path_length INTEGER NOT NULL,
 PRIMARY KEY (step1, step2, step3, step4, step5));
