Chapter 4: Algorithms for Query Processing and Optimization
Ho Chi Minh City University of Technology Faculty of Computer Science and Engineering
Database Management Systems
(CO3021)
Computer Science Program
Dr Võ Thị Ngọc Châu (chauvtn@hcmut.edu.vn)
Course outline
Chapter 1 Overall Introduction to Database Management Systems
Chapter 2 Disk Storage and Basic File Structures
Chapter 3 Indexing Structures for Files
Chapter 4 Algorithms for Query Processing and Optimization
Chapter 5 Introduction to Transaction Processing Concepts and Theory
Chapter 6 Concurrency Control Techniques
Chapter 7 Database Recovery Techniques
Content
4.1 Introduction to Query Processing
4.2 Translating SQL Queries into Relational Algebra
4.3 Algorithms for External Sorting
4.4 Algorithms for SELECT and JOIN Operations
4.5 Algorithms for PROJECT and SET Operations
4.6 Implementing Aggregate Operations and Outer Joins
4.7 Combining Operations using Pipelining
4.8 Using Heuristics in Query Optimization
4.9 Using Selectivity and Cost Estimates in Query Optimization
4.10 Overview of Query Optimization in Oracle
4.1 Introduction to Query Processing
CREATE TABLE EMPLOYEE (
Fname VARCHAR(15) NOT NULL,
Minit CHAR,
Lname VARCHAR(15) NOT NULL,
Ssn CHAR(9) NOT NULL,
…
ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT EMPDEPTFK
FOREIGN KEY(Dno) REFERENCES DEPARTMENT(Dnumber)
ON DELETE SET DEFAULT ON UPDATE CASCADE);
How would you produce these results?
Retrieve the SSN, last name, and department number of all employees who work in department 1 or who were born after 01/01/1955 and have a salary higher than 30000.
SELECT SSN, LNAME, DNO
FROM EMPLOYEE
WHERE DNO = 1 OR
(BDATE > '01/01/1955'
AND SALARY > 30000);
Typical steps when processing a high-level query
A query expressed in a high-level query language such as SQL must first be scanned, parsed, and validated.
– The scanner identifies the query tokens (SQL keywords, attribute names, and relation names) that appear in the query text.
– The parser checks the query syntax to determine whether it is formulated according to the syntax (grammar) rules of the language.
– The validator checks that all attribute and relation names are valid and semantically meaningful names in the database schema.
The query is then represented in an intermediate form, i.e., an internal representation:
– Query tree
– Query graph
The DBMS must then devise an execution strategy, or query plan, for retrieving the results of the query from the database files.
– An execution plan includes details about the access methods available for each relation and the algorithms to be used in computing the relational operators represented in the tree.
– A query has many possible execution plans, and the process of choosing a suitable one for processing a query is known as query optimization.
The query optimizer module has the task of producing a good execution plan.
The code generator generates the code to execute that plan.
The runtime database processor has the task of running (executing) the query code, whether in compiled or interpreted mode, to produce the query result.
– If a runtime error results, an error message is generated by the runtime database processor.
Query tree
– A tree data structure that corresponds to an extended relational algebra expression.
– It represents the input relations of the query as leaf nodes of the tree.
– It represents the relational algebra operations as internal nodes.
– An execution of the query tree consists of executing an internal node operation whenever its operands are available and then replacing that internal node by the relation that results from executing the operation.
– The order of execution of operations starts at the leaf nodes and ends at the root node.
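The bottom-up execution of a query tree can be sketched in a few lines of Python. This is only an illustration, not how a DBMS stores its plans: the class names `Relation`, `Select`, `Project`, the function `execute`, the sample rows, and the ISO date strings are all assumptions made for the example. It evaluates the example query on EMPLOYEE by executing each internal node once its operand is available.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Relation:            # leaf node: an input relation
    rows: List[dict]


@dataclass
class Select:              # internal node: selection (sigma)
    predicate: Callable[[dict], bool]
    child: "Node"


@dataclass
class Project:             # internal node: projection (pi), duplicates kept
    attrs: List[str]
    child: "Node"


Node = Union[Relation, Select, Project]


def execute(node):
    """Execute the tree bottom-up: operands first, then the operation itself."""
    if isinstance(node, Relation):
        return node.rows
    operand = execute(node.child)          # operand becomes available here
    if isinstance(node, Select):
        return [row for row in operand if node.predicate(row)]
    if isinstance(node, Project):
        return [{a: row[a] for a in node.attrs} for row in operand]
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Hypothetical sample data; dates are ISO strings so that string comparison works.
employee = Relation([
    {"SSN": "111", "LNAME": "Smith", "DNO": 1, "BDATE": "1950-03-02", "SALARY": 20000},
    {"SSN": "222", "LNAME": "Wong", "DNO": 5, "BDATE": "1960-07-19", "SALARY": 40000},
    {"SSN": "333", "LNAME": "Zelaya", "DNO": 4, "BDATE": "1968-01-19", "SALARY": 25000},
    {"SSN": "444", "LNAME": "Borg", "DNO": 5, "BDATE": "1937-11-10", "SALARY": 55000},
])

tree = Project(
    ["SSN", "LNAME", "DNO"],
    Select(
        lambda r: r["DNO"] == 1 or (r["BDATE"] > "1955-01-01" and r["SALARY"] > 30000),
        employee,
    ),
)

result = execute(tree)
```

The root of the tree is the projection, so it is the last operation executed; the leaf (EMPLOYEE) is read first.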
Query graph
– Relations in the query are represented by relation nodes, which are displayed as single circles.
– Constant values, typically from the query selection conditions, are represented by constant nodes, which are displayed as double circles or ovals.
– Selection and join conditions are represented by the graph edges.
– The attributes to be retrieved from each relation are displayed in square brackets above each relation.
Query Tree (figure not reproduced). Recoverable labels for the example query SELECT SSN, LNAME, DNO …: attribute list [SSN, LNAME, DNO]; conditions DNO=1, BDATE>'01/01/1955', SALARY>30000.
4.2 Translating SQL Queries into Relational Algebra
An SQL query is first translated into an equivalent extended relational algebra expression, represented as a query tree data structure, that is then optimized.

SQL clause                                 Relational operation                            Meaning
FROM a single table                        (none)                                          Input table
FROM table1, table2                        table1 × table2                                 Cartesian product
FROM table1 JOIN table2 ON conditions      table1 ⋈<conditions> table2                     Theta join
WHERE conditions                           σ<conditions>                                   Selection
SELECT an attribute list                   π<an attribute list>                            Projection
SELECT a function list …                   <a grouping attribute list> ℑ <function list>   Aggregation
  [GROUP BY a grouping attribute list]
SQL queries are decomposed into query blocks, the basic units that can be translated into the algebraic operators and optimized.
– A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses if these are part of the block.
– Nested queries within a query are identified as separate query blocks.
– Aggregate operators (MAX, MIN, COUNT, SUM) must be included in the extended algebra.
Retrieve the names of employees (from any department in
the company) who earn a salary that is greater than the
highest salary in department 5
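A sketch of how this request decomposes into two query blocks, under stated assumptions: the attribute names Fname, Lname, Salary, and Dno follow the EMPLOYEE schema, "names" is read as Lname and Fname, and the symbol c (the value returned by the nested block) is introduced here only for illustration. The nested block is evaluated first and its result is used as a constant in the outer block.

```latex
\begin{align*}
\text{inner block:}\quad & c \leftarrow \Im_{\mathrm{MAX}\ Salary}\bigl(\sigma_{Dno=5}(\mathrm{EMPLOYEE})\bigr) \\
\text{outer block:}\quad & \pi_{Lname,\,Fname}\bigl(\sigma_{Salary > c}(\mathrm{EMPLOYEE})\bigr)
\end{align*}
```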
4.3 Algorithms for External Sorting
Sorting is one of the primary algorithms used in query processing:
– the ORDER BY clause
– sort-merge algorithms for JOIN and set operations
– duplicate elimination algorithms for the PROJECT operation (DISTINCT in the SELECT clause)
External sorting refers to sorting algorithms that are suitable for large files of records stored on disk that do not fit entirely in main memory.
Sort-merge strategy: starts by sorting small subfiles (runs) of the main file and then merges the sorted runs, creating larger sorted subfiles that are merged in turn.
– Sorting phase: nR = ⌈b / nB⌉
– Merging phase: dM = min(nB − 1, nR)
                 nP = ⌈log_dM(nR)⌉
b: number of file blocks
nB: available buffer space (in blocks)
nR: number of initial runs
dM: degree of merging
nP: number of merge passes
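As a quick check of these formulas, the small helper below computes nR, dM, and nP. The function name and the sample values b = 1024 and nB = 5 are illustrative assumptions, not figures from the slides. The number of passes is counted by repeatedly dividing the number of runs by dM, which gives the same value as the ceiling of the logarithm without floating-point rounding issues.

```python
def external_sort_parameters(b, n_b):
    """Return (nR, dM, nP) for a file of b blocks and a buffer of n_b blocks."""
    if n_b < 3:
        raise ValueError("need at least 3 buffer blocks to merge 2 runs")
    n_r = (b + n_b - 1) // n_b          # nR = ceil(b / nB)
    d_m = min(n_b - 1, n_r)             # dM = min(nB - 1, nR)
    # nP = ceil(log_dM(nR)), counted by repeated division instead of math.log
    n_p, runs = 0, n_r
    while runs > 1:
        runs = (runs + d_m - 1) // d_m
        n_p += 1
    return n_r, d_m, n_p


# Illustrative values: 1024 file blocks, 5 buffer blocks.
n_r, d_m, n_p = external_sort_parameters(1024, 5)
```

With these values the runs shrink as 205, 52, 13, 4, 1, i.e., four merge passes at degree 4.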
set i ← 1;
    j ← b;              /* size of the file in blocks */
    k ← nB;             /* size of buffer in blocks */
    m ← ⌈j/k⌉;          /* the number of runs */
/* Sort phase */
while (i <= m) do
{
    read next k blocks of the file into the buffer or, if there are less than k blocks remaining, then read in the remaining blocks;
    sort the records in the buffer and write as a temporary subfile;
    i ← i + 1;
}
/* Merge phase: merge subfiles until only one remains */
set i ← 1;
    p ← ⌈log_(k-1) m⌉;  /* p: number of passes in the merging phase */
    j ← m;              /* the number of runs */
while (i <= p) do
{
    n ← 1;
    q ← ⌈j/(k-1)⌉;      /* number of subfiles to write in this pass */
    while (n <= q) do
    {
        read next k-1 subfiles or remaining subfiles (from previous pass) one block at a time;
        merge and write as new subfile one block at a time;
        n ← n + 1;
    }
    j ← q;
    i ← i + 1;
}
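The two phases can be simulated in memory with a short Python sketch. It is a simplification under stated assumptions: the "file" is a list of blocks (each block a list of records), runs are kept as Python lists rather than temporary disk files, and the standard-library `heapq.merge` stands in for the block-at-a-time merge. The function name `external_sort` and the sample data are hypothetical.

```python
import heapq


def external_sort(blocks, n_b):
    """Sort-merge over a list of blocks using a buffer of n_b blocks."""
    if n_b < 3:
        raise ValueError("need at least 3 buffer blocks to merge 2 runs")

    # Sort phase: read up to n_b blocks, sort them, write one run.
    runs = []
    for start in range(0, len(blocks), n_b):
        buffer = [rec for blk in blocks[start:start + n_b] for rec in blk]
        buffer.sort()
        runs.append(buffer)

    # Merge phase: merge up to n_b - 1 runs at a time until one remains.
    while len(runs) > 1:
        next_runs = []
        for start in range(0, len(runs), n_b - 1):
            group = runs[start:start + n_b - 1]
            next_runs.append(list(heapq.merge(*group)))
        runs = next_runs

    return runs[0] if runs else []


# Hypothetical file of 7 blocks with 2 records each, buffer of 3 blocks.
file_blocks = [[9, 3], [7, 1], [8, 2], [6, 4], [5, 0], [11, 10], [13, 12]]
sorted_records = external_sort(file_blocks, 3)
```

Here the sort phase produces three runs, and two merge passes at degree 2 reduce them to one.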
4.4 Algorithms for SELECT and JOIN Operations
CREATE TABLE EMPLOYEE (
Fname VARCHAR(15) NOT NULL,
Lname VARCHAR(15) NOT NULL,
Ssn CHAR(9) NOT NULL,
…
CONSTRAINT EMPDEPTFK FOREIGN KEY(Dno) REFERENCES DEPARTMENT(Dnumber)
ON DELETE SET DEFAULT ON UPDATE CASCADE);
CREATE TABLE DEPARTMENT (
Dname VARCHAR(15) NOT NULL,
Dnumber INT NOT NULL,
Mgr_ssn CHAR(9) NOT NULL,
Mgr_start_date DATE,
PRIMARY KEY (Dnumber),
UNIQUE (Dname),
FOREIGN KEY (Mgr_ssn) REFERENCES EMPLOYEE(Ssn) );
CREATE TABLE WORKS_ON (
Essn CHAR(9) NOT NULL,
Pno INT NOT NULL,
Hours DECIMAL(3,1) NOT NULL,
PRIMARY KEY (Essn, Pno),
FOREIGN KEY (Essn) REFERENCES EMPLOYEE(Ssn),
… );
Given the tables, some examples for selection:
OP1: σSSN='123456789'(EMPLOYEE)
OP2: σDNUMBER>5(DEPARTMENT)
OP3: σDNO=5(EMPLOYEE)
OP4: σDNO=5 AND SALARY>30000 AND SEX='F' (EMPLOYEE)
OP4': σDno=5 OR Salary > 30000 OR Sex ='F' (EMPLOYEE)
OP5: σESSN='123456789' AND PNO=10(WORKS_ON)
OP6: σDNO IN (3, 27, 49)(EMPLOYEE)
OP7: σ((Salary*Commission_pct) + Salary ) > 5000(EMPLOYEE)
In SQL, each selection has the form:
SELECT * FROM TABLE WHERE CONDITIONs;
Implementing the SELECT Operation: Search methods
S1 Linear search (brute force): Retrieve every record in the file, and test whether its attribute values satisfy the selection condition.
S2 Binary search: If the selection condition involves an equality comparison on a key attribute on which the file is ordered, binary search (which is more efficient than linear search) can be used.
S3 Using a primary index or hash key to retrieve a single record: If the selection condition involves an equality comparison on a key attribute with a primary index (or a hash key), use the primary index (or the hash key) to retrieve the record.
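The difference between linear search, binary search on an ordered key, and a hash-key lookup can be sketched for a selection like OP1. This is a toy in-memory model with assumptions: records are Python dicts, the "file" is a list ordered on SSN, a dict plays the role of the hash key, and the function names and sample rows are made up for the example.

```python
def linear_search(records, attr, value):
    """Brute force: test every record against the condition."""
    return [r for r in records if r[attr] == value]


def binary_search(ordered_records, key_attr, value):
    """Equality on a key attribute on which the file is ordered."""
    lo, hi = 0, len(ordered_records) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        key = ordered_records[mid][key_attr]
        if key == value:
            return ordered_records[mid]
        if key < value:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


# Hypothetical EMPLOYEE file, ordered on the key attribute SSN.
employees = [
    {"SSN": "123456789", "LNAME": "Smith"},
    {"SSN": "333445555", "LNAME": "Wong"},
    {"SSN": "453453453", "LNAME": "English"},
    {"SSN": "987654321", "LNAME": "Wallace"},
]

# A dict stands in for a hash key on SSN: one probe retrieves the record.
ssn_hash = {r["SSN"]: r for r in employees}

by_linear = linear_search(employees, "SSN", "123456789")
by_binary = binary_search(employees, "SSN", "123456789")
by_hash = ssn_hash.get("123456789")
```

All three return the same record; they differ in how many records (and, on disk, how many blocks) must be examined to find it.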
S4 Using a primary index to retrieve multiple records: If the comparison condition is >, ≥, <, or ≤ on a key field with a primary index, use the index to find the record satisfying the corresponding equality condition, then retrieve all subsequent records in the (ordered) file.
S5 Using a clustering index to retrieve multiple records: If the selection condition involves an equality comparison on a non-key attribute with a clustering index, use the clustering index to retrieve all the records satisfying the selection condition.
S6 Using a secondary (B+-tree) index: On an equality comparison, this search method can be used to retrieve a single record if the indexing field has unique values (is a key) or to retrieve multiple records if the indexing field is not a key. In addition, it can be used to retrieve records on conditions involving >, >=, <, or <= (for range queries).
S7.a Using a bitmap index: If the selection condition involves a set of values for an attribute, the corresponding bitmaps for each value can be OR-ed to give the set of record identifiers that qualify.
S7.b Using a functional index: If there is a functional index defined, this index can be used to retrieve all the records that qualify.
CREATE INDEX income_ix
ON EMPLOYEE (Salary + (Salary*Commission_pct));
This index can be used for OP7.
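The bitmap OR-ing described above can be sketched for a selection like OP6 (DNO IN (3, 27, 49)). This is a toy model with assumptions: each bitmap is a Python integer whose bit i is set when record i has that value, and the function names and the sample DNO column are hypothetical.

```python
def build_bitmaps(records, attr):
    """One bitmap per distinct value; bit i is set if record i has that value."""
    bitmaps = {}
    for rid, rec in enumerate(records):
        bitmaps[rec[attr]] = bitmaps.get(rec[attr], 0) | (1 << rid)
    return bitmaps


def rids_for_values(bitmaps, values):
    """OR the bitmaps of the listed values and decode the record identifiers."""
    combined = 0
    for v in values:
        combined |= bitmaps.get(v, 0)
    return [rid for rid in range(combined.bit_length()) if (combined >> rid) & 1]


# Hypothetical DNO column of six EMPLOYEE records (record ids 0..5).
employees = [{"DNO": d} for d in (3, 5, 27, 49, 5, 3)]
dno_bitmaps = build_bitmaps(employees, "DNO")

qualifying_rids = rids_for_values(dno_bitmaps, (3, 27, 49))
```

Only the records whose identifiers come out of the OR need to be fetched from the data file.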
S8 Conjunctive (AND) selection using an individual index: If an attribute involved in any single simple condition in the conjunctive condition has an access path that permits the use of one of the methods S2 to S6, use that condition to retrieve the records and then check whether each retrieved record satisfies the remaining simple conditions in the conjunctive condition.
S9 Conjunctive (AND) selection using a composite index: If two or more attributes are involved in equality conditions in the conjunctive condition and a composite index (or hash structure) exists on the combined field, we can use the index directly.
S10 Conjunctive (AND) selection by intersection of record pointers: This method is possible if secondary indexes are available on all (or some of) the fields involved in equality comparison conditions in the conjunctive condition and if the indexes include record pointers (rather than block pointers). Each index can be used to retrieve the record pointers that satisfy the individual condition. The intersection of these sets of record pointers gives the record pointers that satisfy the conjunctive condition, which are then used to retrieve those records directly. If only some of the conditions have secondary indexes, each retrieved record is further tested to determine whether it satisfies the remaining conditions.
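The intersection of record pointers can be sketched for a selection like OP4 (DNO=5 AND SALARY>30000 AND SEX='F'). The sketch assumes secondary indexes exist only on DNO and SEX, so the SALARY condition is tested on each retrieved record. Indexes are modeled as dicts from value to a set of record positions; the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def build_secondary_index(records, attr):
    """Map each attribute value to the set of record pointers (list positions)."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        index[rec[attr]].add(rid)
    return index


# Hypothetical EMPLOYEE records.
employees = [
    {"SSN": "1", "DNO": 5, "SALARY": 40000, "SEX": "F"},
    {"SSN": "2", "DNO": 5, "SALARY": 25000, "SEX": "F"},
    {"SSN": "3", "DNO": 5, "SALARY": 38000, "SEX": "M"},
    {"SSN": "4", "DNO": 4, "SALARY": 43000, "SEX": "F"},
    {"SSN": "5", "DNO": 5, "SALARY": 31000, "SEX": "F"},
]
dno_index = build_secondary_index(employees, "DNO")
sex_index = build_secondary_index(employees, "SEX")

# Intersect the pointer sets of the two indexed equality conditions.
candidate_rids = dno_index.get(5, set()) & sex_index.get("F", set())

# Retrieve the candidates and test the remaining (non-indexed) condition.
result = [
    employees[rid]
    for rid in sorted(candidate_rids)
    if employees[rid]["SALARY"] > 30000
]
```

Only the three candidate records are fetched, instead of all five.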
Disjunctive (OR) selection conditions: With a disjunctive selection condition, the records satisfying the disjunctive condition are the union of the records satisfying the individual conditions. Hence, if any one of the conditions does not have an access path, we are compelled to use the brute force, linear search approach. Only if an access path exists on every simple condition in the disjunction can we optimize the selection by retrieving the records satisfying each condition (or their record identifiers) and then applying the union operation to eliminate duplicates.
Algorithms for SELECT: σDno=5 OR Salary > 30000 OR Sex ='F' (EMPLOYEE)
Linear search: each block of the data file is loaded into the buffer, and the records there are checked against all the conditions in the OR.
With an access path on every condition: each condition is evaluated separately (Result 1, Result 2, Result 3), and the results are combined by union.
Whenever a single condition specifies the selection, we can only check whether an access path exists on the attribute involved in that condition. If an access path exists, the method corresponding to that access path is used; otherwise, the "brute force" linear search approach of method S1 is used.
The query optimizer must choose the appropriate method for executing each SELECT operation in a query.
– This optimization uses formulas that estimate the costs for each available access method.
– The optimizer chooses the access method with the lowest estimated cost.
Given the EMPLOYEE and DEPARTMENT tables:
OP1: σSSN='123456789'(EMPLOYEE)
OP2: σDNUMBER>5(DEPARTMENT)
OP3: σDNO=5(EMPLOYEE)
OP4: σDNO=5 AND SALARY>30000 AND SEX='F' (EMPLOYEE)
OP4': σDno=5 OR Salary > 30000 OR Sex ='F' (EMPLOYEE)
OP5: σESSN='123456789' AND PNO=10(WORKS_ON)
OP6: σDNO IN (3, 27, 49)(EMPLOYEE)
OP7: σ((Salary*Commission_pct) + Salary ) > 5000(EMPLOYEE)
Which search method should be used for each operation?
Implementing the JOIN Operation
Join (EQUIJOIN, NATURAL JOIN)
– two-way join: a join on two files
  e.g. R ⋈A=B S
– multi-way join: a join involving more than two files
  e.g. R ⋈A=B S ⋈C=D T
Examples
OP8: EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT
OP9: DEPARTMENT ⋈MGR_SSN=SSN EMPLOYEE
In SQL:
SELECT * FROM R JOIN S ON A=B;
J1 Nested-loop join (brute force): For each record t in R (outer loop), retrieve every record s from S (inner loop) and test whether the two records satisfy the join condition t[A] = s[B].
for each record t in each block of R
    for each record s in each block of S
        if (t[A] = s[B])
            add (t, s) into the result
How many block accesses are needed with a (memory) buffer? Which table (large or small) should be on the outer loop?
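The nested-loop pseudocode above translates almost line for line into Python. The sketch below is an in-memory illustration under assumptions: each file is a list of blocks, each block a list of dict records, and the function name and sample rows are hypothetical. It joins EMPLOYEE and DEPARTMENT on DNO = DNUMBER, as in OP8.

```python
def nested_loop_join(r_blocks, s_blocks, a, b):
    """For each record t of R, scan every record s of S and test t[a] == s[b]."""
    result = []
    for r_block in r_blocks:                # outer loop over blocks of R
        for t in r_block:
            for s_block in s_blocks:        # inner loop over blocks of S
                for s in s_block:
                    if t[a] == s[b]:
                        result.append({**t, **s})   # combined record (t, s)
    return result


# Hypothetical files: two blocks each.
employee_blocks = [
    [{"LNAME": "Smith", "DNO": 5}, {"LNAME": "Wong", "DNO": 5}],
    [{"LNAME": "Zelaya", "DNO": 4}, {"LNAME": "Borg", "DNO": 1}],
]
department_blocks = [
    [{"DNAME": "Research", "DNUMBER": 5}, {"DNAME": "Administration", "DNUMBER": 4}],
    [{"DNAME": "Headquarters", "DNUMBER": 1}],
]

joined = nested_loop_join(employee_blocks, department_blocks, "DNO", "DNUMBER")
pairs = [(row["LNAME"], row["DNAME"]) for row in joined]
```

Every record of the inner file is examined once per record of the outer file, which is why the number of times the inner file is read matters for the cost.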
Example for OP8: EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT
The number of blocks of EMPLOYEE: bE = 2000 blocks
The number of blocks of DEPARTMENT: bD = 10 blocks
Buffer size: nB = 7 blocks
J1 Nested-loop join (brute force): cost for OP8
OP8: EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT
bE = 2000 blocks, bD = 10 blocks, nB = 7 blocks
Buffer: nB − 2 blocks for the outer file, 1 block for the inner file, 1 block for the result
J1.1 EMPLOYEE on the outer loop
Cost = bE + bD × ⌈bE/(nB − 2)⌉ = 2000 + 10 × ⌈2000/(7 − 2)⌉ = 6000 block accesses
J1.2 DEPARTMENT on the outer loop
Cost = bD + bE × ⌈bD/(nB − 2)⌉ = 10 + 2000 × ⌈10/(7 − 2)⌉ = 4010 block accesses
Use the smaller file on the outer loop.
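The two costs can be reproduced with a small helper. The function name is an assumption; the formula is the one used above (outer file read once, inner file read once per group of nB − 2 outer blocks), and the cost of writing the result is not included.

```python
def nested_loop_cost(b_outer, b_inner, n_b):
    """Block accesses for a nested-loop join with n_b buffer blocks."""
    if n_b < 3:
        raise ValueError("need at least 3 buffer blocks")
    # ceil(b_outer / (n_b - 2)): how many times the inner file is scanned
    inner_scans = (b_outer + (n_b - 2) - 1) // (n_b - 2)
    return b_outer + inner_scans * b_inner


b_e, b_d, n_b = 2000, 10, 7
cost_employee_outer = nested_loop_cost(b_e, b_d, n_b)     # J1.1
cost_department_outer = nested_loop_cost(b_d, b_e, n_b)   # J1.2
```

With DEPARTMENT on the outer loop the large file is scanned only twice, which is where the saving comes from.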
J2 Single-loop join (using an access structure to retrieve the matching records): If an index (or hash key) exists for one of the two join attributes (say, B of S), retrieve each record t in R (loop over R) and then use the access structure to retrieve directly all matching records s from S that satisfy s[B] = t[A].
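The index-based join described above can be sketched as follows. It is an in-memory illustration under assumptions: a dict from attribute value to a list of records plays the role of the index (or hash key) on B of S, and the function names and sample rows are hypothetical. The example joins EMPLOYEE with DEPARTMENT on DNO = DNUMBER, probing an index on DNUMBER.

```python
from collections import defaultdict


def build_index(records, attr):
    """Access structure on attr: value -> list of matching records."""
    index = defaultdict(list)
    for rec in records:
        index[rec[attr]].append(rec)
    return index


def single_loop_join(r_records, s_index, a):
    """Loop over R once; probe the index on S for records with s[B] = t[A]."""
    result = []
    for t in r_records:
        for s in s_index.get(t[a], []):
            result.append({**t, **s})
    return result


# Hypothetical data.
employees = [
    {"LNAME": "Smith", "DNO": 5},
    {"LNAME": "Zelaya", "DNO": 4},
    {"LNAME": "Jabbar", "DNO": 9},     # no matching department
]
departments = [
    {"DNAME": "Research", "DNUMBER": 5},
    {"DNAME": "Administration", "DNUMBER": 4},
]

dnumber_index = build_index(departments, "DNUMBER")
joined = single_loop_join(employees, dnumber_index, "DNO")
pairs = [(row["LNAME"], row["DNAME"]) for row in joined]
```

Unlike the nested-loop join, S is never scanned in full; each record of R costs one index probe.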
4.4 Algorithms for SELECT and