
Chapter 21

Query Processing

Chapter Objectives

In this chapter you will learn:

• The objectives of query processing and optimization.
• Static versus dynamic query optimization.
• How to create a relational algebra tree to represent a query.
• The rules of equivalence for the relational algebra operations.
• How to apply heuristic transformation rules to improve the efficiency of a query.
• The types of database statistics required to estimate the cost of operations.
• The different strategies for implementing the relational algebra operations.
• How to evaluate the cost and size of the relational algebra operations.
• How pipelining can be used to improve the efficiency of queries.
• The difference between materialization and pipelining.
• The advantages of left-deep trees.
• Approaches for finding the optimal execution strategy.

When the relational model was first launched commercially, one of the major criticisms often cited was inadequate performance of queries. Since then, a significant amount of research has been devoted to developing highly efficient algorithms for processing queries. There are many ways in which a complex query can be performed, and one of the aims of query processing is to determine which one is the most cost-effective.

In first-generation network and hierarchical database systems, the low-level procedural query language is generally embedded in a high-level programming language such as COBOL, and it is the programmer's responsibility to select the most appropriate execution strategy. In contrast, with declarative languages such as SQL, the user specifies what data is required rather than how it is to be retrieved. This relieves the user of the responsibility of determining, or even knowing, what constitutes a good execution strategy and makes the language more universally usable. Additionally, giving the DBMS the responsibility for selecting the best strategy prevents users from choosing strategies that are known to be inefficient and gives the DBMS more control over system performance.

There are two main techniques for query optimization, although the two strategies are usually combined in practice. The first technique uses heuristic rules that order the operations in a query. The other technique compares different strategies based on their relative costs and selects the one that minimizes resource usage. Since disk access is slow compared with memory access, disk access tends to be the dominant cost in query processing for a centralized DBMS, and it is the one that we concentrate on exclusively in this chapter when providing cost estimates.

Structure of this Chapter

In Section 21.1 we provide an overview of query processing and examine the main phases of this activity. In Section 21.2 we examine the first phase of query processing, namely query decomposition, which transforms a high-level query into a relational algebra query and checks that it is syntactically and semantically correct. In Section 21.3 we examine the heuristic approach to query optimization, which orders the operations in a query using transformation rules that are known to generate good execution strategies. In Section 21.4 we discuss the cost estimation approach to query optimization, which compares different strategies based on their relative costs and selects the one that minimizes resource usage. In Section 21.5 we discuss pipelining, which is a technique that can be used to further improve the processing of queries. Pipelining allows several operations to be performed in parallel, rather than requiring one operation to be complete before another can start. We also discuss how a typical query processor may choose an optimal execution strategy. In the final section, we briefly examine how Oracle performs query optimization.

In this chapter we concentrate on techniques for query processing and optimization in centralized relational DBMSs, being the area that has attracted most effort and the model that we focus on in this book. However, some of the techniques are generally applicable to other types of system that have a high-level interface. Later, in Section 23.7 we briefly examine query processing for distributed DBMSs. In Section 28.5 we see that some of the techniques we examine in this chapter may require further consideration for the Object-Relational DBMS, which supports queries containing user-defined types and user-defined functions.

The reader is expected to be familiar with the concepts covered in Section 4.1 on the relational algebra and Appendix C on file organizations. The examples in this chapter are drawn from the DreamHome case study described in Section 10.4 and Appendix A.

Overview of Query Processing

Query processing: The activities involved in parsing, validating, optimizing, and executing a query.


The aims of query processing are to transform a query written in a high-level language, typically SQL, into a correct and efficient execution strategy expressed in a low-level language (implementing the relational algebra), and to execute the strategy to retrieve the required data.

Query optimization: The activity of choosing an efficient execution strategy for processing a query.

An important aspect of query processing is query optimization. As there are many equivalent transformations of the same high-level query, the aim of query optimization is to choose the one that minimizes resource usage. Generally, we try to reduce the total execution time of the query, which is the sum of the execution times of all individual operations that make up the query (Selinger et al., 1979). However, resource usage may also be viewed as the response time of the query, in which case we concentrate on maximizing the number of parallel operations (Valduriez and Gardarin, 1984). Since the problem is computationally intractable with a large number of relations, the strategy adopted is generally reduced to finding a near-optimum solution (Ibaraki and Kameda, 1984).
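To give a rough feel for why an exhaustive search becomes impractical, the following sketch counts join orderings. It assumes we restrict attention to left-deep orderings, in which n relations can be arranged in n! different sequences; the function name is hypothetical and the count ignores the further choice of algorithm for each join.

```python
from math import factorial


def left_deep_join_orders(n_relations: int) -> int:
    """Number of left-deep join orderings of n relations (n!).

    Assumption for illustration: only the order of the relations is counted,
    not the choice of join algorithm or access path.
    """
    if n_relations < 1:
        raise ValueError("need at least one relation")
    return factorial(n_relations)


if __name__ == "__main__":
    for n in (2, 5, 10, 15):
        print(n, left_deep_join_orders(n))
```

Even under this restricted count the number of candidate orderings grows factorially, which is consistent with optimizers settling for a near-optimum strategy.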

Both methods of query optimization depend on database statistics to evaluate properly the different options that are available. The accuracy and currency of these statistics have a significant bearing on the efficiency of the execution strategy chosen. The statistics cover information about relations, attributes, and indexes. For example, the system catalog may store statistics giving the cardinality of relations, the number of distinct values for each attribute, and the number of levels in a multilevel index (see Appendix C.5.4).

Keeping the statistics current can be problematic. If the DBMS updates the statistics every time a tuple is inserted, updated, or deleted, this would have a significant impact on performance during peak periods. An alternative, and generally preferable, approach is to update the statistics on a periodic basis, for example nightly, or whenever the system is idle. Another approach taken by some systems is to make it the users' responsibility to indicate when the statistics are to be updated. We discuss database statistics in more detail in Section 21.4.1.
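As a minimal sketch of the periodic approach, the code below recomputes two of the statistics mentioned above (cardinality and the number of distinct values per attribute) in a single batch pass, as a nightly job might. The class, the function, and the sample tuples are hypothetical and are not taken from any particular DBMS catalog.

```python
from dataclasses import dataclass, field


@dataclass
class RelationStats:
    """Hypothetical catalog entry; the field names are illustrative only."""
    cardinality: int = 0
    distinct_values: dict = field(default_factory=dict)


def collect_stats(rows):
    """Recompute statistics from scratch in one pass over the relation.

    Running this periodically avoids paying the maintenance cost on
    every insert, update, or delete.
    """
    seen = {}
    for row in rows:
        for attr, value in row.items():
            seen.setdefault(attr, set()).add(value)
    return RelationStats(
        cardinality=len(rows),
        distinct_values={attr: len(values) for attr, values in seen.items()},
    )


# Made-up sample tuples, for illustration only.
staff = [
    {"staffNo": "S1", "position": "Manager", "branchNo": "B1"},
    {"staffNo": "S2", "position": "Assistant", "branchNo": "B1"},
    {"staffNo": "S3", "position": "Manager", "branchNo": "B2"},
]

if __name__ == "__main__":
    stats = collect_stats(staff)
    print(stats.cardinality, stats.distinct_values)
```

The trade-off is that between refreshes the stored figures may lag behind the data, so the optimizer works from slightly stale estimates.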

As an illustration of the effects of different processing strategies on resource usage, we start with an example.

Example 21.1 Comparison of different processing strategies

Find all Managers who work at a London branch.

We can write this query in SQL as:

SELECT *
FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND
      (s.position = 'Manager' AND b.city = 'London');


Three equivalent relational algebra queries corresponding to this SQL statement are:

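(1) $\sigma_{(\text{position} = \text{'Manager'}) \land (\text{city} = \text{'London'}) \land (\text{Staff.branchNo} = \text{Branch.branchNo})}(\text{Staff} \times \text{Branch})$

(2) $\sigma_{(\text{position} = \text{'Manager'}) \land (\text{city} = \text{'London'})}(\text{Staff} \bowtie_{\text{Staff.branchNo} = \text{Branch.branchNo}} \text{Branch})$

(3) $(\sigma_{\text{position} = \text{'Manager'}}(\text{Staff})) \bowtie_{\text{Staff.branchNo} = \text{Branch.branchNo}} (\sigma_{\text{city} = \text{'London'}}(\text{Branch}))$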
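The equivalence of these three expressions can be checked on a small instance. The sketch below evaluates each one over a few made-up tuples held in Python lists; the sample data and helper names are assumptions for illustration, and only the attributes used by the query are modelled.

```python
# Made-up sample tuples, for illustration only.
staff = [
    {"staffNo": "S1", "position": "Manager", "branchNo": "B1"},
    {"staffNo": "S2", "position": "Assistant", "branchNo": "B1"},
    {"staffNo": "S3", "position": "Manager", "branchNo": "B2"},
]
branch = [
    {"branchNo": "B1", "city": "London"},
    {"branchNo": "B2", "city": "Glasgow"},
]


def product(r, s):
    """Cartesian product: every pairing of a tuple of r with a tuple of s."""
    return [(a, b) for a in r for b in s]


def join(r, s):
    """Equijoin on branchNo."""
    return [(a, b) for a in r for b in s if a["branchNo"] == b["branchNo"]]


def manager_in_london(pair):
    a, b = pair
    return a["position"] == "Manager" and b["city"] == "London"


# (1) Selection over the Cartesian product.
q1 = [p for p in product(staff, branch)
      if manager_in_london(p) and p[0]["branchNo"] == p[1]["branchNo"]]

# (2) Selection over the join.
q2 = [p for p in join(staff, branch) if manager_in_london(p)]

# (3) Selections first, then the join of the reduced relations.
managers = [a for a in staff if a["position"] == "Manager"]
london = [b for b in branch if b["city"] == "London"]
q3 = join(managers, london)


def key(pairs):
    """Order-independent summary of a result, for comparison."""
    return sorted((a["staffNo"], b["branchNo"]) for a, b in pairs)


if __name__ == "__main__":
    print(key(q1), key(q2), key(q3))
```

All three produce the same result on this instance; they differ only in how many intermediate tuples are built along the way.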

For the purposes of this example, we assume that there are 1000 tuples in Staff, 50 tuples in Branch, 50 Managers (one for each branch), and 5 London branches. We compare these three queries based on the number of disk accesses required. For simplicity, we assume that there are no indexes or sort keys on either relation, and that the results of any intermediate operations are stored on disk. The cost of the final write is ignored, as it is the same in each case. We further assume that tuples are accessed one at a time (although in practice disk accesses would be based on blocks, which would typically contain several tuples), and main memory is large enough to process entire relations for each relational algebra operation.

The first query calculates the Cartesian product of Staff and Branch, which requires (1000 + 50) disk accesses to read the relations, and creates a relation with (1000 * 50) tuples. We then have to read each of these tuples again to test them against the selection predicate at a cost of another (1000 * 50) disk accesses, giving a total cost of:

(1000 + 50) + 2*(1000 * 50) = 101 050 disk accesses

The second query joins Staff and Branch on the branch number branchNo, which again requires (1000 + 50) disk accesses to read each of the relations. We know that the join of the two relations has 1000 tuples, one for each member of staff (a member of staff can only work at one branch). Consequently, the Selection operation requires 1000 disk accesses to read the result of the join, giving a total cost of:

2*1000 + (1000 + 50) = 3050 disk accesses

The final query first reads each Staff tuple to determine the Manager tuples, which requires 1000 disk accesses and produces a relation with 50 tuples. The second Selection operation reads each Branch tuple to determine the London branches, which requires 50 disk accesses and produces a relation with 5 tuples. The final operation is the join of the reduced Staff and Branch relations, which requires (50 + 5) disk accesses, giving a total cost of:

1000 + 2*50 + 5 + (50 + 5) = 1160 disk accesses

Clearly the third option is the best in this case, by a factor of 87:1. If we increased the number of tuples in Staff to 10 000 and the number of branches to 500, the improvement would be by a factor of approximately 870:1. Intuitively, we may have expected this as the Cartesian product and Join operations are much more expensive than the Selection operation, and the third option significantly reduces the size of the relations that are being joined together. We will see shortly that one of the fundamental strategies in query processing is to perform the unary operations, Selection and Projection, as early as possible, thereby reducing the operands of any subsequent binary operations.
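The arithmetic in this example can be reproduced with a short sketch. The function and parameter names are hypothetical. The costs follow the stated simplifications: one disk access per tuple, intermediate results written to disk and read back, and the final write ignored. For the larger case the sketch assumes one Manager per branch (500) and still 5 London branches, an assumption that reproduces the quoted factor of roughly 870:1.

```python
def cost_q1(n_staff, n_branch):
    """Selection over the Cartesian product."""
    read_inputs = n_staff + n_branch
    product_size = n_staff * n_branch
    # The product is written once and then read back for the Selection.
    return read_inputs + 2 * product_size


def cost_q2(n_staff, n_branch):
    """Selection over the join; the join has one tuple per member of staff."""
    read_inputs = n_staff + n_branch
    join_size = n_staff
    # The join result is written once and then read back for the Selection.
    return read_inputs + 2 * join_size


def cost_q3(n_staff, n_branch, n_managers, n_london):
    """Selections first, then the join of the reduced relations."""
    select_staff = n_staff + n_managers    # read Staff, write Managers
    select_branch = n_branch + n_london    # read Branch, write London branches
    join_reads = n_managers + n_london     # read both reduced relations
    return select_staff + select_branch + join_reads


if __name__ == "__main__":
    small = (cost_q1(1000, 50), cost_q2(1000, 50), cost_q3(1000, 50, 50, 5))
    print(small, small[0] / small[2])

    # Assumed scaling: 500 Managers (one per branch), 5 London branches.
    large = (cost_q1(10_000, 500), cost_q3(10_000, 500, 500, 5))
    print(large, large[0] / large[1])
```

Writing the costs as functions of the relation sizes makes it easy to see that the first strategy grows with the product of the two cardinalities, whereas the third grows only with their sum.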


Query processing can be divided into four main phases: decomposition (consisting of parsing and validation), optimization, code generation, and execution, as illustrated in Figure 21.1. In Section 21.2 we briefly examine the first phase, decomposition, before turning our attention to the second phase, query optimization. To complete this overview, we briefly discuss when optimization may be performed.

Dynamic versus static optimization

There are two choices for when the first three phases of query processing can be carried out. One option is to dynamically carry out decomposition and optimization every time the query is run. The advantage of dynamic query optimization arises from the fact that all information required to select an optimum strategy is up to date. The disadvantages are that the performance of the query is affected because the query has to be parsed, validated, and optimized before it can be executed. Further, it may be necessary to reduce the number of execution strategies to be analyzed to achieve an acceptable overhead, which may have the effect of selecting a less than optimum strategy.

The alternative option is static query optimization, where the query is parsed, validated, and optimized once. This approach is similar to the approach taken by a compiler for a programming language. The advantages of static optimization are that the runtime

Figure 21.1 Phases of query processing.
