Distributed Database Management Systems: Lecture 30. The main topics covered in this chapter include: basic concepts of query optimization; QP in centralized and distributed DBs; query processor transforms complex queries into concise and simple ones;...
Distributed Database Management Systems
Lecture 30
In the previous lecture
based CC
• Query processing is a critical
performance issue, especially
in a DDBS environment
• The main function of the query processor (QP) is to
transform a high-level query (SQL) into an equivalent
lower-level one (relational algebra)
• The transformation must
achieve both correctness and efficiency
• Consider the tables
• EMP(eNo, eName, title)
• ASG(eNo, pNo, resp, dur)
• PROJ(pNo, pName, budget, loc)
• Query: Get the names of employees who are managing a project
• SELECT eName
FROM EMP, ASG
WHERE EMP.eNo = ASG.eNo
AND resp = 'Manager'
Π eName (σ resp='Manager' ∧ EMP.eNo=ASG.eNo (EMP × ASG))
Π eName (EMP ⋈ eNo (σ resp='Manager' (ASG)))
• The second expression clearly needs
fewer computing resources,
since it avoids the Cartesian product
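The difference between the two expressions can be made concrete with a small sketch. The rows below are hypothetical and exist only for illustration; the two functions mirror the two algebra expressions and also report the size of the largest intermediate result each one builds.

```python
# Toy relations (hypothetical rows, for illustration only).
EMP = [("E1", "Ann", "Engineer"), ("E2", "Bob", "Analyst"), ("E3", "Cy", "Engineer")]
ASG = [("E1", "P1", "Manager", 12), ("E2", "P1", "Analyst", 24), ("E3", "P2", "Manager", 6)]


def plan_cartesian(emp, asg):
    """Project eName after selecting on the full Cartesian product EMP x ASG."""
    product = [(e, a) for e in emp for a in asg]  # |EMP| * |ASG| pairs
    selected = [(e, a) for e, a in product if a[2] == "Manager" and e[0] == a[0]]
    return sorted({e[1] for e, _ in selected}), len(product)


def plan_select_first(emp, asg):
    """Select the managers in ASG first, then join the small result with EMP."""
    managers = [a for a in asg if a[2] == "Manager"]
    emp_by_no = {e[0]: e for e in emp}  # lookup table on eNo
    joined = [(emp_by_no[a[0]], a) for a in managers if a[0] in emp_by_no]
    return sorted({e[1] for e, _ in joined}), len(managers)


names_1, intermediate_1 = plan_cartesian(EMP, ASG)
names_2, intermediate_2 = plan_select_first(EMP, ASG)
```

Both plans return the same names, but the first materializes every EMP/ASG pair while the second only carries the manager tuples into the join.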
• The same query in a DDBS
• Suppose EMP and ASG
are horizontally fragmented (HF) as
• EMP1 = σ eNo ≤ 'E3' (EMP)
• EMP2 = σ eNo > 'E3' (EMP)
• ASG1 = σ eNo ≤ 'E3' (ASG)
• ASG2 = σ eNo > 'E3' (ASG)
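A minimal sketch of this horizontal fragmentation, assuming hypothetical EMP rows and a helper named fragment (neither is from the lecture). It splits a relation with the two eNo predicates and checks that the fragments are disjoint and that their union reconstructs the original relation.

```python
def fragment(relation, predicate):
    """Return the horizontal fragment: the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


# Hypothetical EMP rows (eNo, eName, title), for illustration only.
EMP = [
    ("E1", "Ann", "Engineer"),
    ("E2", "Bob", "Analyst"),
    ("E3", "Cy", "Engineer"),
    ("E4", "Dee", "Programmer"),
    ("E5", "Eli", "Analyst"),
]

EMP1 = fragment(EMP, lambda t: t[0] <= "E3")  # eNo <= 'E3'
EMP2 = fragment(EMP, lambda t: t[0] > "E3")   # eNo > 'E3'

# The two predicates are complementary, so every tuple lands in exactly one fragment.
is_disjoint = not (set(EMP1) & set(EMP2))
is_complete = sorted(EMP1 + EMP2) == sorted(EMP)
```

The comparison on eNo is a plain string comparison here, which works for identifiers of equal length such as 'E1' to 'E5'.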
• Further suppose that these
fragments are stored
at sites 1, 2, 3 and 4,
and that the result is expected at site 5
Site 1: ASG1' = σ resp='Manager' (ASG1)
Site 2: ASG2' = σ resp='Manager' (ASG2)
Site 3: EMP1' = EMP1 ⋈ eNo ASG1'
Site 4: EMP2' = EMP2 ⋈ eNo ASG2'
Site 5: result = (EMP1 ∪ EMP2) ⋈ eNo σ resp='Manager' (ASG1 ∪ ASG2)
Site 1: ASG1
Site 2: ASG2
Site 3: EMP1
Site 4: EMP2
Let's assume
• size(EMP) = 400
• size(ASG) = 1000
• tuple access cost = 1 unit
• tuple transfer cost = 10 units
• There are 20 managers
• Data is distributed evenly across all sites
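With these figures, the two execution strategies can be costed in a few lines. The sketch below is one possible accounting, not a definitive one, and it rests on assumptions that go beyond the list above: that the managers can be reached directly (for example through an index on resp), that each join in the first strategy costs two tuple accesses per manager tuple, and that the join at site 5 in the second strategy is a nested loop over EMP and the 20 selected tuples. All function and constant names are hypothetical.

```python
TUPLE_ACCESS = 1     # cost units per tuple access
TUPLE_TRANSFER = 10  # cost units per tuple transfer
SIZE_EMP = 400
SIZE_ASG = 1000
MANAGERS = 20
FRAGMENTS = 2        # each relation is split into two fragments


def cost_distributed():
    """Select and join at the fragment sites, ship only the small results."""
    per_site = MANAGERS // FRAGMENTS                       # 10 managers per ASG fragment
    select = FRAGMENTS * per_site * TUPLE_ACCESS           # assumed indexed selection
    ship_asg = FRAGMENTS * per_site * TUPLE_TRANSFER       # ASG1', ASG2' to the EMP sites
    join = FRAGMENTS * per_site * 2 * TUPLE_ACCESS         # assumed 2 accesses per manager
    ship_result = FRAGMENTS * per_site * TUPLE_TRANSFER    # EMP1', EMP2' to site 5
    return select + ship_asg + join + ship_result


def cost_ship_everything():
    """Ship all four fragments to site 5 and evaluate the query there."""
    ship_emp = SIZE_EMP * TUPLE_TRANSFER
    ship_asg = SIZE_ASG * TUPLE_TRANSFER
    select = SIZE_ASG * TUPLE_ACCESS                       # full scan of ASG at site 5
    join = SIZE_EMP * MANAGERS * TUPLE_ACCESS              # assumed nested-loop join
    return ship_emp + ship_asg + select + join


cost_a = cost_distributed()
cost_b = cost_ship_everything()
```

Under these assumptions the first strategy is cheaper by a wide margin, almost entirely because it transfers far fewer tuples.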
• Communication cost will
dominate in a WAN
• It is not that dominant in
LANs, so the total cost
should be considered in LANs
• Query optimization (QO) can also aim to maximize
throughput
Operators' Complexity
• Select, Project (without duplicate elimination): O(n)
• Project (with duplicate elimination), Group: O(n log n)
• Join, Semi-Join, Division, Set Operators: O(n log n)
• Cartesian Product: O(n²)
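The complexity classes above can be related to simple implementations. The sketch below is illustrative only: the function names are hypothetical, and sorting is just one way to reach O(n log n) for duplicate elimination.

```python
def project(rows, cols):
    """Projection without duplicate elimination: one pass over the input, O(n)."""
    return [tuple(r[c] for c in cols) for r in rows]


def project_distinct(rows, cols):
    """Projection with duplicate elimination: sort, then drop equal neighbours, O(n log n)."""
    result = []
    for t in sorted(project(rows, cols)):
        if not result or result[-1] != t:
            result.append(t)
    return result


def cartesian(r, s):
    """Cartesian product: every tuple of r is paired with every tuple of s, O(n^2)."""
    return [a + b for a in r for b in s]


rows = [(1, "a"), (2, "a"), (3, "b")]
with_duplicates = project(rows, [1])
without_duplicates = project_distinct(rows, [1])
pairs = cartesian([(1,), (2,)], [("x",), ("y",)])
```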
Characterization of Query Processors
• Types of Optimization
–Exhaustive search: compute the
cost of each strategy to find the optimal one
–May be very costly when there are
many options and more
fragments
–Heuristics
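The trade-off between the two approaches can be sketched for join ordering. Everything numeric below is an assumption made for illustration: the PROJ cardinality, the join selectivities, and the cost model (sum of intermediate result sizes for a left-deep plan). The exhaustive version costs every permutation; the heuristic version greedily picks the next relation that keeps the intermediate result smallest.

```python
from itertools import permutations

# Assumed cardinalities and join selectivities (hypothetical values).
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 100}
SEL = {
    frozenset(("EMP", "ASG")): 1 / 400,
    frozenset(("ASG", "PROJ")): 1 / 100,
}


def plan_cost(order):
    """Sum of intermediate result sizes for a left-deep join in the given order."""
    joined = [order[0]]
    size = CARD[order[0]]
    cost = 0.0
    for rel in order[1:]:
        sel = 1.0
        for prev in joined:
            # No entry means no join predicate, i.e. a Cartesian product.
            sel *= SEL.get(frozenset((prev, rel)), 1.0)
        size = size * CARD[rel] * sel
        cost += size
        joined.append(rel)
    return cost


def exhaustive(relations):
    """Cost every permutation; the work grows factorially with the relation count."""
    return min(permutations(relations), key=plan_cost)


def greedy(relations):
    """Heuristic: start from the smallest relation, then extend with the cheapest next join."""
    remaining = set(relations)
    order = [min(remaining, key=CARD.get)]
    remaining.remove(order[0])
    while remaining:
        nxt = min(remaining, key=lambda r: plan_cost(order + [r]))
        order.append(nxt)
        remaining.remove(nxt)
    return tuple(order)


best = exhaustive(["EMP", "ASG", "PROJ"])
quick = greedy(["EMP", "ASG", "PROJ"])
```

On three relations both approaches find an equally cheap plan; the heuristic simply examines far fewer candidates and offers no optimality guarantee in general.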
• Optimization Timing
–Static (before execution)
• Size of intermediate tables is not
always known
• Cost is justified by repeated
execution
–Dynamic (at run time)
• Size of intermediate tables is known
• Re-optimization may be required
• Statistics
–Relation/Fragment:
cardinality, size of a tuple,
fraction of tuples participating
in a join with another relation
–Attribute: cardinality of
domain, actual number of
distinct values
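A minimal sketch of how such statistics might feed size estimates, assuming uniformly distributed attribute values and a known join selectivity factor. The function names and the figure of 50 distinct resp values are hypothetical, chosen so that the estimate matches the 20 managers used earlier.

```python
def selection_cardinality(card_r, distinct_values):
    """Estimate the size of an equality selection, assuming uniform values."""
    return card_r / distinct_values


def join_cardinality(card_r, card_s, join_selectivity):
    """Estimate a join size as selectivity * card(R) * card(S)."""
    return join_selectivity * card_r * card_s


# ASG has 1000 tuples; assume 50 distinct values of resp (hypothetical).
managers = selection_cardinality(1000, 50)

# Assume each selected ASG tuple matches exactly one of the 400 EMP tuples,
# which corresponds to a join selectivity of 1/400.
result_size = join_cardinality(400, managers, 1 / 400)
```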
• Decision Sites
–Centralized: simple, but needs
knowledge about the entire
distributed database
–Distributed: cooperation among
sites to determine the schedule, needs only local information
–Hybrid: one site determines the
global schedule, each site
optimizes the local subqueries
• Other factors like:
Layers of distributed query processing (input or output of each layer, with the information it uses):
SQL Query on Distributed Relations
• QUERY DECOMPOSITION (uses GLOBAL SCHEMA)
Algebraic Query on Distributed Relations
• DATA LOCALIZATION (uses FRAGMENT SCHEMA)
Fragment Query
• GLOBAL OPTIMIZATION (uses STATS ON FRAGMENTS)
Optimized Fragment Query with Communication Operations
• LOCAL OPTIMIZATION (uses LOCAL SCHEMA)
Optimized Local Query
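The data localization layer can be sketched as a rewrite over an algebra tree: each global relation is replaced by the union of its fragments. The tuple-based tree encoding, the FRAGMENTS mapping and the localize function are assumptions made for this illustration, not part of the lecture.

```python
# Fragment schema: global relation name -> names of its horizontal fragments.
FRAGMENTS = {"EMP": ["EMP1", "EMP2"], "ASG": ["ASG1", "ASG2"]}


def localize(expr):
    """Replace every global relation in an algebra tree by the union of its fragments."""
    if isinstance(expr, str):
        frags = FRAGMENTS.get(expr)
        # Strings that are not fragmented relations (e.g. predicates) pass through unchanged.
        return ("union", *frags) if frags else expr
    op, *args = expr
    return (op, *[localize(a) for a in args])


# Algebraic query on distributed relations: EMP join (select resp='Manager' (ASG)).
algebraic_query = ("join", "EMP", ("select", "resp='Manager'", "ASG"))
fragment_query = localize(algebraic_query)
```

The resulting fragment query has the same shape as the site 5 expression over EMP1 ∪ EMP2 and ASG1 ∪ ASG2; the global optimization layer would then decide where each operation runs.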