Distributed Database Management Systems: Lecture 33. The main topics covered in this chapter include: data localization for hybrid fragmentation; query optimization; HyF contains both types of fragmentations; QO refers to producing a query execution plan (QEP) that represents execution strategy;...
Distributed Database Management Systems
Lecture 33
In the previous lecture
• Final phase of query decomposition (QD)
• Data Localization: for horizontal (HF), vertical (VF) and derived (DF) fragmentation.
In today's Lecture
• Data Localization for
Hybrid Fragmentation
• Query Optimization.
Reduction for HyF
• Hybrid fragmentation (HyF) contains both types of fragmentation: horizontal and vertical
• EMP1 = σ_{eNo ≤ "E4"}(Π_{eNo, eName}(EMP))
• EMP2 = σ_{eNo > "E4"}(Π_{eNo, eName}(EMP))
• EMP3 = Π_{eNo, title}(EMP).
• Select eName from EMP
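The reduction for this query can be illustrated with a minimal Python sketch. The sample tuples and the function names `generic_plan` and `reduced_plan` are assumptions made for illustration; only the fragment definitions come from the slide. The sketch shows that the join with EMP3 contributes nothing to a query that asks only for eName, so the reduced query touches EMP1 and EMP2 only.

```python
# Hypothetical sample data; the tuples are illustrative, not from the lecture.
EMP = [
    ("E1", "J. Doe", "Elect. Eng."),
    ("E4", "J. Miller", "Programmer"),
    ("E5", "B. Casey", "Syst. Anal."),
    ("E8", "J. Jones", "Syst. Anal."),
]

# Hybrid fragments as defined on the slide.
EMP1 = {(eno, ename) for eno, ename, _ in EMP if eno <= "E4"}
EMP2 = {(eno, ename) for eno, ename, _ in EMP if eno > "E4"}
EMP3 = {(eno, title) for eno, _, title in EMP}


def generic_plan():
    """Rebuild EMP as (EMP1 U EMP2) joined with EMP3 on eNo, then project eName."""
    horizontal = EMP1 | EMP2
    rebuilt = {
        (eno, ename, title)
        for eno, ename in horizontal
        for eno3, title in EMP3
        if eno == eno3
    }
    return {ename for _, ename, _ in rebuilt}


def reduced_plan():
    """EMP3 carries no eName, so the join with it is dropped."""
    return {ename for _, ename in EMP1 | EMP2}


print(sorted(generic_plan()) == sorted(reduced_plan()))
```

Both plans return the same names, but the reduced plan never accesses EMP3 and performs no join.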
Summary of what we
have done so far
• Data Localization: applies the global query to fragments; increases the optimization level
• So, next is cost-based optimization
• Components of Optimizer
• Search Space: the set of equivalent alternative execution plans
• Cost Model: predicts the cost of an execution plan
• Search Strategy: explores the search space and produces the best plan
Search Space
• Search space consists of equivalent query trees produced using transformation rules
• Optimizer concentrates on join trees, since join cost has the greatest effect on the total cost
• Example:
• Select eName, resp
From EMP, ASG, PROJ
where EMP.eNo = ASG.eNo and ASG.pNo = PROJ.pNo
• The number of alternative join trees for N relations is O(N!), based on the properties of the join (commutativity and associativity)
• So, restrictions are applied to limit the search space
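To get a feel for the O(N!) growth, the following minimal sketch enumerates join orders for the three relations of the example query and counts them for larger N. It counts only linear orders (permutations of the relations), which is exactly N!; bushy trees would add more alternatives. The function names are hypothetical.

```python
from itertools import permutations
from math import factorial


def linear_join_orders(relations):
    """All orderings of the relations, each read as a linear join sequence."""
    return list(permutations(relations))


def count_linear_join_orders(n):
    """Number of linear join orders for n relations."""
    return factorial(n)


orders = linear_join_orders(["EMP", "ASG", "PROJ"])
for order in orders:
    print(" JOIN ".join(order))

for n in (3, 5, 10):
    print(n, count_linear_join_orders(n))
```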
1- Heuristics
- Perform selection and projection on base relations first
- Avoid Cartesian products
2- Shape of Tree
- Linear tree: at least one operand of each operator node is a base relation
- Bushy tree: may have operators whose operands are all intermediate tables; allows parallel execution
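The two restrictions can be combined in a small sketch: enumerate only linear join orders for the example query and discard those that would require a Cartesian product. The join graph below (EMP joins ASG, ASG joins PROJ) is read off the example query's WHERE clause; the function name `is_connected_order` is an assumption for illustration.

```python
from itertools import permutations

# Join predicates of the example query: EMP.eNo = ASG.eNo and ASG.pNo = PROJ.pNo.
JOIN_PREDICATES = {frozenset({"EMP", "ASG"}), frozenset({"ASG", "PROJ"})}


def is_connected_order(order):
    """True if each relation added joins with a relation already in the plan."""
    joined = {order[0]}
    for rel in order[1:]:
        if not any(frozenset({rel, prev}) in JOIN_PREDICATES for prev in joined):
            return False  # adding rel here would need a Cartesian product
        joined.add(rel)
    return True


all_orders = list(permutations(["EMP", "ASG", "PROJ"]))
valid_orders = [o for o in all_orders if is_connected_order(o)]

print(len(all_orders), "linear orders,", len(valid_orders), "without a Cartesian product")
```

Orders that begin with EMP and PROJ together are pruned, since no predicate links those two relations directly.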
Search Strategy
• Most popular is Dynamic Programming (DP)
• It starts with base relations and keeps adding one relation at a time, calculating the cost of each partial plan
• DP is almost exhaustive, so it produces the best plan
• Too expensive as the number of relations grows
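A minimal sketch of the dynamic programming idea, restricted to linear (left-deep) plans, is given below. Everything numeric is an assumption for illustration: the cardinalities, the join selectivity factors, and the cost function, which simply sums the estimated sizes of the intermediate results. A real optimizer would use a richer cost model.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical statistics, for illustration only.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 100}
SELECTIVITY = {
    frozenset({"EMP", "ASG"}): Fraction(1, 400),
    frozenset({"ASG", "PROJ"}): Fraction(1, 100),
}


def result_size(relations):
    """Estimated cardinality of joining a set of relations."""
    size = Fraction(1)
    for rel in relations:
        size *= CARD[rel]
    for pair, sf in SELECTIVITY.items():
        if pair <= relations:  # apply each predicate covered by the set
            size *= sf
    return size


def best_left_deep_plan(names):
    """Return (cost, order) of the cheapest left-deep plan over all names."""
    # best[S] holds the cheapest plan found for the set of relations S.
    best = {frozenset({n}): (Fraction(0), (n,)) for n in names}
    for k in range(2, len(names) + 1):
        for subset in combinations(names, k):
            s = frozenset(subset)
            candidates = []
            for last in s:  # try every relation as the one joined last
                sub_cost, sub_order = best[s - {last}]
                candidates.append((sub_cost + result_size(s), sub_order + (last,)))
            best[s] = min(candidates)
    return best[frozenset(names)]


cost, order = best_left_deep_plan(["EMP", "ASG", "PROJ"])
print(order, cost)
```

With these assumed statistics, any plan that starts with the Cartesian product of EMP and PROJ is rejected, because its 40,000-tuple intermediate result dominates the cost. The table `best` has one entry per subset of relations, which is why the approach becomes too expensive as the number of relations grows.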
Cost Model
• Uses cost functions for operators and statistics on base data to predict the size of intermediate tables
• Cost is expressed as either Total Time or Response Time.
• Total time = CPU time + I/O time + transmission time
• In a WAN, the major cost is transmission time
• Initially the ratio of transmission time to I/O time was about 20:1 for a WAN; for a LAN it is about 1:1.6
• Response time = CPU time + I/O time + transmission time
• Difference?
• T_CPU = time for one CPU instruction
• T_I/O = time for one disk I/O
• T_MSG = fixed time for initiating and receiving a message
• T_TR = time to transmit one data unit from one site to another
• Example: Site 1 sends x units and Site 2 sends y units to Site 3
• TT = 2*T_MSG + T_TR*(x + y)
• RT = max{T_MSG + T_TR*x, T_MSG + T_TR*y}
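The difference between the two measures shows up clearly with numbers. The sketch below evaluates both formulas for the three-site example; the parameter values are assumptions chosen only to make the arithmetic easy to follow.

```python
def total_time(t_msg, t_tr, x, y):
    """Total time: both transfers are charged in full."""
    return 2 * t_msg + t_tr * (x + y)


def response_time(t_msg, t_tr, x, y):
    """Response time: the transfers run in parallel, so the slower one decides."""
    return max(t_msg + t_tr * x, t_msg + t_tr * y)


# Hypothetical values: fixed message cost 10, 1 time unit per data unit.
T_MSG, T_TR = 10, 1
x, y = 100, 50

print("TT =", total_time(T_MSG, T_TR, x, y))      # 2*10 + 1*(100 + 50) = 170
print("RT =", response_time(T_MSG, T_TR, x, y))   # max(10 + 100, 10 + 50) = 110
```

Total time adds up all the work done anywhere in the system, while response time counts only the longest chain of work, because the two transfers to Site 3 can proceed in parallel.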
Database Statistics
• Major factor is the size of intermediate tables
• If the intermediate results are to
• For each relation R[A_1, A_2, …, A_n] fragmented as R_1, …, R_r
1. the length of each attribute: length(A_i)
2. the number of distinct values for each attribute in each fragment: card(Π_{A_i}(R_j))
3. the maximum and minimum values in the domain of each attribute: min(A_i), max(A_i).
4. the cardinality of each domain: card(dom[A_i]), and the cardinality of each fragment: card(R_j)
5. the join selectivity factor for some of the relations: SF_J(R, S) = card(R ⋈ S) / (card(R) * card(S))
Cardinalities of Intermediate Results
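As a first illustration, the join selectivity factor defined in the statistics can be turned around to estimate the size of a join result: card(R ⋈ S) = SF_J(R, S) * card(R) * card(S). The sketch below is a minimal example under assumed numbers; the function names are hypothetical.

```python
import math


def join_selectivity(card_join, card_r, card_s):
    """SF_J(R, S): fraction of the Cartesian product that survives the join."""
    return card_join / (card_r * card_s)


def estimate_join_cardinality(sf_j, card_r, card_s):
    """Estimated card(R join S) from a known selectivity factor."""
    return sf_j * card_r * card_s


# Assumed observation: joining 400 and 1000 tuples produced 1000 tuples.
sf = join_selectivity(1000, 400, 1000)

# Reuse the factor after the first relation has doubled in size.
estimate = estimate_join_cardinality(sf, 800, 1000)
print(sf, estimate)
assert math.isclose(estimate, 2000)
```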
• Selection: card(σ_F(R)) = SF_S(F) * card(R)
• SF_S(A < value) = (value - min(A)) / (max(A) - min(A))
• SF_S(p(A_i) ∧ p(A_j)) = SF_S(p(A_i)) * SF_S(p(A_j))
• SF_S(p(A_i) ∨ p(A_j)) = SF_S(p(A_i)) + SF_S(p(A_j)) - (SF_S(p(A_i)) * SF_S(p(A_j)))
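The selection selectivity rules can be written as a few small functions. This is a minimal sketch assuming uniformly distributed attribute values, so that the fraction of tuples with A < value is (value - min(A)) / (max(A) - min(A)), and assuming independent predicates for the AND and OR rules. The attribute range and relation size are made-up numbers.

```python
def sf_less_than(value, min_a, max_a):
    """Selectivity of A < value under a uniform distribution of A."""
    return (value - min_a) / (max_a - min_a)


def sf_and(sf_i, sf_j):
    """Selectivity of p(A_i) AND p(A_j), assuming independent predicates."""
    return sf_i * sf_j


def sf_or(sf_i, sf_j):
    """Selectivity of p(A_i) OR p(A_j), assuming independent predicates."""
    return sf_i + sf_j - sf_i * sf_j


# Hypothetical statistics: A ranges over [0, 100], relation has 1000 tuples.
card_r = 1000
sf_a = sf_less_than(25, 0, 100)   # 0.25
sf_b = 0.5                        # assumed selectivity of a second predicate

print(sf_and(sf_a, sf_b) * card_r)  # estimated tuples satisfying both
print(sf_or(sf_a, sf_b) * card_r)   # estimated tuples satisfying either
```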
Cardinality of Projection
• Hard to determine precisely
• Two cases when it is trivial
1- When the projection is on a single attribute A:
card(Π_A(R)) = card(A)
2- When the primary key (PK) is included:
card(Π_A(R)) = card(R)
• Semi Join:
SF_SJ(R ⋉_A S) = card(Π_A(S)) / card(dom[A])
card(R ⋉_A S) = SF_SJ(S.A) * card(R).
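A short worked example of the semijoin estimate follows. The numbers are assumptions for illustration: the domain of A has 200 values, S contains 50 distinct values of A, and R has 1000 tuples.

```python
def semijoin_selectivity(distinct_a_in_s, domain_size):
    """SF_SJ: share of the domain of A that actually appears in S."""
    return distinct_a_in_s / domain_size


def semijoin_cardinality(card_r, distinct_a_in_s, domain_size):
    """Estimated number of tuples of R that survive the semijoin with S on A."""
    return semijoin_selectivity(distinct_a_in_s, domain_size) * card_r


# Hypothetical statistics.
print(semijoin_selectivity(50, 200))        # 50 / 200 = 0.25
print(semijoin_cardinality(1000, 50, 200))  # 0.25 * 1000 = 250.0
```

The estimate depends only on statistics about S.A and the size of R, which is why it is cheap to compute before any data is shipped.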
• Union: hard to estimate
• Bounds are possible: the upper bound is card(R) + card(S) and the lower bound is max{card(R), card(S)}
• Difference: like union; for (R - S) the upper bound is card(R) and the lower bound is 0
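The bounds can be checked against actual sets. The sketch below is a minimal illustration with made-up relations represented as Python sets; the function names are hypothetical.

```python
def union_bounds(card_r, card_s):
    """(lower, upper) bounds on card(R U S)."""
    return max(card_r, card_s), card_r + card_s


def difference_bounds(card_r):
    """(lower, upper) bounds on card(R - S)."""
    return 0, card_r


# Hypothetical relations with some overlap.
R = set(range(10))       # 10 tuples
S = set(range(5, 20))    # 15 tuples, 5 shared with R

low_u, high_u = union_bounds(len(R), len(S))
low_d, high_d = difference_bounds(len(R))

print(low_u, len(R | S), high_u)
print(low_d, len(R - S), high_d)
```

The upper bound on the union is reached when R and S are disjoint, and the lower bound when one relation contains the other; the same two extremes give card(R) and 0 for the difference.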
Centralized Query
Optimization