1. THE CASE FOR LRU
In this section, our objective is to determine the best scheme for managing memory, given that the underlying data conforms to the multiple-workload hierarchical reuse model. For the present, we focus on the special case θ_1 = θ_2 = ... = θ_n. In this special case, we shall discover that the scheme we are looking for is, in fact, the LRU algorithm.
As in Chapter 4, we consider the optimal use of memory to be the one that minimizes the total delay due to cache misses. We shall assume that a fixed delay D_1 = D_2 = ... = D_n = D > 0, measured in seconds, is associated with each cache miss. Also, we shall assume that all workloads share a common stage size z_1 = z_2 = ... = z_n = z > 0. We continue to assume, as in the remainder of the book, that the parameter θ lies in the range 0 < θ < 1. Finally, we shall assume that all workloads are non-trivial (that is, a non-zero I/O rate is associated with every workload). The final assumption is made without loss of generality, since clearly there is no need to allocate any cache memory to a workload for which no requests must be serviced.
We begin by observing that, for any individual workload, (1.3) implies that the probability of a data item being requested next decreases with the time since its previous request. Therefore, for any individual workload, the effect of managing that workload’s memory via the LRU mechanism is to place into cache memory exactly those data items which have the highest probabilities of being referenced next. This enormously simplifies our task, since we know how to optimally manage any given amount of memory assigned for use by workload i. We must still, however, determine the best trade-off of memory among the n workloads.
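As a purely illustrative check of this observation, the short sketch below assumes a re-reference probability that decays with the time since an item’s previous request, and compares the expected number of hits obtained by caching the most recently referenced items against a random selection of the same size. The decay law and all numerical values in the sketch are assumptions made only for the illustration.

```python
import random

# Hypothetical decay law: the probability of being requested next falls off
# with the time t since the item's previous request (illustrative constants).
def reref_probability(t, theta=0.25):
    return 0.5 * (1.0 + t) ** -(1.0 + theta)

random.seed(1)
ages = [random.uniform(0, 1000) for _ in range(200)]   # time since last request
cache_slots = 50

# Expected hits if we cache the 50 most recently referenced items (the LRU choice)
lru_pick = sorted(ages)[:cache_slots]
# Expected hits for an arbitrary (random) choice of 50 items
random_pick = random.sample(ages, cache_slots)

print(sum(map(reref_probability, lru_pick)))     # larger
print(sum(map(reref_probability, random_pick)))  # smaller
```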
The optimal allocation of memory must be the one for which the marginal benefit (reduction of delays), per unit of added cache memory, is the same for all workloads. Otherwise, we could improve performance by taking memory away from the workload with the smallest marginal benefit and giving it to the workload with the largest. At least in concept, it is not difficult to produce an allocation of memory with the same marginal benefit for all workloads, since, by the formula obtained in the immediately following paragraph, the marginal benefit for each workload is a strictly decreasing function of its memory. We need only decide on some specific marginal benefit, and add (or subtract) memory to (or from) each workload until the marginal benefit reaches the adopted level. This same conceptual experiment also shows that there is a unique optimal allocation of memory corresponding to any given marginal benefit, and, by the same token, a unique optimal allocation corresponding to any given total amount of memory.
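The conceptual experiment can also be carried out numerically. The sketch below is an illustration only: it assumes a power-law miss-ratio curve of the form m ~ (s/r)^(−θ/(1−θ)), consistent with the hierarchical reuse model, together with arbitrary parameter values, and bisects on the common marginal-benefit level until the allocations exhaust the available memory.

```python
THETA, D = 0.25, 0.005            # illustrative workload parameters (assumed values)
BETA = THETA / (1.0 - THETA)      # exponent of the assumed miss-ratio curve
rates = [100.0, 300.0, 600.0]     # hypothetical I/O rates for three workloads
TOTAL = 6000.0                    # total cache memory to divide (arbitrary units)

def miss_ratio(s, r):
    # Assumed power-law form: the miss ratio falls off as (s / r) ** -BETA
    return (s / r) ** -BETA

def marginal_benefit(s, r):
    # Rate at which the delay D * r * miss_ratio falls per unit of added memory
    return D * r * BETA * miss_ratio(s, r) / s

def memory_at_level(level, r):
    # Memory at which the (strictly decreasing) marginal benefit equals `level`
    lo, hi = 1e-6, 1e9
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if marginal_benefit(mid, r) > level else (lo, mid)
    return 0.5 * (lo + hi)

# Bisect on the common benefit level until the implied allocations use exactly TOTAL
lo, hi = 1e-12, 1e3
for _ in range(200):
    level = 0.5 * (lo + hi)
    if sum(memory_at_level(level, r) for r in rates) > TOTAL:
        lo = level        # allocations too large: require a higher marginal benefit
    else:
        hi = level

allocations = [memory_at_level(level, r) for r in rates]
print(allocations, sum(allocations))
print([marginal_benefit(s, r) for s, r in zip(allocations, rates)])  # all equal
```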
The next step, then, is to evaluate the marginal benefit of adding memory for use by any individual workload i. Using (1.23), we can write the delays due to
misses, in units of seconds of delay per second of clock time, as:

    D_i r_i m_i,    (5.1)

where m_i is the miss ratio of workload i, as given by (1.23). Therefore, the marginal reduction of delays with added memory is obtained by differentiating (5.1) with respect to the memory given to workload i, with the help of (1.21). Thus, we may conclude, by (1.12), that the marginal benefit of added memory is:

    θ_i D_i / (z_i τ_i).    (5.2)
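This result can be spot-checked numerically. The sketch below assumes, for the purpose of the check only, the relations m(τ) ∝ τ^(−θ) and s = r z τ m / (1 − θ) between the miss ratio, the single-reference residency time, and the cache memory (these specific forms, and the constants used, are assumptions made for the illustration), and confirms that the delay falls at the rate θD/(zτ) per unit of added memory.

```python
THETA, D, Z, R = 0.25, 0.005, 64.0, 200.0   # illustrative parameter values
TAU0 = 10.0                                 # assumed time-scale constant

def miss_ratio(tau):
    # Assumed form: miss ratio proportional to tau ** -THETA
    return (tau / TAU0) ** -THETA

def memory(tau):
    # Assumed relation between cache memory and single-reference residency time
    return R * Z * tau * miss_ratio(tau) / (1.0 - THETA)

def delay(tau):
    # Delay due to misses, in seconds of delay per second of clock time
    return D * R * miss_ratio(tau)

tau, dtau = 50.0, 1e-4
numeric = -(delay(tau + dtau) - delay(tau)) / (memory(tau + dtau) - memory(tau))
print(numeric, THETA * D / (Z * tau))       # the two values agree
```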
But, for the purpose of the present discussion, we are assuming that all workloads share the same, common workload parameters θ, D, and z. To achieve optimal allocation, then, we must cause all of the workloads to share, as well, a common value τ_1 = τ_2 = ... = τ_n = τ for the single-reference residency time. Only in this way can we have θ_1 D_1 / (z_1 τ_1) = θ_2 D_2 / (z_2 τ_2) = ... = θ_n D_n / (z_n τ_n) = θD / (zτ).

As we have seen, exactly this behavior is accomplished by applying global LRU management. A global LRU policy enforces LRU management of each individual workload’s memory, while also causing all of the workloads to share the same, common single-reference residency time. For the special case θ_1 = θ_2 = ... = θ_n, LRU management of cache memory is therefore optimal.
In the assumptions stated at the beginning of the section, we excluded those cases, such as a complete lack of I/O, in which any allocation of memory is as good as any other. Thus, we can also state the conclusion just presented as follows: a memory partitioned by workload can perform as well as the same memory managed globally only if the sizes of the partitions match the allocations produced via global LRU management.
Our ability to gain insight into the impact of subdivided cache memory is of some practical importance, since capacity planners must often examine the possibility of dividing a workload among multiple storage subsystems. In many cases there are compelling reasons for dividing a workload; for example, multiple subsystems may be needed to meet the total demand for storage, cache, and/or I/O throughput. But we have just seen that if such a strategy is implemented with no increase in total cache memory, compared with that provided with a single subsystem, then it may, as a side effect, cause some increase in the I/O delays due to cache misses. By extending the analysis developed so far, it is possible to develop a simple estimate of this impact, at least in the interesting special case in which a single workload is divided among n_p equal cache memories, and the I/O rate does not vary too much between partitions.
We begin by using (5.1) as a starting point. However, we now specialize our previous notation. A single workload, with locality characteristics described by the parameters b, θ, z, and D, is divided among n_p equal cache memories, each of size s_p = s/n_p. We shall assume that each partition i = 1, 2, ..., n_p has a corresponding I/O rate r_i (that is, different partitions of the workload are assumed to vary only in their I/O rates, but not in their cache locality characteristics). These changes in notation result in the following, specialized version of (5.1):

    D r_i m_i,    (5.3)

where m_i is the miss ratio of partition i, determined by (1.23) from the partition’s memory s_p and I/O rate r_i.
Our game plan will be to compare the total delays implied by (5.3) with the delays occurring in a global cache with the same total amount of memory s = n_p s_p. For the global cache, with I/O rate r, the miss ratio m is given by (1.23), applied with the memory s = n_p s_p and the I/O rate r = n_p r̄, where r̄ = r/n_p is the average I/O rate per partition. Therefore, we can express the corresponding total delays due to misses, for the global cache, as:

    D r m.    (5.4)
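Before carrying out the algebra, the comparison can be previewed numerically. The sketch below uses the same assumed power-law miss-ratio form as the earlier sketches, with arbitrary parameter values, to evaluate the global-cache delay of (5.4) and the sum of the partition delays of (5.3) for a 40/60 split of the I/O rate.

```python
D, THETA = 0.005, 0.25
BETA = THETA / (1.0 - THETA)
rates = [40.0, 60.0]              # hypothetical partition I/O rates (per second)
s_p = 2000.0                      # cache memory per partition (arbitrary units)
n_p = len(rates)

def miss_ratio(s, r):
    # Assumed power-law form consistent with the hierarchical reuse model
    return (s / r) ** -BETA

# Global cache, as in (5.4): all of the memory and all of the I/O together
r = sum(rates)
global_delay = D * r * miss_ratio(n_p * s_p, r)

# Partitioned cache, as in (5.3): each partition sees only its own I/O rate
partition_delay = sum(D * ri * miss_ratio(s_p, ri) for ri in rates)

# The relative difference comes out near 0.9 percent for this 40/60 split
print(global_delay, partition_delay, partition_delay / global_delay - 1.0)
```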
Turning again to the individual partitions, it is helpful to use the average partition I/O rate r̄ as a point of reference. Thus, we normalize the individual partition I/O rates relative to r̄:
    m_i = m (1 + δ_i)^(θ/(1−θ)),    (5.5)
where δ_i = (r_i − r̄)/r̄.
Our next step is to manipulate the right side of (5.5) by applying a binomial expansion. This technique places limits on the variations in partition I/O rates that we are able to take into account. At a minimum we must have |δ_i| < 1 for i = 1, 2, ..., n_p in order for the binomial expansion to be valid; for mathematical convenience, we shall also assume that the inequality is a strong one.

Provided, then, that the partition I/O rates do not vary by too much from their average value, we may apply the binomial theorem to obtain

    m_i ≈ m [ 1 + (θ/(1−θ)) δ_i + (θ(2θ−1)/(2(1−θ)²)) δ_i² ].
Using this expression to substitute into (5.3), the I/O delays due to misses in partition i are therefore given by:

    D r_i m_i ≈ D r̄ m (1 + δ_i) [ 1 + (θ/(1−θ)) δ_i + (θ(2θ−1)/(2(1−θ)²)) δ_i² ]
             ≈ (D r m / n_p) [ 1 + (1/(1−θ)) δ_i + (θ/(2(1−θ)²)) δ_i² ],

where we have used (5.4) to obtain the second expression.
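A quick numerical check of the expansion (illustrative only, with θ = 0.25): the exact factor (1 + δ_i)^(1/(1−θ)) appearing in the partition delay is compared with its second-order approximation for several values of δ_i.

```python
THETA = 0.25
A = 1.0 / (1.0 - THETA)       # exponent of (1 + delta) in the partition delay

def exact(delta):
    return (1.0 + delta) ** A

def second_order(delta):
    # 1 + A*delta + A*(A - 1)/2 * delta**2, where A*(A - 1)/2 = THETA/(2*(1 - THETA)**2)
    return 1.0 + A * delta + THETA / (2.0 * (1.0 - THETA) ** 2) * delta ** 2

for delta in (-0.2, -0.1, 0.1, 0.2):
    print(delta, exact(delta), second_order(delta))
```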
Taking the sum of these individual partition delays, we obtain a total of:

    Σ_i D r_i m_i ≈ D r m [ 1 + (1/(1−θ)) (1/n_p) Σ_i δ_i + (θ/(2(1−θ)²)) (1/n_p) Σ_i δ_i² ].
But it is easily shown from the definition of the quantities δ_i that

    Σ_i δ_i = 0

and

    Σ_i δ_i² = (n_p − 1) Var[r_i] / r̄²,

where Var[.] refers to the sample variance across partitions; that is,

    Var[r_i] = (1/(n_p − 1)) Σ_i (r_i − r̄)².
Since the term involving the sample variance is always non-negative, the total delay can never be less than Drm (the total delay of the global cache). If
we now let

    m̄ = (1/r) Σ_i r_i m_i

be the weighted average miss ratio of the partitioned cache, weighted by I/O rate, then we can restate our conclusion in terms of the average delay per I/O:

    D m̄ ≈ D m (1 + Δ),    (5.6)

where the relative “penalty” due to partitioning, Δ, is given by:

    Δ = [θ / (2(1−θ)²)] × [(n_p − 1) / n_p] × Var[r_i] / r̄².
In applying (5.6), it should be noted that the value of Δ is not affected if all the I/O rates are scaled using a multiplicative constant. Thus, we may choose to express the partition I/O rates as events per second, as fractions of the total load, or even as fractions of the largest load among the n_p partitions.
A “rule of thumb” that is sometimes suggested is that, on average, two storage subsystems tend to divide the total I/O rate that they share in a ratio of 60 percent on one controller, 40 percent on the other. This guesstimate provides an interesting illustration of (5.6).
Suppose that both subsystems, in the rule of thumb, have the same amount of cache memory and the same workload characteristics. Let us apply (5.6) to assess the potential improvement in cache performance that might come from consolidating them into a single subsystem with double the amount of cache memory possessed by either separately. Since we do not know the actual I/O rates, and recalling that we may work in terms of fractions of the total load, we proceed by setting r_1 and r_2 to values of 0.4 and 0.6 respectively. The sample variance of these two quantities is (0.1² + 0.1²)/(2 − 1) = 0.02. Assuming θ = 0.25, we thus obtain

    Δ ≈ (1/2) × (1/2) × (0.25/0.75²) × (0.02/0.5²) ≈ 0.009.
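The same figure can be obtained by restating the calculation in a few lines of code, using the penalty expression given with (5.6):

```python
theta, n_p = 0.25, 2
rates = [0.4, 0.6]                 # partition I/O rates as fractions of the total load
r_bar = sum(rates) / n_p
var = sum((r - r_bar) ** 2 for r in rates) / (n_p - 1)   # sample variance = 0.02

penalty = theta / (2 * (1 - theta) ** 2) * (n_p - 1) / n_p * var / r_bar ** 2
print(penalty)                     # approximately 0.009
```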
Based upon the calculation just presented, we conclude that the improvement in cache performance from consolidating the two controllers would be very slight (the delay per I/O due to cache misses would be reduced by less than one percent). From a practical standpoint, this means that the decision on whether to pursue consolidation should be based on other considerations, not dealt with in the present analysis. Such considerations would include, for example, the cost of the combined controller, and its ability to deliver the needed storage and I/O throughput.