1. THE CASE FOR LRU
In this section, our objective is to determine the best scheme for managing memory, given that the underlying data conforms to the multiple-workload hierarchical reuse model. For the present, we focus on the special case θ_1 = θ_2 = ... = θ_n. In this special case, we shall discover that the scheme we are looking for is, in fact, the LRU algorithm.
As in Chapter 4, we consider the optimal use of memory to be the one that minimizes the total delay due to cache misses. We shall assume that a fixed delay D_1 = D_2 = ... = D_n = D > 0, measured in seconds, is associated with each cache miss. Also, we shall assume that all workloads share a common stage size z_1 = z_2 = ... = z_n = z > 0. We continue to assume, as in the remainder of the book, that the parameter θ lies in the range 0 < θ < 1. Finally, we shall assume that all workloads are non-trivial (that is, a non-zero I/O rate is associated with every workload). The final assumption is made without loss of generality, since clearly there is no need to allocate any cache memory to a workload for which no requests must be serviced.
We begin by observing that, for any individual workload, (1.3) implies that the probability of a data item being requested next decreases with the time since its previous request. Therefore, for any individual workload, the effect of managing that workload’s memory via the LRU mechanism is to place into cache memory exactly those data items which have the highest probabilities of being referenced next. This enormously simplifies our task, since we know how to optimally manage any given amount of memory assigned for use by workload i. We must still, however, determine the best trade-off of memory among the n workloads.
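As a purely illustrative check of this observation, the short sketch below assumes a re-reference probability that decays with the time since an item’s previous request, and compares the expected number of hits obtained by caching the most recently referenced items against a random selection of the same size. The decay law and all numerical values in the sketch are assumptions made only for the illustration.

```python
import random

# Hypothetical decay law: the probability of being requested next falls off
# with the time t since the item's previous request (illustrative constants).
def reref_probability(t, theta=0.25):
    return 0.5 * (1.0 + t) ** -(1.0 + theta)

random.seed(1)
ages = [random.uniform(0, 1000) for _ in range(200)]   # time since last request
cache_slots = 50

# Expected hits if we cache the 50 most recently referenced items (the LRU choice)
lru_pick = sorted(ages)[:cache_slots]
# Expected hits for an arbitrary (random) choice of 50 items
random_pick = random.sample(ages, cache_slots)

print(sum(map(reref_probability, lru_pick)))     # larger
print(sum(map(reref_probability, random_pick)))  # smaller
```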
The optimal allocation of memory must be the one for which the marginal benefit (reduction of delays), per unit of added cache memory, is the same for all workloads. Otherwise, we could improve performance by taking memory away from the workload with the smallest marginal benefit and giving it to the workload with the largest. At least in concept, it is not difficult to produce an allocation of memory with the same marginal benefit for all workloads, since, by the formula obtained in the immediately following paragraph, the marginal benefit for each workload is a strictly decreasing function of its memory. We need only decide on some specific marginal benefit, and add (or subtract) memory to (or from) each workload until the marginal benefit reaches the adopted level. This same conceptual experiment also shows that there is a unique optimal allocation of memory corresponding to any given marginal benefit, and, by the same token, a unique optimal allocation corresponding to any given total amount of memory.
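The conceptual experiment can also be carried out numerically. The sketch below is an illustration only: it assumes a power-law miss-ratio curve of the form m ~ (s/r)^(−θ/(1−θ)), consistent with the hierarchical reuse model, together with arbitrary parameter values, and bisects on the common marginal-benefit level until the allocations exhaust the available memory.

```python
THETA, D = 0.25, 0.005            # illustrative workload parameters (assumed values)
BETA = THETA / (1.0 - THETA)      # exponent of the assumed miss-ratio curve
rates = [100.0, 300.0, 600.0]     # hypothetical I/O rates for three workloads
TOTAL = 6000.0                    # total cache memory to divide (arbitrary units)

def miss_ratio(s, r):
    # Assumed power-law form: the miss ratio falls off as (s / r) ** -BETA
    return (s / r) ** -BETA

def marginal_benefit(s, r):
    # Rate at which the delay D * r * miss_ratio falls per unit of added memory
    return D * r * BETA * miss_ratio(s, r) / s

def memory_at_level(level, r):
    # Memory at which the (strictly decreasing) marginal benefit equals `level`
    lo, hi = 1e-6, 1e9
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if marginal_benefit(mid, r) > level else (lo, mid)
    return 0.5 * (lo + hi)

# Bisect on the common benefit level until the implied allocations use exactly TOTAL
lo, hi = 1e-12, 1e3
for _ in range(200):
    level = 0.5 * (lo + hi)
    if sum(memory_at_level(level, r) for r in rates) > TOTAL:
        lo = level        # allocations too large: require a higher marginal benefit
    else:
        hi = level

allocations = [memory_at_level(level, r) for r in rates]
print(allocations, sum(allocations))
print([marginal_benefit(s, r) for s, r in zip(allocations, rates)])  # all equal
```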
The next step, then, is to evaluate the marginal benefit of adding memory for use by any individual workload i. Using (1.23), we can write the delays due to
misses, in units of seconds of delay per second of clock time, as:

    D_i r_i m_i,    (5.1)

where m_i is the miss ratio of workload i, as given by (1.23). Therefore, the marginal reduction of delays with added memory is obtained by differentiating (5.1) with respect to the memory given to workload i, with the help of (1.21). Thus, we may conclude, by (1.12), that the marginal benefit of added memory is:

    θ_i D_i / (z_i τ_i).    (5.2)
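This result can be spot-checked numerically. The sketch below assumes, for the purpose of the check only, the relations m(τ) ∝ τ^(−θ) and s = r z τ m / (1 − θ) between the miss ratio, the single-reference residency time, and the cache memory (these specific forms, and the constants used, are assumptions made for the illustration), and confirms that the delay falls at the rate θD/(zτ) per unit of added memory.

```python
THETA, D, Z, R = 0.25, 0.005, 64.0, 200.0   # illustrative parameter values
TAU0 = 10.0                                 # assumed time-scale constant

def miss_ratio(tau):
    # Assumed form: miss ratio proportional to tau ** -THETA
    return (tau / TAU0) ** -THETA

def memory(tau):
    # Assumed relation between cache memory and single-reference residency time
    return R * Z * tau * miss_ratio(tau) / (1.0 - THETA)

def delay(tau):
    # Delay due to misses, in seconds of delay per second of clock time
    return D * R * miss_ratio(tau)

tau, dtau = 50.0, 1e-4
numeric = -(delay(tau + dtau) - delay(tau)) / (memory(tau + dtau) - memory(tau))
print(numeric, THETA * D / (Z * tau))       # the two values agree
```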
But, for the purpose of the present discussion, we are assuming that all workloads share the same, common workload parameters θ, D, and z. To achieve optimal allocation, then, we must cause all of the workloads to share, as well, a common value τ_1 = τ_2 = ... = τ_n = τ for the single-reference residency time. Only in this way can we have θ_1 D_1 / (z_1 τ_1) = θ_2 D_2 / (z_2 τ_2) = ... = θ_n D_n / (z_n τ_n) = θD / (zτ).

As we have seen, exactly this behavior is accomplished by applying global LRU management. A global LRU policy enforces LRU management of each individual workload’s memory, while also causing all of the workloads to share the same, common single-reference residency time. For the special case θ_1 = θ_2 = ... = θ_n, LRU management of cache memory is therefore optimal.
In the assumptions stated at the beginning of the section, we excluded those cases, such as a complete lack of I/O, in which any allocation of memory is as good as any other. Thus, we can also state the conclusion just presented as follows: a memory partitioned by workload can perform as well as the same memory managed globally only if the sizes of the partitions match the allocations produced via global LRU management.
Our ability to gain insight into the impact of subdivided cache memory is of some practical importance, since capacity planners must often examine the possibility of dividing a workload among multiple storage subsystems. In many cases there are compelling reasons for dividing a workload; for example, multiple subsystems may be needed to meet the total demand for storage, cache, and/or I/O throughput. But we have just seen that if such a strategy is implemented with no increase in total cache memory, compared with that provided with a single subsystem, then it may, as a side effect, cause some increase in the I/O delays due to cache misses. By extending the analysis developed so far, it is possible to develop a simple estimate of this impact, at least in the interesting special case in which a single workload is divided among n_p equal cache memories, and the I/O rate does not vary too much between partitions.
We begin by using (5.1) as a starting point. However, we now specialize our previous notation. A single workload, with locality characteristics described by the parameters b, θ, z, and D, is divided among n_p equal cache memories, each of size s_p = s/n_p. We shall assume that each partition i = 1, 2, ..., n_p has a corresponding I/O rate r_i (that is, different partitions of the workload are assumed to vary only in their I/O rates, but not in their cache locality characteristics). These changes in notation result in the following, specialized version of (5.1):

    D r_i m_i,    (5.3)

where m_i is the miss ratio of partition i, determined by (1.23) from the partition’s memory s_p and I/O rate r_i.
Our game plan will be to compare the total delays implied by (5.3) with the delays occurring in a global cache with the same total amount of memory s = n_p s_p. For the global cache, with I/O rate r, the miss ratio m is given by (1.23), applied with the memory s = n_p s_p and the I/O rate r = n_p r̄, where r̄ = r/n_p is the average I/O rate per partition. Therefore, we can express the corresponding total delays due to misses, for the global cache, as:

    D r m.    (5.4)
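Before carrying out the algebra, the comparison can be previewed numerically. The sketch below uses the same assumed power-law miss-ratio form as the earlier sketches, with arbitrary parameter values, to evaluate the global-cache delay of (5.4) and the sum of the partition delays of (5.3) for a 40/60 split of the I/O rate.

```python
D, THETA = 0.005, 0.25
BETA = THETA / (1.0 - THETA)
rates = [40.0, 60.0]              # hypothetical partition I/O rates (per second)
s_p = 2000.0                      # cache memory per partition (arbitrary units)
n_p = len(rates)

def miss_ratio(s, r):
    # Assumed power-law form consistent with the hierarchical reuse model
    return (s / r) ** -BETA

# Global cache, as in (5.4): all of the memory and all of the I/O together
r = sum(rates)
global_delay = D * r * miss_ratio(n_p * s_p, r)

# Partitioned cache, as in (5.3): each partition sees only its own I/O rate
partition_delay = sum(D * ri * miss_ratio(s_p, ri) for ri in rates)

# The relative difference comes out near 0.9 percent for this 40/60 split
print(global_delay, partition_delay, partition_delay / global_delay - 1.0)
```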
Turning again to the individual partitions, it is helpful to use the average partition I/O rate r̄ as a point of reference. Thus, we normalize the individual partition I/O rates relative to r̄:
    m_i = m (1 + δ_i)^(θ/(1−θ)),    (5.5)
where δ_i = (r_i − r̄)/r̄.
Our next step is to manipulate the right side of (5.5) by applying a binomial expansion. This technique places limits on the variations in partition I/O rates that we are able to take into account. At a minimum we must have |δ_i| < 1 for i = 1, 2, ..., n_p in order for the binomial expansion to be valid; for mathematical convenience, we shall also assume that the inequality is a strong one.

Provided, then, that the partition I/O rates do not vary by too much from their average value, we may apply the binomial theorem to obtain

    m_i ≈ m [ 1 + (θ/(1−θ)) δ_i + (θ(2θ−1)/(2(1−θ)²)) δ_i² ].
Using this expression to substitute into (5.3), the I/O delays due to misses in partition i are therefore given by:

    D r_i m_i ≈ D r̄ m (1 + δ_i) [ 1 + (θ/(1−θ)) δ_i + (θ(2θ−1)/(2(1−θ)²)) δ_i² ]
             ≈ (D r m / n_p) [ 1 + (1/(1−θ)) δ_i + (θ/(2(1−θ)²)) δ_i² ],

where we have used (5.4) to obtain the second expression.
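A quick numerical check of the expansion (illustrative only, with θ = 0.25): the exact factor (1 + δ_i)^(1/(1−θ)) appearing in the partition delay is compared with its second-order approximation for several values of δ_i.

```python
THETA = 0.25
A = 1.0 / (1.0 - THETA)       # exponent of (1 + delta) in the partition delay

def exact(delta):
    return (1.0 + delta) ** A

def second_order(delta):
    # 1 + A*delta + A*(A - 1)/2 * delta**2, where A*(A - 1)/2 = THETA/(2*(1 - THETA)**2)
    return 1.0 + A * delta + THETA / (2.0 * (1.0 - THETA) ** 2) * delta ** 2

for delta in (-0.2, -0.1, 0.1, 0.2):
    print(delta, exact(delta), second_order(delta))
```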
Taking the sum of these individual partition delays, we obtain a total of:

    Σ_i D r_i m_i ≈ D r m [ 1 + (1/(1−θ)) (1/n_p) Σ_i δ_i + (θ/(2(1−θ)²)) (1/n_p) Σ_i δ_i² ].
But it is easily shown from the definition of the quantities δ_i that

    Σ_i δ_i = 0

and

    Σ_i δ_i² = (n_p − 1) Var[r_i] / r̄²,

where Var[.] refers to the sample variance across partitions; that is,

    Var[r_i] = (1/(n_p − 1)) Σ_i (r_i − r̄)².
Since the term involving the sample variance is always non-negative, the total delay can never be less than Drm (the total delay of the global cache). If
we now let

    m̄ = (1/r) Σ_i r_i m_i

be the weighted average miss ratio of the partitioned cache, weighted by I/O rate, then we can restate our conclusion in terms of the average delay per I/O:

    D m̄ ≈ D m (1 + Δ),    (5.6)

where the relative “penalty” due to partitioning, Δ, is given by:

    Δ = [θ / (2(1−θ)²)] × [(n_p − 1) / n_p] × Var[r_i] / r̄².
In applying (5.6), it should be noted that the value of Δ is not affected if all the I/O rates are scaled using a multiplicative constant. Thus, we may choose to express the partition I/O rates as events per second, as fractions of the total load, or even as fractions of the largest load among the n_p partitions.
A “rule of thumb” that is sometimes suggested is that, on average, two storage subsystems tend to divide the total I/O rate that they share in a ratio of 60 percent on one controller, 40 percent on the other. This guesstimate provides an interesting illustration of (5.6).
Suppose that both subsystems, in the rule of thumb, have the same amount of cache memory and the same workload characteristics. Let us apply (5.6) to assess the potential improvement in cache performance that might come from consolidating them into a single subsystem with double the amount of cache memory possessed by either separately. Since we do not know the actual I/O rates, and recalling that we may work in terms of fractions of the total load, we proceed by setting r_1 and r_2 to values of 0.4 and 0.6 respectively. The sample variance of these two quantities is (0.1² + 0.1²)/(2 − 1) = 0.02. Assuming θ = 0.25, we thus obtain

    Δ ≈ (1/2) × (1/2) × (0.25/0.75²) × (0.02/0.5²) ≈ 0.009.
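The same figure can be obtained by restating the calculation in a few lines of code, using the penalty expression given with (5.6):

```python
theta, n_p = 0.25, 2
rates = [0.4, 0.6]                 # partition I/O rates as fractions of the total load
r_bar = sum(rates) / n_p
var = sum((r - r_bar) ** 2 for r in rates) / (n_p - 1)   # sample variance = 0.02

penalty = theta / (2 * (1 - theta) ** 2) * (n_p - 1) / n_p * var / r_bar ** 2
print(penalty)                     # approximately 0.009
```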
Based upon the calculation just presented, we conclude that the improvement in cache performance from consolidating the two controllers would be very slight (the delay per I/O due to cache misses would be reduced by less than one percent). From a practical standpoint, this means that the decision on whether to pursue consolidation should be based on other considerations, not dealt with in the present analysis. Such considerations would include, for example, the cost of the combined controller, and its ability to deliver the needed storage and I/O throughput.