

years Nicolau and Fisher [1984] published a paper based on their work with trace scheduling and asserted the presence of large amounts of potential ILP in scientific programs.

Since then there have been many studies of the available ILP. Such studies have been criticized since they presume some level of both hardware support and compiler technology. Nonetheless, the studies are useful to set expectations as well as to understand the sources of the limitations. Wall has participated in several such studies, including Jouppi and Wall [1989], Wall [1991], and Wall [1993]. While the early studies were criticized as being conservative (e.g., they didn't include speculation), the latest study is by far the most ambitious study of ILP to date and the basis for the data in section 4.8. Sohi and Vajapeyam [1989] give measurements of available parallelism for wide-instruction-word processors. Smith, Johnson, and Horowitz [1989] also used a speculative superscalar processor to study ILP limits. At the time of their study, they anticipated that the processor they specified was an upper bound on reasonable designs. Recent and upcoming processors, however, are likely to be at least as ambitious as their processor. Most recently, Lam and Wilson [1992] have looked at the limitations imposed by speculation and shown that additional gains are possible by allowing processors to speculate in multiple directions, which requires more than one PC. Such ideas represent one possible alternative for future processor architectures, since they represent a hybrid organization between a conventional uniprocessor and a conventional multiprocessor.

Recent Advanced Microprocessors

The years 1994–95 saw the announcement of a wide superscalar processor (3 or more issues per clock) by every major processor vendor: Intel P6, AMD K5, Sun UltraSPARC, Alpha 21164, MIPS R10000, PowerPC 604/620, and HP 8000. In 1995, the trade-offs between processors with more dynamic issue and speculation and those with more static issue and higher clock rates remain unclear. In practice, many factors, including the implementation technology, the memory hierarchy, the skill of the designers, and the type of applications benchmarked, all play a role in determining which approach is best. Figure 4.60 shows some of the most interesting recent processors, their characteristics, and suggested references. What is clear is that some level of multiple issue is here to stay and will be included in all processors in the foreseeable future.


[Figure 4.60 table: for each processor it lists the year shipped in systems, the initial clock rate (MHz), the issue structure, the scheduling discipline, the issue capabilities, and SPEC ratings (measured or estimated); the rows cover the processors named above, from the Alpha 21064 and SuperSPARC through the Pentium, Alpha 21164, UltraSPARC, PowerPC 620, and MIPS R10000.]

FIGURE 4.60 Recent high-performance processors and their characteristics and suggested references. For the last seven systems (starting with the UltraSPARC), the SPEC numbers are estimates, since no system has yet shipped. Issue structure refers to whether the hardware (dynamic) or compiler (static) is responsible for arranging instructions into issue packets; scheduling similarly describes whether the hardware dynamically schedules instructions or not. To read more about these processors the following references are useful: IBM Journal of Research and Development (contains issues on Power and PowerPC designs), the Digital Technical Journal (contains issues on various Alpha processors), and Proceedings of the Hot Chips Symposium (annual meeting at Stanford, which reviews the newest microprocessors).


References

Agerwala, T. and J. Cocke [1987]. "High performance reduced instruction set processors," IBM Tech. Rep. (March).
Anderson, D. W., F. J. Sparacio, and R. M. Tomasulo [1967]. "The IBM 360 Model 91: Processor philosophy and instruction handling," IBM J. Research and Development 11:1 (January), 8–24.
Bakoglu, H. B., G. F. Grohoski, L. E. Thatcher, J. A. Kahle, C. R. Moore, D. P. Tuttle, W. E. Maule, W. R. Hardell, D. A. Hicks, M. Nguyen Phu, R. K. Montoye, W. T. Glover, and S. Dhawan [1989]. "IBM second-generation RISC processor organization," Proc. Int'l Conf. on Computer Design, IEEE (October), Rye, N.Y., 138–142.
Charlesworth, A. E. [1981]. "An approach to scientific array processing: The architecture design of the AP-120B/FPS-164 family," Computer 14:9 (September), 18–27.
Colwell, R. P., R. P. Nix, J. J. O'Donnell, D. B. Papworth, and P. K. Rodman [1987]. "A VLIW architecture for a trace scheduling compiler," Proc. Second Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (March), Palo Alto, Calif., 180–192.
Dehnert, J. C., P. Y.-T. Hsu, and J. P. Bratt [1989]. "Overlapped loop support on the Cydra 5," Proc. Third Conf. on Architectural Support for Programming Languages and Operating Systems (April), IEEE/ACM, Boston, 26–39.
Diep, T. A., C. Nelson, and J. P. Shen [1995]. "Performance evaluation of the PowerPC 620 microarchitecture," Proc. 22nd Symposium on Computer Architecture (June), Santa Margherita, Italy.
Ditzel, D. R. and H. R. McLellan [1987]. "Branch folding in the CRISP microprocessor: Reducing the branch delay to zero," Proc. 14th Symposium on Computer Architecture (June), Pittsburgh, 2–7.
Ellis, J. R. [1986]. Bulldog: A Compiler for VLIW Architectures, MIT Press, Cambridge, Mass.
Fisher, J. A. [1981]. "Trace scheduling: A technique for global microcode compaction," IEEE Trans. on Computers 30:7 (July), 478–490.
Fisher, J. A. [1983]. "Very long instruction word architectures and ELI-512," Proc. Tenth Symposium on Computer Architecture (June), Stockholm, 140–150.
Fisher, J. A., J. R. Ellis, J. C. Ruttenberg, and A. Nicolau [1984]. "Parallel processing: A smart compiler and a dumb processor," Proc. SIGPLAN Conf. on Compiler Construction (June), Palo Alto, Calif., 11–16.
Fisher, J. A. and S. M. Freudenberger [1992]. "Predicting conditional branches from previous runs of a program," Proc. Fifth Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (October), Boston, 85–95.
Fisher, J. A. and B. R. Rau [1993]. Journal of Supercomputing (January), Kluwer.
Foster, C. C. and E. M. Riseman [1972]. "Percolation of code to enhance parallel dispatching and execution," IEEE Trans. on Computers C-21:12 (December), 1411–1415.
Hsu, P. Y.-T. [1994]. "Designing the TFP microprocessor," IEEE Micro 14:2, 23–33.
Hwu, W.-M. and Y. Patt [1986]. "HPSm, a high performance restricted data flow architecture having minimum functionality," Proc. 13th Symposium on Computer Architecture (June), Tokyo, 297–307.
IBM [1990]. "The IBM RISC System/6000 processor," collection of papers, IBM J. Research and Development 34:1 (January), 119 pages.
Johnson, M. [1990]. Superscalar Microprocessor Design, Prentice Hall, Englewood Cliffs, N.J.
Jouppi, N. P. and D. W. Wall [1989]. "Available instruction-level parallelism for superscalar and superpipelined processors," Proc. Third Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (April), Boston, 272–282.
Lam, M. [1988]. "Software pipelining: An effective scheduling technique for VLIW processors," SIGPLAN Conf. on Programming Language Design and Implementation, ACM (June), Atlanta, Ga., 318–328.
Lam, M. S. and R. P. Wilson [1992]. "Limits of control flow on parallelism," Proc. 19th Symposium on Computer Architecture (May), Gold Coast, Australia, 46–57.
Mahlke, S. A., W. Y. Chen, W.-M. Hwu, B. R. Rau, and M. S. Schlansker [1992]. "Sentinel scheduling for VLIW and superscalar processors," Proc. Fifth Conf. on Architectural Support for Programming Languages and Operating Systems (October), Boston, IEEE/ACM, 238–247.
McFarling, S. [1993]. "Combining branch predictors," WRL Technical Note TN-36 (June), Digital Western Research Laboratory, Palo Alto, Calif.
McFarling, S. and J. Hennessy [1986]. "Reducing the cost of branches," Proc. 13th Symposium on Computer Architecture (June), Tokyo, 396–403.
Nicolau, A. and J. A. Fisher [1984]. "Measuring the parallelism available for very long instruction word architectures," IEEE Trans. on Computers C-33:11 (November), 968–976.
Pan, S.-T., K. So, and J. T. Rameh [1992]. "Improving the accuracy of dynamic branch prediction using branch correlation," Proc. Fifth Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (October), Boston, 76–84.
Rau, B. R., C. D. Glaeser, and R. L. Picard [1982]. "Efficient code generation for horizontal architectures: Compiler techniques and architectural support," Proc. Ninth Symposium on Computer Architecture (April), 131–139.
Rau, B. R., D. W. L. Yen, W. Yen, and R. A. Towle [1989]. "The Cydra 5 departmental supercomputer: Design philosophies, decisions, and trade-offs," IEEE Computer 22:1 (January), 12–34.
Riseman, E. M. and C. C. Foster [1972]. "The inhibition of potential parallelism by conditional jumps," IEEE Trans. on Computers C-21:12 (December), 1405–1411.
Smith, A. and J. Lee [1984]. "Branch prediction strategies and branch-target buffer design," Computer 17:1 (January), 6–22.
Smith, J. E. [1981]. "A study of branch prediction strategies," Proc. Eighth Symposium on Computer Architecture (May), Minneapolis, 135–148.
Smith, J. E. [1984]. "Decoupled access/execute computer architectures," ACM Trans. on Computer Systems 2:4 (November), 289–308.
Smith, J. E. [1989]. "Dynamic instruction scheduling and the Astronautics ZS-1," Computer 22:7 (July), 21–35.
Smith, J. E. and A. R. Pleszkun [1988]. "Implementing precise interrupts in pipelined processors," IEEE Trans. on Computers 37:5 (May), 562–573. This paper is based on an earlier paper that appeared in Proc. 12th Symposium on Computer Architecture, June 1985.
Smith, J. E., G. E. Dermer, B. D. Vanderwarn, S. D. Klinger, C. M. Rozewski, D. L. Fowler, K. R. Scidmore, and J. P. Laudon [1987]. "The ZS-1 central processor," Proc. Second Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (March), Palo Alto, Calif., 199–204.
Smith, M. D., M. Horowitz, and M. S. Lam [1992]. "Efficient superscalar performance through boosting," Proc. Fifth Conf. on Architectural Support for Programming Languages and Operating Systems (October), Boston, IEEE/ACM, 248–259.
Smith, M. D., M. Johnson, and M. A. Horowitz [1989]. "Limits on multiple instruction issue," Proc. Third Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (April), Boston, 290–302.
Sohi, G. S. [1990]. "Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers," IEEE Trans. on Computers 39:3 (March), 349–359.
Sohi, G. S. and S. Vajapeyam [1989]. "Tradeoffs in instruction format design for horizontal architectures," Proc. Third Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (April), Boston, 15–25.
Thorlin, J. F. [1967]. "Code generation for PIE (parallel instruction execution) computers," Proc. Spring Joint Computer Conf. 27.
Thornton, J. E. [1964]. "Parallel operation in the Control Data 6600," Proc. AFIPS Fall Joint Computer Conf., Part II, 26, 33–40.
Thornton, J. E. [1970]. Design of a Computer, the Control Data 6600, Scott, Foresman, Glenview, Ill.
Tjaden, G. S. and M. J. Flynn [1970]. "Detection and parallel execution of independent instructions," IEEE Trans. on Computers C-19:10 (October), 889–895.
Tomasulo, R. M. [1967]. "An efficient algorithm for exploiting multiple arithmetic units," IBM J. Research and Development 11:1 (January), 25–33.
Wall, D. W. [1991]. "Limits of instruction-level parallelism," Proc. Fourth Conf. on Architectural Support for Programming Languages and Operating Systems (April), Santa Clara, Calif., IEEE/ACM, 248–259.
Wall, D. W. [1993]. Limits of Instruction-Level Parallelism, Research Rep. 93/6, Western Research Laboratory, Digital Equipment Corp. (November).
Weiss, S. and J. E. Smith [1984]. "Instruction issue logic for pipelined supercomputers," Proc. 11th Symposium on Computer Architecture (June), Ann Arbor, Mich., 110–118.
Weiss, S. and J. E. Smith [1987]. "A study of scalar compilation techniques for pipelined supercomputers," Proc. Second Conf. on Architectural Support for Programming Languages and Operating Systems (March), IEEE/ACM, Palo Alto, Calif., 105–109.
Weiss, S. and J. E. Smith [1994]. Power and PowerPC, Morgan Kaufmann, San Francisco.
Yeh, T. and Y. N. Patt [1992]. "Alternative implementations of two-level adaptive branch prediction," Proc. 19th Symposium on Computer Architecture (May), Gold Coast, Australia, 124–134.
Yeh, T. and Y. N. Patt [1993]. "A comparison of dynamic branch predictors that use two levels of branch history," Proc. 20th Symposium on Computer Architecture (May), San Diego, 257–266.

EXERCISES

4.1 [15] <4.1> List all the dependences (output, anti, and true) in the following code fragment. Indicate whether the true dependences are loop-carried or not. Show why the loop is not parallel.

for (i=2;i<100;i=i+1) {

a[i] = b[i] + a[i]; /* S1 */

c[i-1] = a[i] + d[i]; /* S2 */

a[i] = b[i] + c[i]; /* S1 */

b[i] = a[i] + d[i]; /* S2 */

a[i+1] = a[i] + e[i]; /* S3 */


4.3 [10] <4.1> For the following code fragment, list the control dependences. For each control dependence, tell whether the statement can be scheduled before the if statement based on the data references. Assume that all data references are shown, that all values are defined before use, and that only b and c are used again after this segment. You may ignore any possible exceptions.

4.4 [15] <4.1> Assuming the pipeline latencies from Figure 4.2, unroll the following loop as many times as necessary to schedule it without any delays, collapsing the loop overhead instructions. Assume a one-cycle delayed branch. Show the schedule. The loop computes Y[i] = a × X[i] + Y[i], the key step in a Gaussian elimination.

loop:  LD     F0,0(R1)
       MULTD  F0,F0,F2
       LD     F4,0(R2)
       ADDD   F0,F0,F4
       SD     0(R2),F0
       SUBI   R1,R1,8
       SUBI   R2,R2,8
       BNEZ   R1,loop

4.5 [15] <4.1> Assume the pipeline latencies from Figure 4.2 and a one-cycle delayed branch. Unroll the following loop a sufficient number of times to schedule it without any delays. Show the schedule after eliminating any redundant overhead instructions. The loop is a dot product (assuming F2 is initially 0) and contains a recurrence. Despite the fact that the loop is not parallel, it can be scheduled with no delays.

loop:  LD     F0,0(R1)
       LD     F4,0(R2)
       MULTD  F0,F0,F4
       ADDD   F2,F0,F2
       SUBI   R1,R1,#8
       SUBI   R2,R2,#8
       BNEZ   R1,loop

4.6 [20] <4.2> It is critical that the scoreboard be able to distinguish RAW and WAR hazards, since a WAR hazard requires stalling the instruction doing the writing until the instruction reading an operand initiates execution, while a RAW hazard requires delaying the reading instruction until the writing instruction finishes—just the opposite. For example, consider the sequence:

       MULTD  F0,F6,F4
       SUBD   F8,F0,F2
       ADDD   F2,F10,F2

The SUBD depends on the MULTD (a RAW hazard) and thus the MULTD must be allowed to complete before the SUBD; if the MULTD were stalled for the SUBD due to the inability to distinguish between RAW and WAR hazards, the processor will deadlock. This sequence contains a WAR hazard between the ADDD and the SUBD, and the ADDD cannot be


allowed to complete until the SUBD begins execution. The difficulty lies in distinguishing the RAW hazard between MULTD and SUBD, and the WAR hazard between the SUBD and ADDD.

Describe how the scoreboard for a machine with two multiply units and two add units avoids this problem and show the scoreboard values for the above sequence assuming the ADDD is the only instruction that has completed execution (though it has not written its result). (Hint: Think about how WAW hazards are prevented and what this implies about active instruction sequences.)

4.7 [12] <4.2> A shortcoming of the scoreboard approach occurs when multiple functional units that share input buses are waiting for a single result. The units cannot start simultaneously, but must serialize. This is not true in Tomasulo's algorithm. Give a code sequence that uses no more than 10 instructions and shows this problem. Assume the hardware configuration from Figure 4.3, for the scoreboard, and Figure 4.8, for Tomasulo's scheme. Use the FP latencies from Figure 4.2 (page 224). Indicate where the Tomasulo approach can continue, but the scoreboard approach must stall.

4.8 [15] <4.2> Tomasulo's algorithm also has a disadvantage versus the scoreboard: only one result can complete per clock, due to the CDB. Use the hardware configuration from Figures 4.3 and 4.8 and the FP latencies from Figure 4.2 (page 224). Find a code sequence of no more than 10 instructions where the scoreboard does not stall, but Tomasulo's algorithm must stall due to CDB contention. Indicate where this occurs in your sequence.

4.9 [45] <4.2> One benefit of a dynamically scheduled processor is its ability to tolerate changes in latency or issue capability without requiring recompilation. This was a primary motivation behind the 360/91 implementation. The purpose of this programming assignment is to evaluate this effect. Implement a version of Tomasulo's algorithm for DLX to issue one instruction per clock; your implementation should also be capable of in-order issue. Assume fully pipelined functional units and the latencies shown in Figure 4.61.

A one-cycle latency means that the unit and the result are available for the next instruction. Assume the processor takes a one-cycle stall for branches, in addition to any data-dependent stalls shown in the above table. Choose 5–10 small FP benchmarks (with loops) to run; compare the performance with and without dynamic scheduling. Try scheduling the loops by hand and see how close you can get with the statically scheduled processor to the dynamically scheduled results.


Change the processor to the configuration shown in Figure 4.62.

Rerun the loops and compare the performance of the dynamically scheduled processor and the statically scheduled processor.

4.10 [15] <4.3> Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for the conditional branches only. Assume that the misprediction penalty is always 4 cycles and the buffer miss penalty is always 3 cycles. Assume 90% hit rate and 90% accuracy, and 15% branch frequency. How much faster is the processor with the branch-target buffer versus a processor that has a fixed 2-cycle branch penalty? Assume a base CPI without branch stalls of 1.

4.11 [10] <4.3> Determine the improvement from branch folding for unconditional branches. Assume a 90% hit rate, a base CPI without unconditional branch stalls of 1, and an unconditional branch frequency of 5%. How much improvement is gained by this enhancement versus a processor whose effective CPI is 1.1?

4.12 [30] <4.4> Implement a simulator to evaluate the performance of a branch-prediction buffer that does not store branches that are predicted as untaken. Consider the following prediction schemes: a one-bit predictor storing only predicted taken branches, a two-bit predictor storing all the branches, a scheme with a target buffer that stores only predicted taken branches and a two-bit prediction buffer. Explore different sizes for the buffers keeping the total number of bits (assuming 32-bit addresses) the same for all schemes. Determine what the branch penalties are, using Figure 4.24 as a guideline. How do the different schemes compare both in prediction accuracy and in branch cost?

4.13 [30] <4.4> Implement a simulator to evaluate various branch prediction schemes. You can use the instruction portion of a set of cache traces to simulate the branch-prediction buffer. Pick a set of table sizes (e.g., 1K bits, 2K bits, 8K bits, and 16K bits). Determine the performance of both (0,2) and (2,2) predictors for the various table sizes. Also compare the performance of the degenerate predictor that uses no branch address information for these table sizes. Determine how large the table must be for the degenerate predictor to perform as well as a (0,2) predictor with 256 entries.
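The (m,n) predictors in Exercises 4.12 and 4.13 combine m bits of global branch history with a table of n-bit saturating counters. The following is a minimal sketch of that mechanism, not the simulator the exercises ask for; the table size, the right shift of the PC, and the way the history is concatenated into the index are assumptions made for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define HISTORY_BITS  2      /* m: 0 gives a (0,2) predictor, 2 gives a (2,2) predictor */
#define COUNTER_BITS  2      /* n: width of each saturating counter                     */
#define TABLE_ENTRIES 1024   /* assumed table size (must be a power of 2)               */

static uint8_t  counters[TABLE_ENTRIES];  /* n-bit saturating counters, initially 0 */
static uint32_t global_history;           /* outcomes of the last m branches        */

/* Index the table with low PC bits concatenated with the m history bits. */
static unsigned predictor_index(uint32_t branch_pc) {
    uint32_t history = global_history & ((1u << HISTORY_BITS) - 1u);
    return (((branch_pc >> 2) << HISTORY_BITS) | history) & (TABLE_ENTRIES - 1u);
}

/* Predict taken when the counter is in its upper half (2 or 3 for 2-bit counters). */
static bool predict_taken(uint32_t branch_pc) {
    return counters[predictor_index(branch_pc)] >= (1u << (COUNTER_BITS - 1));
}

/* After the branch resolves, nudge the counter toward the outcome and shift the history. */
static void predictor_update(uint32_t branch_pc, bool taken) {
    uint8_t *c = &counters[predictor_index(branch_pc)];
    uint8_t max = (uint8_t)((1u << COUNTER_BITS) - 1u);
    if (taken && *c < max) (*c)++;
    if (!taken && *c > 0)  (*c)--;
    global_history = (global_history << 1) | (taken ? 1u : 0u);
}

int main(void) {
    /* Train on a branch that is taken twice, then not taken, repeatedly. */
    bool pattern[] = { true, true, false };
    int correct = 0, total = 300;
    for (int i = 0; i < total; i++) {
        bool outcome = pattern[i % 3];
        if (predict_taken(0x400100) == outcome) correct++;
        predictor_update(0x400100, outcome);
    }
    printf("prediction accuracy: %d/%d\n", correct, total);
    return 0;
}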

4.14 [20/22/22/22/22/25/25/25/20/22/22] <4.1,4.2,4.4> In this Exercise, we will look at how a common vector loop runs on a variety of pipelined versions of DLX. The loop is the so-called SAXPY loop (discussed extensively in Appendix B) and the central operation in


Gaussian elimination. The loop implements the vector operation Y = a × X + Y for a vector of length 100. Here is the DLX code for the loop:

foo:   LD     F2,0(R1)    ;load X(i)
       MULTD  F4,F2,F0    ;multiply a*X(i)
       LD     F6,0(R2)    ;load Y(i)
       ADDD   F6,F4,F6    ;add a*X(i) + Y(i)
       SD     0(R2),F6    ;store Y(i)
       ADDI   R1,R1,#8    ;increment X index
       ADDI   R2,R2,#8    ;increment Y index
       SGTI   R3,R1,done  ;test if done
       BEQZ   R3,foo      ;loop if not done

For (a)–(e), assume that the integer operations issue and complete in one clock cycle (including loads) and that their results are fully bypassed. Ignore the branch delay. You will use the FP latencies shown in Figure 4.2 (page 224). Assume that the FP unit is fully pipelined.

a. [20] <4.1> For this problem use the standard single-issue DLX pipeline with the pipeline latencies from Figure 4.2. Show the number of stall cycles for each instruction and what clock cycle each instruction begins execution (i.e., enters its first EX cycle) on the first iteration of the loop. How many clock cycles does each loop iteration take?

b. [22] <4.1> Unroll the DLX code for SAXPY to make four copies of the body and schedule it for the standard DLX integer pipeline and a fully pipelined FPU with the FP latencies of Figure 4.2. When unwinding, you should optimize the code as we did in section 4.1. Significant reordering of the code will be needed to maximize performance. How many clock cycles does each loop iteration take?

c. [22] <4.2> Using the DLX code for SAXPY above, show the state of the scoreboard tables (as in Figure 4.4) when the SGTI instruction reaches write result. Assume that issue and read operands each take a cycle. Assume that there is one integer functional unit that takes only a single execution cycle (the latency to use is 0 cycles, including loads and stores). Assume the FP unit configuration of Figure 4.3 with the FP latencies of Figure 4.2. The branch should not be included in the scoreboard.

d. [22] <4.2> Use the DLX code for SAXPY above and a fully pipelined FPU with the latencies of Figure 4.2. Assume Tomasulo's algorithm for the hardware with one integer unit taking one execution cycle (a latency of 0 cycles to use) for all integer operations. Show the state of the reservation stations and register-status tables (as in Figure 4.9) when the SGTI writes its result on the CDB. Do not include the branch.

e. [22] <4.2> Using the DLX code for SAXPY above, assume a scoreboard with the FP functional units described in Figure 4.3, plus one integer functional unit (also used for load-store). Assume the latencies shown in Figure 4.63. Show the state of the scoreboard (as in Figure 4.4) when the branch issues for the second time. Assume the branch was correctly predicted taken and took one cycle. How many clock cycles does each loop iteration take? You may ignore any register port/bus conflicts.

f. [25] <4.2> Use the DLX code for SAXPY above. Assume Tomasulo's algorithm for the hardware using one fully pipelined FP unit and one integer unit. Assume the latencies shown in Figure 4.63.


Show the state of the reservation stations and register status tables (as in Figure 4.9) when the branch is executed for the second time. Assume the branch was correctly predicted as taken. How many clock cycles does each loop iteration take?

g. [25] <4.1,4.4> Assume a superscalar architecture that can issue any two independent operations in a clock cycle (including two integer operations). Unwind the DLX code for SAXPY to make four copies of the body and schedule it assuming the FP latencies of Figure 4.2. Assume one fully pipelined copy of each functional unit (e.g., FP adder, FP multiplier) and two integer functional units with latency to use of 0. How many clock cycles will each iteration on the original code take? When unwinding, you should optimize the code as in section 4.1. What is the speedup versus the original code?

h. [25] <4.4> In a superpipelined processor, rather than have multiple functional units, we would fully pipeline all the units. Suppose we designed a superpipelined DLX that had twice the clock rate of our standard DLX pipeline and could issue any two unrelated instructions in the same time that the normal DLX pipeline issued one operation. If the second instruction is dependent on the first, only the first will issue. Unroll the DLX SAXPY code to make four copies of the loop body and schedule it for this superpipelined processor, assuming the FP latencies of Figure 4.63. Also assume the load to use latency is 1 cycle, but other integer unit latencies are 0 cycles. How many clock cycles does each loop iteration take? Remember that these clock cycles are half as long as those on a standard DLX pipeline or a superscalar DLX.

i. [20] <4.4> Start with the SAXPY code and the processor used in Figure 4.29. Unroll the SAXPY loop to make four copies of the body, performing simple optimizations (as in section 4.1). Assume all integer unit latencies are 0 cycles and the FP latencies are given in Figure 4.2. Fill in a table like Figure 4.28 for the unrolled loop. How many clock cycles does each loop iteration take?

j. [22] <4.2,4.6> Using the DLX code for SAXPY above, assume a speculative processor with the functional unit organization used in section 4.6 and a single integer functional unit. Assume the latencies shown in Figure 4.63. Show the state of the processor (as in Figure 4.35) when the branch issues for the second time. Assume the branch was correctly predicted taken and took one cycle. How many clock cycles does each loop iteration take?

[Figure 4.63 latency table: its columns are Instruction producing result, Instruction using result, and Latency in clock cycles.]


k. [22] <4.2,4.6> Using the DLX code for SAXPY above, assume a speculative processor like Figure 4.34 that can issue one load-store, one integer operation, and one FP operation each cycle. Assume the latencies in clock cycles of Figure 4.63. Show the state of the processor (as in Figure 4.35) when the branch issues for the second time. Assume the branch was correctly predicted taken and took one cycle. How many clock cycles does each loop iteration take?

4.15 [15] <4.5> Here is a simple code fragment:

for (i=2;i<=100;i+=2)

a[i] = a[50*i+1];

To use the GCD test, this loop must first be "normalized"—written so that the index starts at 1 and increments by 1 on every iteration. Write a normalized version of the loop (change the indices as needed), then use the GCD test to see if there is a dependence.

4.16 [15] <4.1,4.5> Here is another loop:

for (i=2;i<=100;i+=2)
    a[i] = a[i-1];

Normalize the loop and use the GCD test to detect a dependence. Is there a loop-carried, true dependence in this loop?

4.17 [25] <4.5> Show that if for two array elements A(a × i + b) and A(c × i + d) there is a true dependence, then GCD(c,a) divides (d – b).
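For reference while working Exercises 4.15–4.17, here is a minimal sketch of the GCD test as a C function. The function and variable names are illustrative, and the coefficients passed in main are made-up values, not the normalized coefficients the exercises ask you to derive. Note the test is conservative: it says a dependence is possible, not that one must exist.

#include <stdio.h>
#include <stdlib.h>

/* Euclid's algorithm. */
static int gcd(int x, int y) {
    x = abs(x); y = abs(y);
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

/* GCD test for references A[a*i + b] and A[c*i + d]:
   a dependence is possible only if GCD(c, a) divides d - b. */
static int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(c, a);
    if (g == 0) return b == d;      /* both subscripts are constants */
    return (d - b) % g == 0;
}

int main(void) {
    /* Illustrative coefficients only. */
    printf("may depend: %d\n", gcd_test_may_depend(2, 0, 100, 1));
    return 0;
}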

4.18 [15] <4.5> Rewrite the software pipelining loop shown in the Example on page 294 in section 4.5, so that it can be run by simply decrementing R1 by 16 before the loop starts. After rewriting the loop, show the start-up and finish-up code. Hint: To get the loop to run properly when R1 is decremented, the SD should store the result of the original first iteration. You can achieve this by adjusting load-store offsets.

4.19 [20] <4.5> Consider the loop that we software pipelined on page 294 in section 4.5. Suppose the latency of the ADDD was five cycles. The software pipelined loop now has a stall. Show how this loop can be written using both software pipelining and loop unrolling to eliminate any stalls. The loop should be unrolled as few times as possible (once is enough). You need not show loop start-up or clean-up.

4.20 [15/15] <4.6> Consider our speculative processor from section 4.6. Since the reorder buffer contains a value field, you might think that the value field of the reservation stations could be eliminated.

a. [15] <4.6> Show an example where this is the case and an example where the value field of the reservation stations is still needed. Use the speculative machine shown in Figure 4.34. Show DLX code for both examples. How many value fields are needed in each reservation station?

b. [15] <4.6> Find a modification to the rules for instruction commit that allows elimination of the value fields in the reservation station. What are the negative side effects of such a change?


4.21 [20] <4.6> Our implementation of speculation uses a reorder buffer and introduces the concept of instruction commit, delaying commit and the irrevocable updating of the registers until we know an instruction will complete. There are two other possible implementation techniques, both originally developed as a method for preserving precise interrupts when issuing out of order. One idea introduces a future file that keeps future values of a register; this idea is similar to the reorder buffer. An alternative is to keep a history buffer that records values of registers that have been speculatively overwritten.

Design a speculative processor like the one in section 4.6 but using a history buffer. Show the state of the processor, including the contents of the history buffer, for the example in Figure 4.36. Show the changes needed to Figure 4.37 for a history buffer implementation. Describe exactly how and when entries in the history buffer are read and written, including what happens on an incorrect speculation.

4.22 [30/30] <4.8> This exercise involves a programming assignment to evaluate what types of parallelism might be expected in more modest, and more realistic, processors than those studied in section 4.7. These studies can be done using traces available with this text or obtained from other tracing programs. For simplicity, assume perfect caches. For a more ambitious project, assume a real cache. To simplify the task, make the following assumptions:

■ Assume perfect branch and jump prediction: hence you can use the trace as the input to the window, without having to consider branch effects—the trace is perfect.

■ Assume there are 64 spare integer and 64 spare floating-point registers; this is easily implemented by stalling the issue of the processor whenever there are more live registers required.

■ Assume a window size of 64 instructions (the same for alias detection). Use greedy scheduling of instructions in the window. That is, at any clock cycle, pick for execution the first n instructions in the window that meet the issue constraints.

a. [30] <4.8> Determine the effect of limited instruction issue by performing the following experiments:

■ Vary the issue count from 4–16 instructions per clock.

■ Assuming eight issues per clock: determine what the effect of restricting the processor to two memory references per clock is.

b. [30] <4.8> Determine the impact of latency in instructions. Assume the following latency models for a processor that issues up to 16 instructions per clock:

■ Model 1: All latencies are one clock.

■ Model 2: Load latency and branch latency are one clock; all FP latencies are two clocks.

■ Model 3: Load and branch latency is two clocks; all FP latencies are five clocks.

Remember that with limited issue and a greedy scheduler, the impact of latency effects will be greater.


4.23 [Discussion] <4.3,4.6> Dynamic instruction scheduling requires a considerable investment in hardware. In return, this capability allows the hardware to run programs that could not be run at full speed with only compile-time, static scheduling. What trade-offs should be taken into account in trying to decide between a dynamically and a statically scheduled implementation? What situations in either hardware technology or program characteristics are likely to favor one approach or the other? Most speculative schemes rely on dynamic scheduling; how does speculation affect the arguments in favor of dynamic scheduling?

4.24 [Discussion] <4.3> There is a subtle problem that must be considered when implementing Tomasulo's algorithm. It might be called the "two ships passing in the night problem." What happens if an instruction is being passed to a reservation station during the same clock period as one of its operands is going onto the common data bus? Before an instruction is in a reservation station, the operands are fetched from the register file; but once it is in the station, the operands are always obtained from the CDB. Since the instruction and its operand tag are in transit to the reservation station, the tag cannot be matched against the tag on the CDB. So there is a possibility that the instruction will then sit in the reservation station forever waiting for its operand, which it just missed. How might this problem be solved? You might consider subdividing one of the steps in the algorithm into multiple parts. (This intriguing problem is courtesy of J. E. Smith.)

4.25 [Discussion] <4.4-4.6> Discuss the advantages and disadvantages of a superscalar implementation, a superpipelined implementation, and a VLIW approach in the context of DLX. What levels of ILP favor each approach? What other concerns would you consider in choosing which type of processor to build? How does speculation affect the results?


A. W. Burks, H. H. Goldstine, and J. von Neumann

Preliminary Discussion of the Logical Design of an Electronic Computing Instrument (1946)


of the hierarchy. Note that each level maps addresses from a larger memory to a smaller but faster memory higher in the hierarchy. As part of address mapping,


the memory hierarchy is given the responsibility of address checking; hence protection schemes for scrutinizing addresses are also part of the memory hierarchy. The importance of the memory hierarchy has increased with advances in performance of processors. For example, in 1980 microprocessors were often designed without caches, while in 1995 they often come with two levels of caches.

As noted in Chapter 1, microprocessor performance improved 35% per year until 1986, and 55% per year since 1987. Figure 5.1 plots CPU performance projections against the historical performance improvement in main memory access time. Clearly there is a processor-memory performance gap that computer architects must try to close.

In addition to giving us the trends that highlight the importance of the memory hierarchy, Chapter 1 gives us a formula to evaluate the effectiveness of the memory hierarchy:

Memory stall cycles = Instruction count × Memory references per instruction × Miss rate × Miss penalty
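As a quick numeric check of this formula, here is a minimal sketch in C; the instruction count, references per instruction, miss rate, and miss penalty plugged in are assumed values for illustration, not figures from the text.

#include <stdio.h>

int main(void) {
    /* Memory stall cycles = IC x refs per instruction x miss rate x miss penalty */
    double instruction_count = 1e9;    /* assumed */
    double refs_per_instr    = 1.35;   /* assumed: one fetch plus 0.35 data refs  */
    double miss_rate         = 0.02;   /* assumed */
    double miss_penalty      = 50.0;   /* clock cycles, assumed                   */

    double stall_cycles = instruction_count * refs_per_instr * miss_rate * miss_penalty;
    printf("Memory stall cycles = %.0f\n", stall_cycles);
    return 0;
}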

FIGURE 5.1 Starting with 1980 performance as a baseline, the performance of memory and CPUs are plotted over time. The memory baseline is 64-KB DRAM in 1980, with three years to the next generation and a 7% per year performance improvement in latency (see Figure 5.30 on page 429). The CPU line assumes a 1.35 improvement per year until 1986, and a 1.55 improvement thereafter. Note that the vertical axis must be on a logarithmic scale to record the size of the CPU-DRAM performance gap.



where Miss rate is the fraction of accesses that are not in the cache and Miss penalty is the additional clock cycles to service the miss. Recall that a block is the minimum unit of information that can be present in the cache (hit in the cache) or not (miss in the cache).

This chapter uses a related formula to evaluate many examples of using the principle of locality to improve performance while keeping the memory system affordable. This common principle allows us to pose four questions about any level of the hierarchy:

Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

The answers to these questions help us understand the different trade-offs of memories at different levels of a hierarchy; hence we ask these four questions on every example.

To put these abstract ideas into practice, throughout the chapter we show examples from the four levels of the memory hierarchy in a computer using the Alpha AXP 21064 microprocessor. Toward the end of the chapter we evaluate the impact of these levels on performance using the SPEC92 benchmark programs.

Cache: a safe place for hiding or storing things.

Webster’s New World Dictionary of the American Language, Second College Edition (1976)

Cache is the name generally given to the first level of the memory hierarchy encountered once the address leaves the CPU. Since the principle of locality applies at many levels, and taking advantage of locality to improve performance is so popular, the term cache is now applied whenever buffering is employed to reuse commonly occurring items; examples include file caches, name caches, and so on. We start our description of caches by answering the four common questions for the first level of the memory hierarchy; you'll see similar questions and answers later.


Q1: Where can a block be placed in a cache?

Figure 5.2 shows that the restrictions on where a block is placed create three categories of cache organization:

■ If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The mapping is usually

(Block address) MOD (Number of blocks in cache)

FIGURE 5.2 This example cache has eight block frames and memory has 32 blocks. Real caches contain hundreds of block frames and real memories contain millions of blocks. The set-associative organization has four sets with two blocks per set, called two-way set associative. Assume that there is nothing in the cache and that the block address in question identifies lower-level block 12. The three options for caches are shown left to right. In fully associative, block 12 from the lower level can go into any of the eight block frames of the cache. With direct mapped, block 12 can only be placed into block frame 4 (12 modulo 8). Set associative, which has some of both features, allows the block to be placed anywhere in set 0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block 0 or block 1 of the cache.

[Figure 5.2 diagram: the three panels show, for a cache of eight block frames and a memory of 32 blocks, that with full associativity block 12 can go anywhere, with direct mapping block 12 can go only into block frame 4 (12 mod 8), and with two-way set associativity block 12 can go anywhere in set 0 (12 mod 4).]


■ If a block can be placed anywhere in the cache, the cache is said to be fully associative.

■ If a block can be placed in a restricted set of places in the cache, the cache is said to be set associative. A set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. The set is usually chosen by bit selection; that is,

(Block address) MOD (Number of sets in cache)

If there are n blocks in a set, the cache placement is called n-way set associative. The range of caches from direct mapped to fully associative is really a continuum of levels of set associativity: Direct mapped is simply one-way set associative and a fully associative cache with m blocks could be called m-way set associative; equivalently, direct mapped can be thought of as having m sets and fully associative as having one set. The vast majority of processor caches today are direct mapped, two-way set associative, or four-way set associative, for reasons we shall see shortly.
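The three placement policies can be made concrete with a small sketch that mirrors the block-12 example of Figure 5.2; the function name and the treatment of the cache as just a frame count plus an associativity are assumptions made for illustration.

#include <stdio.h>

/* Which set a block maps to, for a cache with num_frames block frames organized as
   n-way set associative (assoc = 1 is direct mapped; assoc = num_frames is fully
   associative, i.e., a single set). Within the chosen set the block may occupy any
   of the assoc frames. */
static unsigned set_of_block(unsigned block_address, unsigned num_frames, unsigned assoc) {
    unsigned num_sets = num_frames / assoc;
    return block_address % num_sets;   /* (Block address) MOD (Number of sets in cache) */
}

int main(void) {
    unsigned block = 12, frames = 8;
    printf("direct mapped:     frame %u\n", set_of_block(block, frames, 1));      /* 12 mod 8 = 4 */
    printf("2-way set assoc.:  set   %u\n", set_of_block(block, frames, 2));      /* 12 mod 4 = 0 */
    printf("fully associative: set   %u\n", set_of_block(block, frames, frames)); /* always set 0 */
    return 0;
}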

Q2: How is a block found if it is in the cache?

Caches have an address tag on each block frame that gives the block address. The tag of every cache block that might contain the desired information is checked to see if it matches the block address from the CPU. As a rule, all possible tags are searched in parallel because speed is critical.

There must be a way to know that a cache block does not have valid information. The most common procedure is to add a valid bit to the tag to say whether or not this entry contains a valid address. If the bit is not set, there cannot be a match on this address.

Before proceeding to the next question, let's explore the relationship of a CPU address to the cache. Figure 5.3 shows how an address is divided. The first division is between the block address and the block offset. The block frame address can be further divided into the tag field and the index field. The block offset field selects the desired data from the block, the index field selects the set, and the tag field is compared against it for a hit. While the comparison could be made on more of the address than the tag, there is no need because of the following:

■ Checking the index would be redundant, since it was used to select the set to be checked; an address stored in set 0, for example, must have 0 in the index field or it couldn't be stored in set 0.

■ The offset is unnecessary in the comparison since the entire block is present or not, and hence all block offsets must match.


If the total cache size is kept the same, increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag. That is, the tag-index boundary in Figure 5.3 moves to the right with increasing associativity, with the end case of fully associative caches having no index field.

Q3: Which block should be replaced on a cache miss?

When a miss occurs, the cache controller must select a block to be replaced with the desired data. A benefit of direct-mapped placement is that hardware decisions are simplified—in fact, so simple that there is no choice: Only one block frame is checked for a hit, and only that block can be replaced. With fully associative or set-associative placement, there are many blocks to choose from on a miss. There are two primary strategies employed for selecting which block to replace:

Random—To spread allocation uniformly, candidate blocks are randomly selected. Some systems generate pseudorandom block numbers to get reproducible behavior, which is particularly useful when debugging hardware.

Least-recently used (LRU)—To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. The block replaced is the one that has been unused for the longest time. LRU makes use of a corollary of locality: If recently used blocks are likely to be used again, then the best candidate for disposal is the least-recently used block.

A virtue of random replacement is that it is simple to build in hardware. As the number of blocks to keep track of increases, LRU becomes increasingly expensive and is frequently only approximated. Figure 5.4 shows the difference in miss rates between LRU and random replacement.
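A minimal sketch of victim selection under the two strategies follows; the fixed 4-way set and the use of a simple access-counter stamp as the recency record are assumptions for illustration (real hardware typically keeps a few ordering bits per set rather than full counters).

#include <stdio.h>
#include <stdlib.h>

#define WAYS 4                       /* assumed associativity for this sketch */

struct frame {
    int      valid;
    unsigned last_used;              /* value of an access counter at the last touch */
};

/* Random replacement: any frame in the set is a candidate. */
static int victim_random(const struct frame set[WAYS]) {
    (void)set;
    return rand() % WAYS;
}

/* LRU replacement: prefer an invalid frame, otherwise the frame touched longest ago. */
static int victim_lru(const struct frame set[WAYS]) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid) return i;
        if (set[i].last_used < set[victim].last_used) victim = i;
    }
    return victim;
}

int main(void) {
    struct frame set[WAYS] = { {1, 10}, {1, 42}, {1, 7}, {1, 99} };
    printf("LRU victim: way %d\n", victim_lru(set));        /* way 2, stamp 7 */
    printf("random victim: way %d\n", victim_random(set));
    return 0;
}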

Q4: What happens on a write?

Reads dominate processor cache accesses. All instruction accesses are reads, and most instructions don't write to memory. Figure 2.26 on page 105 in Chapter 2 suggests a mix of 9% stores and 26% loads for DLX programs, making writes 9%/(100% + 26% + 9%) or about 7% of the overall memory traffic and

FIGURE 5.3 The three portions of an address in a set-associative or direct-mapped cache. The tag is used to check all the blocks in the set and the index is used to select the set. The block offset is the address of the desired data within the block.


9%/(26% + 9%) or about 25% of the data cache traffic. Making the common case fast means optimizing caches for reads, especially since processors traditionally wait for reads to complete but need not wait for writes. Amdahl's Law (section 1.6, page 29) reminds us, however, that high-performance designs cannot neglect the speed of writes.

Fortunately, the common case is also the easy case to make fast. The block can be read from cache at the same time that the tag is read and compared, so the block read begins as soon as the block address is available. If the read is a hit, the requested part of the block is passed on to the CPU immediately. If it is a miss, there is no benefit—but also no harm; just ignore the value read.

Such is not the case for writes. Modifying a block cannot begin until the tag is checked to see if the address is a hit. Because tag checking cannot occur in parallel, writes normally take longer than reads. Another complexity is that the processor also specifies the size of the write, usually between 1 and 8 bytes; only that portion of a block can be changed. In contrast, reads can access more bytes than necessary without fear.

The write policies often distinguish cache designs. There are two basic options when writing to the cache:

Write through (or store through)—The information is written to both the block in the cache and to the block in the lower-level memory.

Write back (also called copy back or store in)—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

To reduce the frequency of writing back blocks on replacement, a feature called the dirty bit is commonly used. This status bit indicates whether the block is dirty (modified while in the cache) or clean (not modified). If it is clean, the



block is not written on a miss, since the lower level has identical information to the cache.

Both write back and write through have their advantages. With write back, writes occur at the speed of the cache memory, and multiple writes within a block require only one write to the lower-level memory. Since some writes don't go to memory, write back uses less memory bandwidth, making write back attractive in multiprocessors. With write through, read misses never result in writes to the lower level, and write through is easier to implement than write back. Write through also has the advantage that the next lower level has the most current copy of the data. This is important for I/O and for multiprocessors, which we examine in Chapters 6 and 8. As we shall see, I/O and multiprocessors are fickle: they want write back for processor caches to reduce the memory traffic and write through to keep the cache consistent with lower levels of the memory hierarchy.

When the CPU must wait for writes to complete during write through, the CPU is said to write stall. A common optimization to reduce write stalls is a write buffer, which allows the processor to continue as soon as the data is written to the buffer, thereby overlapping processor execution with memory updating. As we shall see shortly, write stalls can occur even with write buffers.

Since the data are not needed on a write, there are two common options on a write miss:

Write allocate (also called fetch on write)—The block is loaded on a write miss, followed by the write-hit actions above. This is similar to a read miss.

No-write allocate (also called write around)—The block is modified in the lower level and not loaded into the cache.

Although either write-miss policy could be used with write through or write back, write-back caches generally use write allocate (hoping that subsequent writes to that block will be captured by the cache) and write-through caches often use no-write allocate (since subsequent writes to that block will still have to go to memory).
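The four policy combinations can be summarized as a small decision sketch. The helper functions below are hypothetical stand-ins for the cache and memory datapaths, stubbed out so the sketch compiles and runs; nothing here is an API from the text.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical datapath hooks, stubbed out for illustration only. */
static bool cache_hit(uint64_t addr)          { (void)addr; return false; }
static void allocate_block(uint64_t addr)     { printf("allocate block for %llx\n", (unsigned long long)addr); }
static void write_cache_block(uint64_t addr)  { printf("write cache at %llx\n", (unsigned long long)addr); }
static void write_lower_level(uint64_t addr)  { printf("write lower level at %llx\n", (unsigned long long)addr); }
static void mark_dirty(uint64_t addr)         { (void)addr; }

/* One CPU store, under a chosen write-hit policy and write-miss policy. */
static void handle_write(uint64_t addr, bool write_back, bool write_allocate) {
    if (!cache_hit(addr)) {
        if (!write_allocate) {            /* no-write allocate (write around) */
            write_lower_level(addr);
            return;
        }
        allocate_block(addr);             /* write allocate: load the block, then act as a hit */
    }
    write_cache_block(addr);
    if (write_back)
        mark_dirty(addr);                 /* memory is updated later, on replacement */
    else
        write_lower_level(addr);          /* write through: lower level updated now */
}

int main(void) {
    handle_write(0x1000, /*write_back=*/false, /*write_allocate=*/false); /* write through, write around */
    handle_write(0x2000, /*write_back=*/true,  /*write_allocate=*/true);  /* write back, write allocate  */
    return 0;
}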

An Example: The Alpha AXP 21064 Data Cache and Instruction Cache

To give substance to these ideas, Figure 5.5 shows the organization of the data cache in the Alpha AXP 21064 microprocessor that is found in the DEC 3000 Model 800 workstation. The cache contains 8192 bytes of data in 32-byte blocks with direct-mapped placement, write through with a four-block write buffer, and no-write allocate on a write miss.

Let's trace a cache hit through the steps of a hit as labeled in Figure 5.5. (The four steps are shown as circled numbers.) As we shall see later (Figure 5.41), the 21064 microprocessor presents a 34-bit physical address to the cache for tag comparison. The address coming into the cache is divided into two fields: the 29-bit block address and 5-bit block offset. The block address is further divided into an address tag and cache index. Step 1 shows this division.


The cache index selects the tag to be tested to see if the desired block is in the cache. The size of the index depends on cache size, block size, and set associativity. The 21064 cache is direct mapped, so set associativity is set to one, and we calculate the index as follows:

2^index = Cache size / (Block size × Set associativity) = 8192 / (32 × 1) = 256 = 2^8

Hence the index is 8 bits wide, and the tag is 29 – 8 or 21 bits wide.
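A quick check of that arithmetic for the 21064 parameters (8192-byte cache, 32-byte blocks, direct mapped, 29-bit block address), written as a short sketch:

#include <stdio.h>

int main(void) {
    unsigned cache_size      = 8192;   /* bytes         */
    unsigned block_size      = 32;     /* bytes         */
    unsigned associativity   = 1;      /* direct mapped */
    unsigned block_addr_bits = 29;     /* 34-bit physical address minus 5 offset bits */

    unsigned num_sets = cache_size / (block_size * associativity);    /* 256 */
    unsigned index_bits = 0;
    while ((1u << index_bits) < num_sets) index_bits++;               /* log2(256) = 8 */

    printf("index bits = %u, tag bits = %u\n", index_bits, block_addr_bits - index_bits);
    /* Prints: index bits = 8, tag bits = 21 */
    return 0;
}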

Index selection is step 2 in Figure 5.5. Remember that direct mapping allows the data to be read and sent to the CPU in parallel with the tag being read and checked.

After reading the tag from the cache, it is compared to the tag portion of the block address from the CPU. This is step 3 in the figure. To be sure the tag

FIGURE 5.5 The organization of the data cache in the Alpha AXP 21064 microprocessor. The 8-KB cache is direct mapped with 32-byte blocks. It has 256 blocks selected by the 8-bit index. The four steps of a read hit, shown as circled numbers in order of occurrence, label this organization. Although we show a 4:1 multiplexer to select the desired 8 bytes, in reality the data RAM is organized 8 bytes wide and the multiplexer is unnecessary: 2 bits of the block offset join the index to supply the RAM address to select the proper 8 bytes (see Figure 5.8). Although not exercised in this example, the line from memory to the cache is used on a miss to load the cache.



contains valid information, the valid bit must be set or else the results of the comparison are ignored.

Assuming the tag does match, the final step is to signal the CPU to load the data from the cache. The 21064 allows two clock cycles for these four steps, so the instructions in the following two clock cycles would stall if they tried to use the result of the load.

Handling writes is more complicated than handling reads in the 21064, as it is in any cache. If the word to be written is in the cache, the first three steps are the same. After the tag comparison indicates a hit, the data are written. (Section 5.5 shows how the 21064 avoids the extra time on write hits that this description implies.)

Since this is a write-through cache, the write process isn't yet over. The data are also sent to a write buffer that can contain up to four blocks that each can hold four 64-bit words. If the write buffer is empty, the data and the full address are written in the buffer, and the write is finished from the CPU's perspective; the CPU continues working while the write buffer prepares to write the word to memory. If the buffer contains other modified blocks, the addresses are checked to see if the address of this new data matches the address of the valid write buffer entry; if so, the new data are combined with that entry, called write merging. Without this optimization, four stores to sequential addresses would fill the buffer, even though these four words easily fit within a single block of the write buffer when merged. Figure 5.6 shows a write buffer with and without write merging. If the buffer is full and there is no address match, the cache (and CPU) must wait until the buffer has an empty entry.
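A minimal sketch of the write-merging check described above follows. The buffer geometry (four entries of four 64-bit words, one 32-byte block per entry) comes from the 21064 description in the text, while the data structures and function name are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ENTRIES         4
#define WORDS_PER_ENTRY 4                       /* four 64-bit words = one 32-byte block */

struct wb_entry {
    bool     valid;
    uint64_t block_addr;                        /* address of the 32-byte block */
    uint64_t data[WORDS_PER_ENTRY];
    bool     word_valid[WORDS_PER_ENTRY];
};

static struct wb_entry write_buffer[ENTRIES];

/* Try to place a 64-bit store in the write buffer. Returns false if the buffer is
   full and no entry matches, in which case the CPU must stall. */
static bool write_buffer_put(uint64_t addr, uint64_t value) {
    uint64_t block = addr / 32;
    unsigned word  = (unsigned)((addr % 32) / 8);

    for (int i = 0; i < ENTRIES; i++)           /* write merging: reuse a matching entry */
        if (write_buffer[i].valid && write_buffer[i].block_addr == block) {
            write_buffer[i].data[word] = value;
            write_buffer[i].word_valid[word] = true;
            return true;
        }

    for (int i = 0; i < ENTRIES; i++)           /* otherwise take any empty entry */
        if (!write_buffer[i].valid) {
            write_buffer[i].valid = true;
            write_buffer[i].block_addr = block;
            write_buffer[i].data[word] = value;
            write_buffer[i].word_valid[word] = true;
            return true;
        }

    return false;                               /* full: stall until an entry drains */
}

int main(void) {
    /* Four stores to sequential addresses in one 32-byte block merge into one entry. */
    for (uint64_t a = 0x100; a < 0x120; a += 8)
        write_buffer_put(a, a);
    int used = 0;
    for (int i = 0; i < ENTRIES; i++) used += write_buffer[i].valid;
    printf("entries used: %d\n", used);         /* 1 with merging; 4 without */
    return 0;
}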

So far we have assumed the common case of a cache hit. What happens on a miss? On a read miss, the cache sends a stall signal to the CPU telling it to wait, and 32 bytes are read from the next level of the hierarchy. The path to the next lower level is 16 bytes wide in the DEC 3000 model 800 workstation, one of several models that use the 21064. That takes 5 clock cycles per transfer, or 10 clock cycles for all 32 bytes. Since the data cache is direct mapped, there is no choice on which block to replace. Replacing a block means updating the data, the address tag, and the valid bit. On a write miss, the CPU writes "around" the cache to lower-level memory and does not affect the cache; that is, the 21064 follows the no-write-allocate rule.

We have seen how it works, but the data cache cannot supply all the memory needs of the processor: the processor also needs instructions. Although a single cache could try to supply both, it can be a bottleneck. For example, when a load or store instruction is executed, the pipelined processor will simultaneously request both a data word and an instruction word. Hence a single cache would present a structural hazard for loads and stores, leading to stalls. One simple way to conquer this problem is to divide it: one cache is dedicated to instructions and another to data. Separate caches are found in most recent processors, including the Alpha AXP 21064. It has an 8-KB instruction cache that is nearly identical to its 8-KB data cache in Figure 5.5.


The CPU knows whether it is issuing an instruction address or a data address, so there can be separate ports for both, thereby doubling the bandwidth between the memory hierarchy and the CPU. Separate caches also offer the opportunity of optimizing each cache separately: different capacities, block sizes, and associativities may lead to better performance. (In contrast to the instruction caches and data caches of the 21064, the terms unified or mixed are applied to caches that can contain either instructions or data.)

Figure 5.7 shows that instruction caches have lower miss rates than data caches. Separating instructions and data removes misses due to conflicts between instruction blocks and data blocks, but the split also fixes the cache space devoted to each type. Which is more important to miss rates? A fair comparison of separate instruction and data caches to unified caches requires the total cache size to be the same. For example, a separate 1-KB instruction cache and 1-KB data cache should be compared to a 2-KB unified cache. Calculating the average miss rate with separate instruction and data caches necessitates knowing the percentage of memory references to each cache. Figure 2.26 on page 105 suggests the

FIGURE 5.6 To illustrate write merging, the write buffer on top does not use it while the write buffer on the bottom does. Each buffer has four entries, and each entry holds four 64-bit words. The address for each entry is on the left, with valid bits (V) indicating whether or not the next sequential four bytes are occupied in this entry. The four writes are merged into a single buffer entry with write merging; without it, all four entries are used. Without write merging, the blocks to the right in the upper drawing would only be used for instructions that wrote multiple words at the same time. (The Alpha is a 64-bit architecture so its buffer is really 8 bytes per word.)


Figure 2.26 on page 105 suggests the split is 100%/(100% + 26% + 9%) or about 75% instruction references to (26% + 9%)/(100% + 26% + 9%) or about 25% data references. Splitting affects performance beyond what is indicated by the change in miss rates, as we shall see in a little bit.

Cache Performance

Because instruction count is independent of the hardware, it is tempting to evaluate CPU performance using that number. As we saw in Chapter 1, however, such indirect performance measures have waylaid many a computer designer. The corresponding temptation for evaluating memory-hierarchy performance is to concentrate on miss rate, because it, too, is independent of the speed of the hardware. As we shall see, miss rate can be just as misleading as instruction count. A better measure of memory-hierarchy performance is the average time to access memory:

Average memory access time = Hit time + Miss rate × Miss penalty

where Hit time is the time to hit in the cache; we have seen the other two terms before. The components of average access time can be measured either in absolute time—say, 2 nanoseconds on a hit—or in the number of clock cycles that the CPU waits for the memory—such as a miss penalty of 50 clock cycles. Remember that average memory access time is still an indirect measure of performance; although it is a better measure than miss rate, it is not a substitute for execution time.

This formula can help us decide between split caches and a unified cache.

E X A M P L E   Which has the lower miss rate: a 16-KB instruction cache with a 16-KB data cache or a 32-KB unified cache? Use the miss rates in Figure 5.7 to help calculate the correct answer. Assume a hit takes 1 clock cycle and the miss penalty is 50 clock cycles, and a load or store hit takes 1 extra clock cycle on a unified cache since there is only one cache port to satisfy


two simultaneous requests. Using the pipelining terminology of the previous chapter, the unified cache leads to a structural hazard. What is the average memory access time in each case? Assume write-through caches with a write buffer and ignore stalls due to the write buffer.

A N S W E R   As stated above, about 75% of the memory accesses are instruction references. Thus, the overall miss rate for the split caches is (75% × 0.64%) + (25% × 6.47%) = 2.10%; according to Figure 5.7, the 32-KB unified cache has a slightly lower miss rate.

The average memory access time formula can be divided into instruction and data accesses:

Average memory access time = % instructions × (Hit time + Instruction miss rate × Miss penalty) + % data × (Hit time + Data miss rate × Miss penalty)

So the time for the split caches is

Average memory access time_split = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) ≈ 2.05 clock cycles

and the unified cache is evaluated the same way from its 32-KB miss rate in Figure 5.7, remembering that a load or store hit takes one extra clock cycle. Hence the split caches in this example—which offer two memory ports per clock cycle, thereby avoiding the structural hazard—have a better average memory access time than the single-ported unified cache even though their overall miss rate is slightly higher.
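The arithmetic of this example is easy to script. The sketch below only restates the calculation; the split-cache miss rates and the 75%/25% reference mix come from the example, while the 32-KB unified miss rate is an illustrative stand-in for the value in Figure 5.7, which is not reproduced here.

```python
MISS_PENALTY = 50        # clock cycles, from the example
INSTR_FRACTION = 0.75    # about 75% instruction references
DATA_FRACTION = 0.25     # about 25% data references

def amat(hit_time, miss_rate, miss_penalty=MISS_PENALTY):
    return hit_time + miss_rate * miss_penalty

# Split caches: 16-KB instruction cache (0.64% misses), 16-KB data cache (6.47% misses).
amat_split = (INSTR_FRACTION * amat(1, 0.0064) +
              DATA_FRACTION * amat(1, 0.0647))

# Unified 32-KB cache: a load or store hit pays one extra cycle for the single port.
# The miss rate below stands in for the Figure 5.7 value, which is not shown in this excerpt.
unified_miss_rate = 0.0199
amat_unified = (INSTR_FRACTION * amat(1, unified_miss_rate) +
                DATA_FRACTION * amat(1 + 1, unified_miss_rate))

print(amat_split, amat_unified)   # ≈ 2.05 and ≈ 2.24 clock cycles
```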

In Chapter 1 we saw another formula for the memory hierarchy:

CPU time = (CPU execution clock cycles + Memory stall clock cycles) × Clock cycle time

To simplify evaluation of cache alternatives, sometimes designers assume that all memory stalls are due to cache misses since the memory hierarchy typically dominates other reasons for stalls, such as contention due to I/O devices using memory. We use this simplifying assumption here, but it is important to account for all memory stalls when calculating final performance!

The CPU time formula above raises the question whether the clock cycles for a cache hit should be considered part of CPU execution clock cycles or part of memory stall clock cycles. Although either convention is defensible, the most widely accepted is to include hit clock cycles in CPU execution clock cycles.



Memory stall clock cycles can then be defined in terms of the number of memory accesses per program, miss penalty (in clock cycles), and miss rate for reads and writes:

Memory stall clock cycles = Reads × Read miss rate × Read miss penalty + Writes × Write miss rate × Write miss penalty

We often simplify the complete formula by combining the reads and writes and finding the average miss rates and miss penalty for reads and writes:

Memory stall clock cycles = Memory accesses × Miss rate × Miss penalty

This formula is an approximation since the miss rates and miss penalties are often different for reads and writes.
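A quick numerical check shows why the combined form is exact when the read and write penalties match and only an approximation otherwise; the reference counts and miss rates below are hypothetical.

```python
reads, writes = 800, 200                    # hypothetical reference counts
read_miss_rate, write_miss_rate = 0.04, 0.10
read_penalty = write_penalty = 40           # equal penalties make the combined form exact

separate = (reads * read_miss_rate * read_penalty +
            writes * write_miss_rate * write_penalty)
average_miss_rate = (reads * read_miss_rate + writes * write_miss_rate) / (reads + writes)
combined = (reads + writes) * average_miss_rate * read_penalty

print(separate, combined)                   # 2080.0 2080.0 stall cycles; they differ once penalties differ
```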

Factoring instruction count (IC) from execution time and memory stall cycles, we now get a CPU time formula that includes memory accesses per instruction, miss rate, and miss penalty:

CPU time = IC × (CPI_execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time

Some designers prefer measuring miss rate as misses per instruction rather than misses per memory reference:

Misses/Instruction = Miss rate × Memory accesses/Instruction

The advantage of this measure is that it is independent of the hardware implementation. For example, the 21064 instruction prefetch unit can make repeated references to a single word (see section 5.10), which can artificially reduce the miss rate if measured as misses per memory reference rather than per instruction executed. The drawback is that misses per instruction is architecture dependent; for example, the average number of memory accesses per instruction may be very different for an 80x86 versus DLX. Thus misses per instruction is most popular with architects working with a single computer family. They then use this version of the CPU time formula:

CPU time = IC × (CPI_execution + Misses/Instruction × Miss penalty) × Clock cycle time
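The two versions of the CPU time formula differ only in whether the miss term is carried per memory reference or per instruction. The helper below is a minimal sketch of that bookkeeping; the numbers in the usage lines are hypothetical and chosen only to show that the two forms agree when misses per instruction equals miss rate times memory accesses per instruction.

```python
def cpu_time_per_reference(ic, cpi_execution, accesses_per_instr,
                           miss_rate, miss_penalty, clock_cycle_time):
    """CPU time with the miss term expressed per memory reference."""
    stall_cpi = accesses_per_instr * miss_rate * miss_penalty
    return ic * (cpi_execution + stall_cpi) * clock_cycle_time

def cpu_time_per_instruction(ic, cpi_execution, misses_per_instr,
                             miss_penalty, clock_cycle_time):
    """CPU time with the miss term expressed per instruction."""
    return ic * (cpi_execution + misses_per_instr * miss_penalty) * clock_cycle_time

# Hypothetical numbers: 1 billion instructions, CPI 1.5, 1.4 accesses/instruction,
# 3% miss rate, 40-cycle miss penalty, 2-ns clock.
t1 = cpu_time_per_reference(1e9, 1.5, 1.4, 0.03, 40, 2e-9)
t2 = cpu_time_per_instruction(1e9, 1.5, 0.03 * 1.4, 40, 2e-9)
assert abs(t1 - t2) < 1e-9   # identical, since misses/instruction = miss rate × accesses/instruction
```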

We can now explore the impact of caches on performance

E X A M P L E   Let's use a machine similar to the Alpha AXP as a first example. Assume the cache miss penalty is 50 clock cycles, and all instructions normally take 2.0 clock cycles (ignoring memory stalls). Assume the miss rate is 2%, and



there is an average of 1.33 memory references per instruction. What is the impact on performance when behavior of the cache is included?

A N S W E R   The performance can be computed from

CPU time = IC × (CPI_execution + Memory stall clock cycles/Instruction) × Clock cycle time

The performance, including cache misses, is

CPU time_with cache = IC × (2.0 + (1.33 × 2% × 50)) × Clock cycle time
                    = IC × 3.33 × Clock cycle time

The clock cycle time and instruction count are the same, with or without a cache, so CPU time increases with CPI from 2.0 for a "perfect cache" to 3.33 with a cache that can miss. Hence, including the memory hierarchy in the CPI calculations stretches the CPU time by a factor of 1.67. Without any memory hierarchy at all the CPI would increase to 2.0 + 50 × 1.33 or 68.5 clock cycles per instruction.
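The example's numbers can be rechecked directly; this fragment merely restates the arithmetic above.

```python
cpi_perfect = 2.0
refs_per_instr = 1.33
miss_rate, miss_penalty = 0.02, 50

cpi_with_cache = cpi_perfect + refs_per_instr * miss_rate * miss_penalty
cpi_no_hierarchy = cpi_perfect + refs_per_instr * miss_penalty   # every access stalls for main memory

print(cpi_with_cache)                  # ≈ 3.33 clock cycles per instruction
print(cpi_with_cache / cpi_perfect)    # ≈ 1.67: the stretch in CPU time
print(cpi_no_hierarchy)                # 68.5 with no memory hierarchy at all
```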

As this example illustrates, cache behavior can have enormous impact on performance. Furthermore, cache misses have a double-barreled impact on a CPU with a low CPI and a fast clock:

1. The lower the CPI_execution, the higher the relative impact of a fixed number of cache miss clock cycles.

2. When calculating CPI, the cache miss penalty is measured in CPU clock cycles for a miss. Therefore, even if memory hierarchies for two computers are identical, the CPU with the higher clock rate has a larger number of clock cycles per miss and hence the memory portion of CPI is higher.

The importance of the cache for CPUs with low CPI and high clock rates is thus greater, and, consequently, greater is the danger of neglecting cache behavior in assessing performance of such machines. Amdahl's Law strikes again!

Although minimizing average memory access time is a reasonable goal and we will use it in much of this chapter, keep in mind that the final goal is to reduce CPU execution time. The next example shows how these two can differ.

E X A M P L E   What is the impact of two different cache organizations on the performance of a CPU? Assume that the CPI with a perfect cache is 2.0 and the clock cycle time is 2 ns, that there are 1.3 memory references per instruction, and that the size of both caches is 64 KB and both have a block size of 32 bytes. One cache is direct mapped and the other is two-way set associative. Figure 5.8 shows that for set-associative caches we must add a multiplexer to select between the blocks in the set depending on the tag

CPU time = IC×CPIexecution +Memory stall clock cycles -Instruction × Clock cycle time


match. Since the speed of the CPU is tied directly to the speed of a cache hit, assume the CPU clock cycle time must be stretched 1.10 times to accommodate the selection multiplexer of the set-associative cache. To the first approximation, the cache miss penalty is 70 ns for either cache organization. (In practice it must be rounded up or down to an integer number of clock cycles.) First, calculate the average memory access time, and then CPU performance. Assume the hit time is one clock cycle. Assume that the miss rate of a direct-mapped 64-KB cache is 1.4%, and the miss rate for a two-way set-associative cache of the same size is 1.0%.

A N S W E R Average memory access time is

Average memory access time = Hit time + Miss rate × Miss penalty

FIGURE 5.8 A two-way set-associative version of the 8-KB cache of Figure 5.5, showing the extra multiplexer in the path. Unlike the prior figure, the data portion of the cache is drawn more realistically, with the two leftmost bits of the block offset combined with the index to address the desired 64-bit word in memory, which is then sent to the CPU.


Thus, the time for each organization is

Average memory access time_1-way = 2.0 + (0.014 × 70) = 2.98 ns
Average memory access time_2-way = 2.0 × 1.10 + (0.010 × 70) = 2.90 ns

The average memory access time is better for the two-way set-associative cache.

CPU performance is

CPU time = IC × (CPI_execution + Misses/Instruction × Miss penalty) × Clock cycle time
         = IC × (CPI_execution × Clock cycle time + Memory accesses/Instruction × Miss rate × Miss penalty × Clock cycle time)

Substituting 70 ns for (Miss penalty × Clock cycle time), the performance of each cache organization is

CPU time_1-way = IC × (2.0 × 2.0 + (1.3 × 0.014 × 70)) = 5.27 × IC ns
CPU time_2-way = IC × (2.0 × 2.0 × 1.10 + (1.3 × 0.010 × 70)) = 5.31 × IC ns

and relative performance is

CPU time_2-way / CPU time_1-way = 5.31 / 5.27 ≈ 1.01

In contrast to the results of the average memory access time comparison, the direct-mapped cache leads to slightly better average performance because the clock cycle is stretched for all instructions in the two-way case, even if there are fewer misses. Since CPU time is our bottom-line evaluation, and since direct mapped is simpler to build, the preferred cache is direct mapped in this example.
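The tension in this example (fewer misses versus a clock stretched for every instruction) is easy to recompute. The sketch below restates the numbers above and is not a model of either cache.

```python
CLOCK = 2e-9                 # 2-ns base clock cycle
MISS_PENALTY_TIME = 70e-9    # 70 ns for either organization
CPI_EXEC = 2.0
REFS_PER_INSTR = 1.3

def metrics(miss_rate, clock_stretch):
    clock = CLOCK * clock_stretch
    amat = clock + miss_rate * MISS_PENALTY_TIME                 # hit time is one (stretched) cycle
    cpu_time_per_instr = (CPI_EXEC * clock +
                          REFS_PER_INSTR * miss_rate * MISS_PENALTY_TIME)
    return amat, cpu_time_per_instr

amat_1way, time_1way = metrics(0.014, 1.00)   # direct mapped
amat_2way, time_2way = metrics(0.010, 1.10)   # two-way set associative with a stretched clock

print(amat_1way, amat_2way)     # ≈ 2.98 ns vs 2.90 ns: two-way wins on average memory access time
print(time_1way, time_2way)     # ≈ 5.27 ns vs 5.31 ns per instruction: direct mapped wins on CPU time
```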

Improving Cache Performance

The increasing gap between CPU and main memory speeds shown in Figure 5.1 has attracted the attention of many architects. A bibliographic search for the years 1989–95 revealed more than 1600 research papers on the subject of caches. Your authors' job was to survey all 1600 papers, decide what is and is not worthwhile, translate the results into a common terminology, reduce the results to their essence, write in an intriguing fashion, and provide just the right amount of detail! Fortunately, the average memory access time formula gave us a framework to present cache optimizations as well as the techniques for improving caches:

Average memory access time = Hit time + Miss rate × Miss penalty



Hence we organize 15 cache optimizations into three categories:

■ Reducing the miss rate (Section 5.3)

■ Reducing the miss penalty (Section 5.4)

■ Reducing the time to hit in the cache (Section 5.5)

Figure 5.29 on page 427 concludes with a summary of the implementation complexity and the performance benefits of the 15 techniques presented.

5.3 Reducing Cache Misses

Most cache research has concentrated on reducing the miss rate, so that is where we start our exploration. To gain better insights into the causes of misses, we start with a model that sorts all misses into three simple categories:

■ Compulsory—The very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses.

■ Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because of blocks being discarded and later retrieved.

■ Conflict—If the block placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses. (A small simulation sketch after this list shows one way to separate misses into these categories.)
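One common way to measure the three C's is to replay the same address trace through three models: an infinite cache, whose misses are the compulsory ones; a fully associative cache of the real capacity with LRU replacement, whose additional misses are capacity misses; and the real organization, whose remaining misses are conflicts. The sketch below is a minimal version of that bookkeeping, with a toy trace and illustrative cache parameters; it is not tied to any particular machine or to the measurement setup of Figure 5.9.

```python
from collections import OrderedDict

def misses(trace, num_blocks, assoc, block_size=32):
    """Count misses for an LRU set-associative cache (assoc=None means fully associative)."""
    if assoc is None:
        assoc = num_blocks
    num_sets = num_blocks // assoc
    sets = [OrderedDict() for _ in range(num_sets)]
    miss_count = 0
    for addr in trace:
        block = addr // block_size
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)            # LRU update on a hit
        else:
            miss_count += 1
            if len(s) == assoc:
                s.popitem(last=False)       # evict the least recently used block
            s[block] = True
    return miss_count

def three_cs(trace, num_blocks, assoc, block_size=32):
    """Split the misses of a cache organization into compulsory, capacity, and conflict."""
    compulsory = len({addr // block_size for addr in trace})   # first reference to each block
    fully_assoc = misses(trace, num_blocks, None, block_size)  # same capacity, no conflicts
    actual = misses(trace, num_blocks, assoc, block_size)      # the organization under study
    return compulsory, fully_assoc - compulsory, actual - fully_assoc

# Toy trace: two blocks that collide in a direct-mapped cache but easily fit in its capacity.
trace = [0, 1024, 0, 1024, 0, 1024]
print(three_cs(trace, num_blocks=32, assoc=1))   # (2, 0, 4): the extra misses are all conflicts
```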

Figure 5.9 shows the relative frequency of cache misses, broken down by the "three C's." Figure 5.10 presents the same data graphically. The top graph shows absolute miss rates; the bottom graph plots percentage of all the misses by type of miss as a function of cache size. To show the benefit of associativity, conflict misses are divided into misses caused by each decrease in associativity. Here are the four divisions:

■ Eight-way—conflict misses due to going from fully associative (no conflicts) to eight-way associative

■ Four-way—conflict misses due to going from eight-way associative to four-way associative

■ Two-way—conflict misses due to going from four-way associative to two-way associative

■ One-way—conflict misses due to going from two-way associative to one-way associative (direct mapped)


FIGURE 5.9 Total miss rate for each size cache and percentage of each according to the "three C's." (The table gives, for each cache size and degree of associativity, the total miss rate and the miss rate components as relative percentages, which sum to 100% of the total miss rate.) Compulsory misses are independent of cache size, while capacity misses decrease as capacity increases, and conflict misses decrease as associativity increases. Gee et al. [1993] calculated the average D-cache miss rate for the SPEC92 benchmark suite with 32-byte blocks and LRU replacement on a DECstation 5000. Figure 5.10 shows the same information graphically. The compulsory rate was calculated as the miss rate of a fully associative 1-MB cache. Note that the 2:1 cache rule of thumb (inside front cover) is supported by the statistics in this table: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.


As we can see from the figures, the compulsory miss rate of the SPEC92 programs is very small, as it is for many long-running programs.

Having identified the three C's, what can a computer designer do about them? Conceptually, conflicts are the easiest: Fully associative placement avoids all conflict misses. Full associativity is expensive in hardware, however, and may slow the processor clock rate (see the example above), leading to lower overall performance.

There is little to be done about capacity except to enlarge the cache. If the upper-level memory is much smaller than what is needed for a program, and a

FIGURE 5.10 Total miss rate (top) and distribution of miss rate (bottom) for each size cache according to three C's for the data in Figure 5.9. The top diagram is the actual D-cache miss rates, while the bottom diagram is scaled to the direct-mapped miss ratios.



significant percentage of the time is spent moving data between two levels in the hierarchy, the memory hierarchy is said to thrash. Because so many replacements are required, thrashing means the machine runs close to the speed of the lower-level memory, or maybe even slower because of the miss overhead.

Another approach to improving the three C's is to make blocks larger to reduce the number of compulsory misses, but, as we shall see, large blocks can increase other kinds of misses.

The three C's give insight into the cause of misses, but this simple model has its limits; it gives you insight into average behavior but may not explain an individual miss. For example, changing cache size changes conflict misses as well as capacity misses, since a larger cache spreads out references to more blocks. Thus, a miss might move from a capacity miss to a conflict miss as cache size changes. Note that the three C's also ignore replacement policy, since it is difficult to model and since, in general, it is less significant. In specific circumstances the replacement policy can actually lead to anomalous behavior, such as poorer miss rates for larger associativity, which is contradictory to the three C's model. Alas, many of the techniques that reduce miss rates also increase hit time or miss penalty. The desirability of reducing miss rates using the seven techniques presented in the rest of this section must be balanced against the goal of making the whole system fast. This first example shows the importance of a balanced perspective.

First Miss Rate Reduction Technique: Larger Block Size

The simplest way to reduce miss rate is to increase the block size. Figure 5.11 shows the trade-off of block size versus miss rate for a set of programs and cache sizes. Larger block sizes will reduce compulsory misses. This reduction occurs because the principle of locality has two components: temporal locality and spatial locality. Larger blocks take advantage of spatial locality.

At the same time, larger blocks increase the miss penalty. Since they reduce the number of blocks in the cache, larger blocks may increase conflict misses and even capacity misses if the cache is small. Clearly there is little reason to increase the block size to such a size that it increases the miss rate, but there is also no benefit to reducing miss rate if it increases the average memory access time; the increase in miss penalty may outweigh the decrease in miss rate.


E X A M P L E   Figure 5.12 shows the actual miss rates plotted in Figure 5.11. Assume the memory system takes 40 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles. Thus, it can supply 16 bytes in 42 clock cycles, 32 bytes in 44 clock cycles, and so on. Which block size has the minimum average memory access time for each cache size in Figure 5.12?

FIGURE 5.11 Miss rate versus block size for five different-sized caches. Each line represents a cache of different size. Figure 5.12 shows the data used to plot these lines. This graph is based on the same measurements found in Figure 5.10.

FIGURE 5.12 Actual miss rates for the block and cache sizes plotted in Figure 5.11. Note that for the smallest cache, the largest blocks have a higher miss rate than 32-byte blocks; in this example, the cache would have to be 256 KB in order for a 256-byte block to decrease misses.



A N S W E R Average memory access time is

Average memory access time = Hit time + Miss rate × Miss penalty

If we assume the hit time is one clock cycle independent of block size, then the access time for a 16-byte block in a 1-KB cache is

Average memory access time = 1 + (15.05% × 42) = 7.321 clock cycles

and for a 256-byte block in a 256-KB cache the average memory access time is

Average memory access time = 1 + (0.49% × 72) = 1.353 clock cycles

Figure 5.13 shows the average memory access time for all block and cache sizes between those two extremes. The boldfaced entries show the fastest block size for a given cache size: 32 bytes for 1-KB, 4-KB, and 16-KB caches and 64 bytes for the larger caches. These sizes are, in fact, popular block sizes for processor caches today.
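Because the miss penalty in this example is a simple linear function of block size (40 cycles of overhead plus 2 cycles per 16 bytes), the entries of Figure 5.13 can be recomputed in a few lines. Only the two miss rates quoted above are used; the rest would come from Figure 5.12.

```python
def miss_penalty(block_size_bytes, overhead=40, cycles_per_16_bytes=2):
    return overhead + cycles_per_16_bytes * (block_size_bytes // 16)

def amat(miss_rate, block_size_bytes, hit_time=1):
    return hit_time + miss_rate * miss_penalty(block_size_bytes)

print(miss_penalty(16), miss_penalty(32), miss_penalty(256))   # 42, 44, 72 clock cycles
print(amat(0.1505, 16))    # 16-byte blocks in a 1-KB cache:     ≈ 7.321 clock cycles
print(amat(0.0049, 256))   # 256-byte blocks in a 256-KB cache:  ≈ 1.353 clock cycles
```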

As in all of these techniques, the cache designer is trying to minimize both the miss rate and the miss penalty. The selection of block size depends on both the latency and bandwidth of the lower-level memory: high latency and high bandwidth encourage large block size since the cache gets many more bytes per miss for a small increase in miss penalty. Conversely, low latency and low bandwidth encourage smaller block sizes since there is little time saved from a larger block—twice the miss penalty of a small block may be close to the penalty of a block twice the size—and the larger number of small blocks may reduce conflict misses.

After seeing the positive and negative impact of larger block size on compulsory and capacity misses, we next look at the potential of higher associativity to reduce conflict misses.


Second Miss Rate Reduction Technique: Higher Associativity

Figures 5.9 and 5.10 above show how miss rates improve with higher associativity. There are two general rules of thumb that can be gleaned from these figures. The first is that eight-way set associative is for practical purposes as effective in reducing misses for these sized caches as fully associative. The second observation, called the 2:1 cache rule of thumb and found on the front inside cover, is that a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.

Like many of these examples, improving one aspect of the average memory access time comes at the expense of another. Increasing block size reduced miss rate while increasing miss penalty, and greater associativity can come at the cost of increased hit time. Hill [1988] found about a 10% difference in hit times for TTL or ECL board-level caches and a 2% difference for custom CMOS caches for direct-mapped caches versus two-way set-associative caches. Hence the pressure of a fast processor clock cycle encourages simple cache designs, but the increasing miss penalty rewards associativity, as the following example suggests.

ac-E X A M P L ac-E Assume that going to higher associativity would increase the clock cycle

as suggested below:

Clock cycle time2-way = 1.10 × Clock cycle time1-wayClock cycle time4-way = 1.12 × Clock cycle time1-wayClock cycle time8-way = 1.14 × Clock cycle time1-way

Assume that the hit time is 1 clock cycle, that the miss penalty for the direct-mapped case is 50 clock cycles, and that the miss penalty need not

be rounded to an integral number of clock cycles Using Figure 5.9 for miss rates, for which cache sizes are each of these three statements true?

Average memory access time8-way < Average memory access time4-wayAverage memory access time4-way < Average memory access time2-wayAverage memory access time2-way < Average memory access time1-way

A N S W E R   Average memory access time for each associativity is

Average memory access time_8-way = Hit time_8-way + Miss rate_8-way × Miss penalty_1-way = 1.14 + Miss rate_8-way × 50
Average memory access time_4-way = 1.12 + Miss rate_4-way × 50
Average memory access time_2-way = 1.10 + Miss rate_2-way × 50
Average memory access time_1-way = 1.00 + Miss rate_1-way × 50


The miss penalty is the same time in each case, so we leave it as 50 clock cycles. For example, the average memory access time for a 1-KB direct-mapped cache is

Average memory access time_1-way = 1.00 + (0.133 × 50) = 7.65

and the time for a 128-KB, eight-way set-associative cache is

Average memory access time_8-way = 1.14 + (0.006 × 50) = 1.44

Using these formulas and the miss rates from Figure 5.9, Figure 5.14 shows the average memory access time for each cache and associativity. The figure shows that the formulas in this example hold for caches less than or equal to 16 KB. Starting with 32 KB, the average memory access time of four-way is less than two-way, and two-way is less than one-way, but eight-way cache is not less than four-way.

Note that we did not account for the slower clock rate on the rest of the program in this example, thereby understating the advantage of the direct-mapped cache.
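The comparison reduces to whether the drop in miss rate pays for the stretched clock. The helper below restates that calculation; the two miss rates in the usage lines are the ones quoted above, and the others would come from Figure 5.9.

```python
CLOCK_STRETCH = {1: 1.00, 2: 1.10, 4: 1.12, 8: 1.14}
MISS_PENALTY = 50   # clock cycles, not rounded to an integer

def amat(associativity, miss_rate):
    # Hit time is one (possibly stretched) clock cycle.
    return CLOCK_STRETCH[associativity] + miss_rate * MISS_PENALTY

print(amat(1, 0.133))   # 1-KB direct mapped:            ≈ 7.65
print(amat(8, 0.006))   # 128-KB eight-way associative:  ≈ 1.44
```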

Third Miss Rate Reduction Technique: Victim Caches

Larger block size and higher associativity are two classic techniques to reduce miss rates that have been considered by architects since the earliest caches. Starting with this subsection, we see more recent inventions to reduce miss rate without affecting the clock cycle time or the miss penalty.


One solution that reduces conflict misses without impairing clock rate is to add a small, fully associative cache between a cache and its refill path. Figure 5.15 shows the organization. This victim cache contains only blocks that are discarded from a cache because of a miss—"victims"—and are checked on a miss to see if they have the desired data before going to the next lower-level memory. If it is found there, the victim block and cache block are swapped. Jouppi [1990] found that victim caches of one to five entries are effective at reducing conflict misses, especially for small, direct-mapped data caches. Depending on the program, a four-entry victim cache removed 20% to 95% of the conflict misses in a 4-KB direct-mapped data cache.
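The swap on a victim-cache hit can be sketched as a small functional model of the organization in Figure 5.15: a direct-mapped cache backed by a tiny fully associative victim cache. The sizes, the LRU ordering of victims, and the string return values are illustrative assumptions, not a description of Jouppi's hardware.

```python
from collections import OrderedDict

class DirectMappedWithVictim:
    def __init__(self, num_blocks=128, block_size=32, victim_entries=4):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.lines = [None] * num_blocks               # direct-mapped cache: one tag per index
        self.victim = OrderedDict()                    # small fully associative, oldest first
        self.victim_entries = victim_entries

    def access(self, addr):
        """Return 'hit', 'victim hit', or 'miss'."""
        block = addr // self.block_size
        index = block % self.num_blocks
        if self.lines[index] == block:
            return "hit"
        if block in self.victim:                       # check the victims on a miss
            evicted = self.lines[index]
            self.lines[index] = block                  # swap victim block and cache block
            del self.victim[block]
            if evicted is not None:
                self._insert_victim(evicted)
            return "victim hit"
        evicted = self.lines[index]                    # real miss: refill from the lower level
        self.lines[index] = block
        if evicted is not None:
            self._insert_victim(evicted)               # the discarded block becomes a victim
        return "miss"

    def _insert_victim(self, block):
        if len(self.victim) == self.victim_entries:
            self.victim.popitem(last=False)            # drop the oldest victim
        self.victim[block] = True

cache = DirectMappedWithVictim()
conflicting = [0, 128 * 32, 0, 128 * 32]               # two blocks that map to the same index
print([cache.access(a) for a in conflicting])          # ['miss', 'miss', 'victim hit', 'victim hit']
```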

Fourth Miss Rate Reduction Technique: Pseudo-Associative Caches

Another approach to getting the miss rate of set-associative caches and the hit speed of direct mapped is called pseudo-associative or column associative. A cache access proceeds just as in the direct-mapped cache for a hit. On a miss, however, before going to the next lower level of the memory hierarchy, another

FIGURE 5.15 Placement of victim cache in the memory hierarchy.
