
COMPILER DRIVEN MEMORY SYSTEM OPTIMIZATION USING SPECULATIVE EXECUTION

HARIHARAN SANDANAGOBALANE

(B.Tech., Pondicherry Engineering College)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE

APRIL 2004


Profound and sincere thanks are due to my supervisor A/P Wong Weng Fai for his excellent guidance, constant support and encouragement. Working with him has been a very pleasing experience, both personally and intellectually. I appreciate his help and support, which were on many occasions unexpected, but certainly very welcome. He has served as a good role model for a supervisor.

I would like to thank Dr Rodric Rabbah from the Massachusetts Institute of Technology for his valuable comments and guidance throughout the project.

I thank the members of the embedded systems research laboratory and my friends for giving me good company in Singapore. I would also like to thank David Chew, whose LaTeX style files for NUS theses made LaTeXing a breeze.

Last, but not the least, I thank my family for their understanding and support.

Hariharan
April 2004


Contents

1.1 Motivation 1
1.2 Research Goals 4
1.3 Technique Overview 5
1.4 Thesis Overview 6

2 Related Work 7
2.1 Hardware Techniques 9
2.2 Software Techniques 12
2.2.1 Preliminary Work 12
2.2.2 Prefetching methods for pointer intensive applications 15
2.2.3 Thread Based techniques 17
2.3 Application Restructuring 20
2.4 Limitations 20

3 LDG and PEPSE 22
3.1 Load Dependence Graph 22
3.1.1 Delinquent Load Selection 23
3.1.2 LDG Creation 25
3.2 PEPSE 28
3.2.1 Optimizations 30
3.2.2 Pointer Applications 33

4 PEPSE Implementation 37
4.1 Open Research Compiler 37
4.2 Profiler Implementation 40
4.3 PEPSE Implementation 44

5 Evaluation Framework and Results 47
5.1 Evaluation Framework 47
5.2 Results 51

6 Conclusions 56
6.1 Summary of the thesis 56
6.2 Future Research Directions 57


Summary

Wide-issue microprocessors are capable of remarkable execution rates, but they generally achieve only a fraction of their peak instruction throughput on real programs. This discrepancy is due to performance degrading events, largely branch mispredictions and cache misses. In this work we have addressed the performance degradation due to the latter through the use of Program Embedded Precomputation using Speculative Execution (PEPSE).

Our work on program embedded precomputation using speculative execution (PEPSE) aims at providing a unified framework to mitigate the ever-widening gap between the data processing rate of the processor and the data delivery rate of the memory subsystem. Towards this, we introduce the Load Dependence Graph (LDG), which is a sub-graph of the traditional Program Dependence Graph (PDG) that computes the address of a load instruction. The LDG affords a unique characterization of the program structure and its memory reference patterns and facilitates the discovery of appropriate memory management techniques.

In the context of data prefetching, we illustrate how PEPSE can accurately predict and effectively prefetch future memory references with negligible overhead for both regular array-based applications and irregular pointer-based applications. We narrow down the scope of the optimizations by limiting our processing only to the delinquent loads in a program, identified with the help of a profiler. LDGs are created only for those delinquent loads. Subsequently, speculative versions of the LDG operations are statically scheduled along with a prefetch instruction for the computed address, such that these instructions execute and prefetch the value before the actual load is encountered, resulting in either the elimination or the reduction of the processor stall cycles due to the load instruction. Our prototype implementation of the optimizations using LDGs within the Open Research Compiler (ORC), an open source compiler for the Itanium Processor Family (IPF), delivered encouraging results. For a 900 MHz Itanium 2 server, we could achieve speedups ranging from 1.05 to 2.14 for several benchmarks from the SPEC and OLDEN suites.

List of Tables

3.1 Delinquent Load Statistics 24
5.1 Benchmark Evaluation Suite 49
5.2 CPU user time as a function of the number of embedded LDGs 53
5.3 The user CPU time and total execution cycles for each benchmark 54
5.4 The user CPU time and the dynamic number of operations for each benchmark 55


List of Figures

1.1 Performance Trends 2

2.1 DGP hardware 10

2.2 Prefetching based on Mowry’s Work 14

3.1 An LDG example 27

3.2 The scheduling algorithm 29

3.3 Induction Unrolling in arrays 32

3.4 Unrolling Example for pointers 35

3.5 Induction Unrolling for Pointer-chasing code 36

4.1 Structure of an Operation 39

4.2 Profiler Implementation 41

4.3 The structure of a dependence edge 45

5.1 Itanium 2 Results 52


Furthermore, exponential increases in processor speeds continue to widen the gap between the data consumption rate of the processor and the data delivery rate of the memory. High computation power becomes useless if it is not backed by a powerful memory system. Historically, processor performance increased at a rate of 35% per year until 1986, and 55% per year since then. On the other hand, the access time of DRAM has been improving at a mere 7% per year [11]. Figure 1.1 illustrates the performance disparity between processor and memory.


Explicitly parallel processors have features derived from both VLIW and superscalar architectures. They use large instruction words and issue multiple instructions per cycle. They continue to gain wider acceptance and play a significant role in various aspects of the computer industry, ranging from high end server platforms such as the Itanium Processor Family (IPF) [24], to digital signal processing engines such as the TI C6x processors [12], to custom computing systems such as the Trimedia VLIW products [27] and the HP-STMicroelectronics Lx processors [23]. These EPIC processors expose the architecture to the compiler through extensions to the Instruction Set Architecture (ISA). The extensions enable the compiler to communicate with the hardware through hints attached to the instructions or through special instructions, and hence allow it to manage the data movement across the memory hierarchy better.

During compilation, it is important to have the ability to predict future memory accesses and access patterns so as to utilize the EPIC features to ameliorate the difference in performance between the processor and the memory system. This foresight would enable the compiler to make more informed decisions about the placement and evacuation of data in caches, which could be communicated to the hardware through the ISA. Towards this, a lot of hardware and software techniques have been proposed that prefetch the data ahead of its actual consumption, resulting in a significant performance improvement.

¹ Tertiary cache miss latency is the latency due to a memory access.

Another orthogonal line of research towards reducing the memory bottleneck problem is to improve data locality by reordering the execution of iterations. An important example of such a transformation is blocking [32, 31, 9]. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into faster levels of the memory hierarchy are reused. Other useful transformations include unimodular loop transforms like interchange, skewing and reversal [31]. These transformations complement blocking and hence can be used together with it to enhance the application's performance. Since these transformations improve the code's data locality, they not only reduce the effective memory access time but also reduce the memory bandwidth requirement. Since they aim at reducing capacity misses, they complement prefetching methods, which help reduce the cold misses that occur due to the first access to a data item. Hence, the two can be used together to achieve even better performance.
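To make the blocking transformation concrete, the sketch below tiles a matrix multiplication. The matrix dimension and the block size of 64 are assumed, cache-dependent parameters for illustration rather than values taken from this thesis.

#include <stddef.h>

#define N     1024      /* assumed matrix dimension               */
#define BLOCK 64        /* assumed tile size, tuned to the cache  */

/* Naive version: each column of B is streamed N times, so for large N
 * it is evicted between uses and almost every pass misses in the cache. */
void matmul_naive(double A[N][N], double B[N][N], double C[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Blocked version: the loops are tiled so that a BLOCK x BLOCK
 * submatrix of B stays resident in the cache and is reused many
 * times before being displaced, converting capacity misses into hits. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
{
    for (size_t jj = 0; jj < N; jj += BLOCK)
        for (size_t kk = 0; kk < N; kk += BLOCK)
            for (size_t i = 0; i < N; i++)
                for (size_t j = jj; j < jj + BLOCK; j++) {
                    double sum = C[i][j];
                    for (size_t k = kk; k < kk + BLOCK; k++)
                        sum += A[i][k] * B[k][j];
                    C[i][j] = sum;
                }
}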

1.2 Research Goals

The objective of our research is to provide a unified framework for alleviating the memory bandwidth bottleneck using static compilation techniques. The research goals that we set out for our work are:

1. To devise an algorithm that would be effective for both array based and pointer based programs.

2. The algorithm should only utilize architectural features that are commonly available and should not require drastic changes to the underlying architecture.


We focus our attention only on loads identified as delinquent by the profiler. LDGs are created for these instructions by starting from them and moving up, including any instruction that contributes to their address calculation. Ideally, this LDG creation is stopped when it has moved a distance δ + α from the delinquent load, where δ corresponds to the average latency of the load operation and α to the schedule length of the LDG itself. But other constraints, such as explosion of the LDG length and the absence of enough free slots, might stop it earlier. Program Embedded Precomputation via Speculative Execution (PEPSE) inserts a speculative version of the LDG instructions statically in the program, along with a prefetch for the load, in the empty² slots as much as possible. These instructions would execute in

² NOPs are considered to be empty slots.
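PEPSE itself operates on scheduled machine code, but its effect can be pictured at the source level with the sketch below: a speculative copy of the address computation (the LDG) runs a few iterations ahead of the delinquent load and feeds a prefetch. The prefetch distance DIST, the example loop, and the use of GCC's __builtin_prefetch as a stand-in for the architecture's non-binding prefetch instruction are all illustrative assumptions; the bounds check takes the place of the speculative, non-faulting loads the real scheme relies on.

/* Original loop: the indirect load data[idx[i]] is the delinquent load. */
double sum_indexed(const double *data, const int *idx, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += data[idx[i]];            /* frequent cache misses here */
    return s;
}

#define DIST 8   /* assumed prefetch distance, roughly delta + alpha */

/* Sketch of the PEPSE idea: a speculative copy of the address
 * computation (load idx[i+DIST], scale, add base) runs DIST
 * iterations ahead and feeds a prefetch, so the value is already
 * in the cache when the real load executes.                       */
double sum_indexed_pepse(const double *data, const int *idx, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n) {                       /* stands in for non-faulting speculation */
            int spec_idx = idx[i + DIST];         /* speculative copy of the LDG            */
            __builtin_prefetch(&data[spec_idx]);  /* non-binding prefetch                   */
        }
        s += data[idx[i]];                        /* original delinquent load               */
    }
    return s;
}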


We implemented a prototype of our optimizations in the Open Research Compiler and obtained promising results. Our proposed methodology relies heavily on speculation, a concept that is widely used to improve ILP and overcome long branch delays.

1.4 Thesis Overview

Chapter 2 gives a survey of the different techniques that have been proposed to address the memory bandwidth problem and shows how our technique differs from them. Chapter 3 describes the Load Dependence Graph and details how LDGs are created and embedded in the application using PEPSE. Chapter 4 explains the implementation of the PEPSE scheme in the Open Research Compiler. Chapter 5 discusses the experimental setup and the performance results obtained using PEPSE on an Itanium 2 machine. Chapter 6 concludes the thesis and gives pointers for future directions of research.


Chapter 2

Related Work

The speed of computer systems has been increasing steadily through the years. This is partly through the advancement of technology and partly because of certain properties exhibited by programs. The most important program property that is exploited is the principle of locality: programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of the code. An implication of locality is that we can predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past. The principle of locality also applies to data accesses, though not as strongly as to code accesses. Two different types of locality have been observed [11]. Temporal locality states that recently accessed items are likely to be accessed in the near future. This happens, say, when every iteration of an outer loop accesses the same set of items in the inner loop. Spatial locality says that items whose addresses are near one another tend to be referenced close together in time. This happens when a loop makes sequential accesses to data items placed contiguously to each other.
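A minimal loop nest makes the two kinds of locality concrete (the array shapes are invented for illustration):

#define ROWS 3
#define COLS 100

double illustrate_locality(double A[ROWS][COLS], double v[COLS])
{
    double sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            /* Spatial locality: A[i][j], A[i][j+1], ... lie in the same
             * cache lines, so a sequential walk misses only once per line.
             * Temporal locality: every outer iteration re-reads the same
             * v[j] values, so after i == 0 they are likely cache hits.    */
            sum += A[i][j] * v[j];
    return sum;
}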

To exploit the locality in programs, a small cache memory was added to the processor. An access to the cache memory is an order of magnitude faster than a memory access, which is generally off the processor chip. But still, the addition of cache memory does not serve as a panacea for the memory wall¹ problem. This is because not all data accesses hit the cache; the misses have to be served by the slower main memory, and the processor might have to be stalled till the data item becomes available.

There are three kinds of cache misses: conflict misses, compulsory misses and capacity misses [13]. Conflict misses are those that would be avoided by having a fully associative cache with LRU replacement. They occur because two data items conflict for the same cache line and hence the earlier one needs to be evicted to give way to the latter, even though it may be accessed again soon. Capacity misses occur when the cache is too small to hold the data between references. Compulsory misses occur in every cache organization because they represent the first access to a data item. Past research on conflict misses has reduced them largely without resorting to fully associative caches, through the use of set-associative caches. Set-associative caches provide a trade-off between cache misses on the one side and access time and energy on the other.

To effectively reduce capacity misses, one has to either enlarge the cache or rearrange the program so that the working set fits in the cache, both of which have been done to a large extent. Nowadays, the amount of on-chip cache is quite large and we have a hierarchy of caches so that the large caches do not increase the average memory access time. Tiling or blocking [9] and loop interchange are commonly used compiler techniques to rearrange the memory accesses in a program to match the cache structure. But some form of prefetching is required to minimize compulsory misses, also called cold misses. There are various hardware

¹ The problem of the memory system not being fast enough to serve the processor is commonly called the memory wall problem.


2.1 Hardware Techniques

… to the address of the next datum in the address space, assuming it will be accessed in the near future. This method has the advantage of allowing sequential array accesses to be fetched with only one miss for the first item. Though both of these methods reduce the miss rate in a few circumstances, they cannot be disabled in other circumstances since they are implemented in hardware. For example, in the case of an array access in a loop with a high step size, or pointer chasing code with arbitrary memory accesses, both long cache lines and hardware prefetching would prefetch values that would not be used in the future. In such cases, they increase the data traffic between the cache and the main memory and also pollute the cache with unwanted data.
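As an illustration of the problem, consider a column walk over a row-major array (sizes assumed): every access lands on a fresh cache line, so long lines and next-line prefetching drag in neighbouring data the loop never touches.

#define ROWS 1024
#define COLS 1024

/* Column traversal of a row-major array: consecutive accesses are
 * COLS * sizeof(double) bytes apart, so each access touches a new
 * cache line and uses only 8 of its bytes.  Longer cache lines or
 * next-line hardware prefetching fetch data from these rows that
 * the loop never reads, wasting bandwidth and displacing useful lines. */
double column_sum(double A[ROWS][COLS], int col)
{
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)
        s += A[i][col];
    return s;
}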

In 1991, Baer and Chen [4] proposed a scheme that uses a history buffer to detect strides. In their scheme, a “look ahead PC” speculatively walks through the program, ahead of the normal PC, using branch prediction. The processor is extended with a Reference Prediction Table (RPT) which is used to keep track of previous reference addresses and associated strides. When the look ahead PC hits a load and finds a matching entry in this table, it issues a prefetch. They evaluated the scheme in a memory system with a 30-cycle miss latency and found good results.
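A rough software model of one Reference Prediction Table entry conveys the mechanism; the field names and the single-confirmation policy below are illustrative simplifications, not the exact hardware of [4].

#include <stdint.h>

/* One Reference Prediction Table entry, indexed by the load's PC. */
struct rpt_entry {
    uint64_t pc;         /* instruction address of the load        */
    uint64_t last_addr;  /* address of its previous data access    */
    int64_t  stride;     /* last observed address difference       */
    int      confirmed;  /* same stride seen twice in a row?       */
};

/* Called (conceptually by the lookahead PC) each time the load at `pc`
 * accesses `addr`.  Returns a predicted prefetch address, or 0 when no
 * stable stride has been established yet.                              */
uint64_t rpt_update(struct rpt_entry *e, uint64_t pc, uint64_t addr)
{
    if (e->pc != pc) {                  /* (re)allocate the entry       */
        e->pc = pc;
        e->last_addr = addr;
        e->stride = 0;
        e->confirmed = 0;
        return 0;
    }
    int64_t new_stride = (int64_t)(addr - e->last_addr);
    e->confirmed = (new_stride == e->stride);
    e->stride = new_stride;
    e->last_addr = addr;
    return e->confirmed ? addr + (uint64_t)new_stride : 0;
}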

In the context of multiprocessors, Multiple-Context Processors [30] were introduced, where each processor maintains multiple processes as multiple contexts and switches between them when there is a long latency load in one context. In this manner the memory latency of one context can be overlapped with the computation of another context. The interval between long latency operations is becoming fairly large, allowing just a handful of hardware contexts to hide most of the latency. But this method has the disadvantage of context switch overhead and the high processor complexity resulting from the inclusion of contexts in it. Also, since the different contexts share a single processor cache, they can interfere with each other, both constructively and, more often, destructively.

Figure 2.1: DGP hardware (pipeline front end — fetch, predecode, decode, execute, writeback, commit, ROB — extended with a Dependence Graph Generator, a Dependence Graph Buffer, a Precomputation Engine and a Scratch Register File that issue data prefetches)

More recently, Annavaram et al. [22] have introduced an extension to the processor to pre-compute the load address and issue a prefetch. Figure 2.1 shows the additional hardware required for this implementation. The fundamental idea of this method is to pre-compute the address of a load available in the Instruction Fetch Queue (IFQ), instead of predicting it, and then issue a prefetch. The IFQ is extended (with extra columns) to help dependence graph creation, and the decode stage is also modified to fill in those extra columns. The dependence graph of a load/store instruction, I, in the IFQ is the set of all unexecuted instructions waiting in the IFQ that contribute to the address calculation of I. The Dependence Graph Generator generates the graph based on the dependence information available in the OP1 and OP2 columns of the IFQ, which contain pointers to the instructions that produce the values for operands one and two respectively.

The processor is augmented with a Precomputation Engine (PE) which is used to execute the dependence graphs stored in the dependence graph buffer. The PE executes instructions speculatively. The results generated by the PE are used only for prefetching data, and in particular, they never update the architected state of the main processor. Note that the dependence graph generation does not remove any instruction from the IFQ²: consequently, all precomputed instructions will be executed in the normal manner by the main processor pipeline. The precomputation engine has a scratch register file (SRF) to store the live results of precomputed instructions. The PE executes at most one instruction every cycle, and hence the SRF needs only two read ports and one write port. If the OP field of an operand is not null, it would have been generated by an already executed instruction and hence be available in the SRF. If it is null, the PE obtains the corresponding operand value by accessing the processor's register file and the reorder buffer³ for forwarding uncommitted register values.

² It just makes speculative copies.

³ The processor's register file and ROB each need two additional read ports for PE accesses.

In their work, Roth et al. [3] also use an extra computation engine⁴ to run ahead of the processor, executing only the load instructions that are required to iterate through the Linked Data Structure. The dependence relationship between loads that produce addresses and loads that consume these addresses is exploited by constructing a compact representation of them and their traversal. To achieve prefetching, the prefetch engine speculatively traverses this representation ahead of the executing program. Since the prefetch engine executes only the loads that are required to traverse the data structure, it initiates accesses faster, producing the desired prefetching effect.

⁴ They call it the prefetch engine.

Though some of the hardware techniques are effective in certain circumstances, they are not flexible. It would be hard to adapt a hardware technique to suit a given program. In the next section we review some of the software techniques for prefetching.

2.2 Software Techniques

Software prefetching was introduced by Callahan et al. [8] and since then several prefetching algorithms [28, 33, 20] have been proposed and implemented. Software prefetching needs hardware support in the form of a special prefetch instruction, which issues a non-blocking prefetch. The cache needs to be lockup-free [18], that is, the cache must allow multiple outstanding misses. Otherwise, an outstanding prefetch instruction might block a load instruction from the original program, degrading its performance. Also, this instruction should not affect the correctness of the program, viz., the insertion of a prefetch should not raise exceptions or produce incorrect results if the speculative address is wrong. This hardware support is available in almost all processors nowadays, since, even with simple algorithms [5], prefetching is effective in overlapping the memory latency with other useful computation. The software techniques introduced in this section are compiler algorithms which insert prefetch instructions into the original program to avoid processor stalls due to memory accesses.
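The sketch below shows what such a compiler-inserted prefetch boils down to for a linked list; GCC's __builtin_prefetch is used here only as a stand-in for the architecture's prefetch instruction. The key point is that the hint neither blocks nor faults, even when the prefetched address (here, a possibly NULL next pointer) is not one the program could safely load from.

struct node { struct node *next; int payload; };

/* Prefetching the next node one step ahead.  The prefetch is
 * non-binding: if p->next is NULL, the hint is simply dropped
 * instead of raising an exception, which is exactly the hardware
 * support that software prefetching relies on.                    */
long list_sum(struct node *head)
{
    long s = 0;
    for (struct node *p = head; p != NULL; p = p->next) {
        __builtin_prefetch(p->next);   /* may be NULL: no fault, no stall */
        s += p->payload;
    }
    return s;
}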

The first successful prefetching algorithm, which is the one most commonly implemented in compilers today, was devised by Mowry [28]. The domain of this algorithm is the set of array accesses whose indices are affine functions of loop indices. A substantial fraction of the data references in scientific code belong to this domain. There are three major steps in this prefetching algorithm:

1. For each reference, determine the accesses that are likely to be cache misses and therefore need to be prefetched.

2. Isolate the predicted cache miss instances through loop splitting. This avoids the overhead of adding conditional statements to the loop bodies or adding unnecessary prefetches.

3. Software pipeline the prefetches for all cache misses.

The first step determines those references that are likely to cause a cache miss. This locality analysis consists of discovering data reuses within a loop nest and determining whether the set of reuses would be exploited by a particular cache configuration. The reuse could be spatial, temporal or group reuse. In the example program of Figure 2.2a, there is spatial reuse in the access of A[i][j] if the cache line size is larger than an array element. There is also temporal reuse of B[j][0] in the outer loop, viz., every time around the outer loop the same elements of the B array are accessed. But whether this reuse turns into cache hits depends on the size of the cache and the iteration count of the inner loop. In this case, since the iteration count of the inner loop is small (100), this reuse would be converted into cache hits.


for (i = 0; i < 3; i++)
    for (j = 0; j < 100; j++)
        A[i][j] = B[j][0] + B[j+1][0];

a) Source Program

prefetch(&A[0][0]);
for (j = 0; j < 6; j += 2) {
    prefetch(&B[j+1][0]);
    prefetch(&B[j+2][0]);
    prefetch(&A[0][j+1]);
}
for (j = 0; j < 94; j += 2) {
    prefetch(&B[j+7][0]);
    prefetch(&B[j+8][0]);
    prefetch(&A[0][j+7]);
    A[0][j]   = B[j][0]   + B[j+1][0];
    A[0][j+1] = B[j+1][0] + B[j+2][0];
}
for (j = 94; j < 100; j += 2) {
    A[0][j]   = B[j][0]   + B[j+1][0];
    A[0][j+1] = B[j+1][0] + B[j+2][0];
}
for (i = 1; i < 3; i++) {
    prefetch(&A[i][0]);
    for (j = 0; j < 6; j += 2)
        prefetch(&A[i][j+2]);
    for (j = 0; j < 94; j += 2) {
        prefetch(&A[i][j+7]);
        A[i][j]   = B[j][0]   + B[j+1][0];
        A[i][j+1] = B[j+1][0] + B[j+2][0];
    }
    for (j = 94; j < 100; j += 2) {
        A[i][j]   = B[j][0]   + B[j+1][0];
        A[i][j+1] = B[j+1][0] + B[j+2][0];
    }
}

b) Program with prefetching inserted

Figure 2.2: Prefetching based on Mowry's Work


To accommodate this, we can decompose loops into different sections so that the predicates for all instances in the same section evaluate to the same value. This process is known as loop splitting. In general, a predicate i = 0 requires the first iteration of the loop to be peeled. The predicate (i mod n) = 0 requires the loop to be unrolled by a factor of n with only one prefetch. Peeling and unrolling can be applied recursively to handle predicates in nested loops. Figure 2.2b shows the result of applying these transformations to the loop nest of Figure 2.2a.
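The sketch below shows the two transformations on an invented loop: peeling removes an i = 0 prefetch predicate, and unrolling by four (assuming four array elements per cache line) removes an (i mod 4) = 0 predicate so that exactly one prefetch is issued per line. The prefetch distance DIST is an assumed parameter and __builtin_prefetch again stands in for the prefetch instruction.

#define N    1000
#define DIST 16                    /* assumed prefetch distance       */

/* Conceptually we want, inside the loop:
 *     if ((i % 4) == 0) prefetch(&a[i + DIST]);   spatial reuse in a[]
 *     if (i == 0)       prefetch(x);              temporal reuse of *x
 * Loop splitting removes both predicates.                             */
double splitting_example(const double *a, const double *x)
{
    double s = 0.0;

    /* Peeling handles the i == 0 predicate: x is prefetched once.    */
    __builtin_prefetch(x);

    /* Unrolling by 4 handles (i % 4) == 0: one prefetch covers the
     * cache line holding a[i..i+3] (4 doubles per line assumed).     */
    for (int i = 0; i < N; i += 4) {
        __builtin_prefetch(&a[i + DIST]);
        s += a[i]     * *x;
        s += a[i + 1] * *x;
        s += a[i + 2] * *x;
        s += a[i + 3] * *x;
    }
    return s;
}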

2.2.2 Prefetching methods for pointer intensive applications

One prefetching heuristic that works well for pointer based applications was introduced by Lipasti et al. [20]. In this scheme, a prefetch instruction is inserted at the call site of every function call with at least one pointer parameter. The basic premise of this heuristic is that the pointer arguments passed on procedure calls are highly likely to be dereferenced within the scope of the called procedure. In this work, they showed that with the insertion of just one or two prefetch instructions at each call site, performance can be improved by 5-7% for benchmarks with many call sites and short procedures, without significantly increasing the memory traffic. This works particularly well for C++ programs since, in the xlC implementation of C++, the first argument is always the this pointer, which intuitively has a very high probability of being dereferenced in the ensuing method call. But this work has the limited scope of prefetching only the pointers passed as parameters.

Youfeng [33] introduced another heuristic for prefetching in pointer-based applications. This is based on the fact that some important load instructions in irregular programs exhibit stride access patterns. Namely, the difference between the addresses of two successive data accesses changes only infrequently at runtime. But these strides are impossible to identify with compiler analysis alone since the memory allocation is decided at runtime. In this work, they designed a new profiling method that integrates profiling for stride information and the traditional profiling for edge frequency into a single profiling pass. The collected stride information helps the compiler to identify load instructions with stride patterns that can be prefetched efficiently.

The work by Chi Keung Luk and Todd Mowry [19] analyzes the major issues and challenges involved in software-controlled prefetching for Recursive Data Structures (RDS) like lists, trees and graphs. In general, analyzing the address of heap-allocated objects is a very difficult problem for the compiler. They propose three possible solutions to overcome this problem:

1. In a k-ary RDS⁵, all k pointers can be used in prefetching, in the hope that the objects pointed to by the other pointers will also be used in the future (greedy prefetching, sketched below).

2. The first traversal through the RDS can be used to create a history. The history would add an extra pointer to each node to indicate which node is to be prefetched from the current node. Subsequent traversals through the RDS

⁵ Each node contains k pointers to other nodes.
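For a binary tree, the greedy option of solution 1 looks roughly as follows; this is an illustrative sketch, not Luk and Mowry's code, and __builtin_prefetch again stands in for the non-binding prefetch instruction.

struct tree { struct tree *left, *right; int key; };

/* Greedy prefetching in a 2-ary RDS: on entering a node, issue
 * prefetches for both children, then recurse.  One of the two
 * prefetches is wasted at every step, but the useful one overlaps
 * the next node's miss latency with the work done at this node.   */
long tree_sum(struct tree *t)
{
    if (t == NULL)
        return 0;
    __builtin_prefetch(t->left);
    __builtin_prefetch(t->right);
    return t->key + tree_sum(t->left) + tree_sum(t->right);
}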


In recent times, multithreaded processors have become popular. There is an enormous amount of research interest in investigating whether these extra threads could be used to improve the performance of single threaded applications. In the next section, we review some of those techniques which use a helper thread for prefetching.

2.2.3 Thread Based Techniques

Despite the importance of mispredicted branches and loads that miss in the cache, a sequential processor is not able to prioritize these computations because it must fetch all computations sequentially, regardless of their contribution to performance. Alleviating this by spawning separate threads to execute only the delinquent operations and the other instructions that contribute to them is the fundamental idea behind all thread based techniques.

Speculative Data Driven Multithreading (DDMT) was introduced by Amir Roth et al. [25]. In DDMT, critical computations are identified with the help of a profiler and annotated so that they can execute stand-alone. When the processor predicts an upcoming instance of a critical instruction, it microarchitecturally forks a copy of its computation as a new kind of speculative thread. This thread executes in parallel with the main thread, but typically generates its results faster. These threads execute speculatively; they do not change the architected state of the machine, though they may impact the performance of the application.

Collins et al. [15] extend the thread based latency tolerance ideas of Amir Roth [25]. In this work, they first identify delinquent loads⁶ with the help of a profiler. Then the program is simulated on a functional Itanium simulator to create p-slices⁷ for each delinquent load. Whenever a delinquent load is executed, the instruction that had been executed 128 instructions prior to it in the dynamic execution stream is marked as a potential basic trigger. This is achieved by keeping the most recent 256 retired instructions in a buffer and looking it up for the 128th instruction. The next few times that this potential trigger is executed, the instruction stream is observed to verify that the same delinquent load is executed somewhere within the next 256 instructions. If the potential trigger consistently fails to lead to the delinquent load, it is discarded. Otherwise, if the trigger consistently leads to the delinquent load, the trigger is confirmed and the backward slice of instructions between the delinquent load and the trigger is captured. The instructions between the trigger and the delinquent load constitute the potential instructions for constructing the p-slice; those unnecessary for computing the address are eliminated.

In addition to these basic triggers, they use chaining triggers, which allow one speculative thread to explicitly spawn another speculative thread. A key feature for applying chaining triggers is the presence of a stride in the addresses consumed by a load that is a dynamic invariant, whose value is fixed for the duration of the loop. Thus p-slices containing chaining triggers typically have three parts: a prologue, a spawn instruction for spawning another copy of this p-slice, and an epilogue.

Most of the thread based techniques differ only in the way threads are created and how they are triggered. On the one end, researchers [17] propose a source-to-source C compiler that extracts p-slices, reducing the dynamic hardware required. On the other end, in the long range prefetching technique [15], p-threads are constructed, spawned, improved upon, evaluated and possibly even removed, entirely by hardware. In either case, some amount of hardware support is required, in the form of

⁶ Loads that have the largest impact on performance.

⁷ Precomputation slices.


A combination of hardware and software techniques was used by Abraham et al. [26] to predict the latencies of load/store instructions and subsequently use them to improve the performance of the application. This method requires that the ISA have instructions that permit the software to manage the cache, e.g., the DEC Alpha. In addition to the standard load/store operations, the architecture needs to provide explicit control over the memory hierarchy. For example, there could be two modifiers associated with each load operation: one specifying the level in the memory hierarchy at which this load is likely to be found, and another specifying the level at which the loaded value should be placed. This hardware support is becoming increasingly common in commercial microprocessors. In this work, they use profiling to get the memory referencing behavior of individual machine-level instructions. The information gained by the compiler through profiling can be passed on to the hardware by annotating the instructions, viz., by adding values to these modifiers. If the compiler is unable to gain this information, these modifiers are set to a special nta⁹ value, which specifies that no information is available. This allows for mixed compiler/hardware control over the cache hierarchy, where the compiler interferes only if it has some insight into the program behavior.

⁸ They are precomputation based, not prediction based.

⁹ Not available.


2.3 Application Restructuring

Instead of using either hardware or software methods to effect a prefetch, there are techniques that have been proposed for restructuring the program to modify its cache behavior. One such methodology is detailed below.

A method of creating cache hit/miss heuristics and utilizing them in the amelioration of the memory latency bottleneck was introduced by Toshihiro et al. [29]. In this work, they developed simple compiler heuristics to identify load instructions that are likely to cause a cache miss. Firstly, the loads are classified into list accesses, stride accesses or others. A list access refers to a load instruction whose load address comes from another load instruction, which is typical of pointer-chasing. A stride access refers to loads in a loop with a constant or variable address increment. For every load that falls into either of these two classes, there is a high probability of a cache miss. Hence the compiler tries to insert sufficient instructions between the selected load instruction and the instructions that use the loaded data, in one of the following three ways: the selected load instruction and its address calculation are moved up, or the instruction that uses the loaded data and its dependents are moved down, or instructions not related to this load are moved between the load and its use. These moves are allowed to cross basic block boundaries. This, in effect, reduces the stalls due to the load, since the computations inserted in between are independent of the load.
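The effect of these moves can be pictured with the invented before/after fragment below, where independent arithmetic is placed between a likely-missing load and its first use so that the miss latency overlaps with useful work.

/* Before: the load of *q is immediately consumed, so a cache miss
 * stalls the pipeline for the full miss latency.                    */
int before(const int *q, int a, int b)
{
    int v = *q;              /* likely cache miss                     */
    int r = v * 2;           /* immediate use: stalls on the miss     */
    int t = a * b + a - b;   /* independent work, done too late       */
    return r + t;
}

/* After: the independent computation is scheduled between the load
 * and its first use, so it executes while the miss is outstanding.  */
int after(const int *q, int a, int b)
{
    int v = *q;              /* load issued early                     */
    int t = a * b + a - b;   /* independent work hides the latency    */
    int r = v * 2;           /* use: data has (hopefully) arrived     */
    return r + t;
}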

2.4 Limitations

All the above methods fall short of the proposed PEPSE, which:

• Provides a unified framework for prefetching in both scientific and pointer-intensive applications using the well known concepts of speculative execution and the Program Dependence Graph (PDG).

• Ensures accurate and timely precomputation of the load addresses and hence does not issue unnecessary prefetches.

• Does not require any special hardware to implement.

• Has little resource overhead, since it utilizes the available unutilized resources in the architecture.


Chapter 3

LDG and PEPSE

In this chapter we elaborate on our proposed methodology. First, we explain the concept of the Load Dependence Graph (LDG). Then we explain Program Embedded Precomputation using Speculative Execution (PEPSE), our technique to embed the speculative program slices along with the original program. Throughout this chapter, we assume that the reader is familiar with the standard control and data flow analysis techniques.

3.1 Load Dependence Graph

The concept of the Program Dependence Graph is well established in the compiler arena. At compile time, the validity of operations is governed by the dependences that need to be respected. If a transformation would disrupt a dependence, then it would not be allowed. A typical compiler constructs the data and control dependence graphs before it begins optimizing code, as these graphs are essential for verifying whether certain transformations are possible on the code. In the following subsections, we show how the concept of the PDG can be used to extract the subset of a program which computes the address of a load — the Load Dependence Graph.


3.1.1 Delinquent Load Selection

Callahan et al. [5] show that, on average, an application spends about one-third of its execution time waiting for cache misses (for a memory latency of about 50 cycles). Current trends in processor design increase this even further. They [5] also observe that a small percentage of the references cause the majority of the misses in a program. To validate these claims, so that we could focus our optimizations on only a few delinquent loads in a program, we profiled various programs to find the number of loads that account for more than 90% of the misses. Empirically, we modelled different memory system architectures, including the Pentium 4, Itanium and Itanium 2, and we overwhelmingly found that a very small number of load instructions cause more than 90% of the data stalls incurred by the processor. The results are shown in Table 3.1. This characteristic allows us to focus the memory system optimizations on a small subset of the total load instructions in the program.

Our framework identifies the delinquent loads in a program using profiling, a technique that is becoming popular in feedback driven optimizations. We generate the profile information by instrumenting the code generated by ORC to couple it with the Dinero IV cache simulator [10]. The simulator allows various parameters of each cache to be set separately (architecture, policy, statistics). During initialization, the configuration to be simulated is built up, one cache at a time, starting with each memory as a special case. After initialization, each reference is fed to the appropriate top-level cache by a single simple function call. Lower levels of the hierarchy are handled automatically. The simulator is trace driven, viz., it works on the traces of memory accesses generated by the program. The loads in the program are identified with the help of a centralized identifier generator which assigns a new identifier to every memory operation in the program. This identifier, along with the reference address, is passed as a parameter to the cache simulator.
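Conceptually, the instrumentation inserted before every memory operation reduces to a call like the one below. The function and structure names are hypothetical; the real hooks are generated inside ORC and drive Dinero IV, whose actual interface differs.

#include <stdint.h>

#define MAX_LOADS 65536             /* assumed cap on static memory ops */
#define LEVELS    3                 /* e.g. L1, L2, L3/memory           */

/* Per-load miss counters, indexed by the identifier handed out by the
 * centralized identifier generator.                                    */
static unsigned long misses[MAX_LOADS][LEVELS];

/* Assumed simulator entry point: feeds one reference to the top-level
 * cache model and reports the deepest level that had to service it
 * (0 = L1 hit, LEVELS = main memory).                                  */
extern int cache_sim_access(uint64_t addr);

/* Hook called by the instrumented program for every memory reference. */
void profile_reference(int load_id, uint64_t addr)
{
    int level = cache_sim_access(addr);
    for (int l = 0; l < level && l < LEVELS; l++)
        misses[load_id][l]++;       /* it missed in every level above   */
}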

When the instrumented code is run with the simulator, it produces the statistics of the hits and misses of the program in the memory hierarchy. For each load, we compute the total stall cycles caused by that load,

Total Stall Cycles = Σ_n misses_n × latency_n

where latency_n is the latency of a particular cache level/main memory and misses_n is the number of misses the load incurs at that level. This gives the total performance degradation of the application due to this load. After sorting the loads according to their total stall cycles, we pick the top 5% of them for our analysis.
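Given the per-level miss counts, the selection step is a small computation, sketched here with assumed latency values and the 5% cutoff used in this work.

#include <stdlib.h>

#define LEVELS 3
static const unsigned latency[LEVELS] = { 8, 24, 200 };   /* assumed cycles */

struct load_stat {
    int           id;
    unsigned long miss[LEVELS];
    unsigned long stall_cycles;
};

static int by_stalls_desc(const void *a, const void *b)
{
    const struct load_stat *x = a, *y = b;
    return (y->stall_cycles > x->stall_cycles) -
           (y->stall_cycles < x->stall_cycles);
}

/* Total Stall Cycles = sum over levels n of misses_n * latency_n;
 * the loads are then sorted and the top 5% kept for LDG creation.  */
size_t select_delinquent(struct load_stat *loads, size_t nloads)
{
    for (size_t i = 0; i < nloads; i++) {
        loads[i].stall_cycles = 0;
        for (int n = 0; n < LEVELS; n++)
            loads[i].stall_cycles += loads[i].miss[n] * latency[n];
    }
    qsort(loads, nloads, sizeof loads[0], by_stalls_desc);
    return nloads / 20;          /* number of loads in the top 5% */
}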

Since our methodology is profile driven, we recognize the importance of addressing the issue of profile sensitivity to different input workloads, that is, of checking whether the set of delinquent loads for an application remains relatively constant across different inputs. For our work, we used the distributed training input (train) to profile the applications. All our reported results in the later sections are collected using the reference input set (ref). Though we would expect the set of delinquent loads to depend on the workloads distributed with the program and also on the program's characteristics, we have observed that the set of delinquent loads does not vary much among the different input workloads.

3.1.2 LDG Creation

We use the concept of the PDG to create the Load Dependence Graph (LDG), which is a program slice consisting of the set of instructions that contribute to the address calculation of the load instruction. The LDG creation starts with the delinquent load and moves up, including any instruction that produces a result that any of the existing LDG instructions depends on.


Ideally, the last instruction of the LDG (the prefetch instruction) should be initiated δ cycles before the actual load is encountered, where δ is the average latency of the load instruction. This would prefetch the address just in time for the load instruction. But to achieve that, the LDG has to be started δ + α ahead of the load, where α is the schedule length of the LDG. This may not always be possible, because the LDG creation has to be stopped if one of the following happens:

• The LDG creation encounters a function call. Interprocedural analysis is beyond the scope of this work, though it remains an interesting topic to explore. Since we cannot determine the effect of the procedure call on the LDG instructions, we stop the LDG creation.

• The length of the LDG increases beyond a predefined limit. This ensures that embedding the speculative LDG instructions does not drastically increase the static length of the program.

• The current block is the first region, or all the predecessor blocks have already been visited; in either case the LDG creation is stopped.

If the LDG creation has to be stopped prematurely because of one of the above reasons, then the insertion of the LDG will not be able to fully hide the load latency. But it is still effective in reducing the latency of the load instruction.

While building the LDG, the LDG creation algorithm is allowed to cross basic block boundaries. In this case, a path-specific LDG would have to be created for each of the incoming paths. Without some kind of path profiling and pruning, the number of path-specific LDGs would be excessively large. For this, we use the branch profile and create path-specific LDGs only for incoming edges with at least 20% edge frequency, meaning that a branch edge must have been taken at least 20% of the time to be considered for a path-specific LDG.
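The construction just described is essentially a backward slice over data dependences with early termination. The sketch below casts it as a worklist algorithm over a minimal, invented IR; the distance bound δ + α and the region/predecessor checks are omitted for brevity.

#include <stdbool.h>
#include <stddef.h>

#define MAX_OPS  4          /* max source operands per instruction     */
#define MAX_LDG  32         /* assumed cap on LDG size                 */

/* Minimal IR node: each instruction records which instructions produce
 * its source operands (NULL for constants, parameters, etc.).          */
struct insn {
    struct insn *producer[MAX_OPS];
    bool         is_call;    /* calls terminate LDG construction        */
    bool         in_ldg;     /* visited marker                          */
};

/* Backward slice from a delinquent load: repeatedly add the producers
 * of instructions already in the slice, stopping at calls or when the
 * slice would exceed MAX_LDG operations.                               */
size_t build_ldg(struct insn *load, struct insn *ldg[MAX_LDG])
{
    struct insn *work[MAX_LDG];
    size_t nwork = 0, nldg = 0;

    load->in_ldg = true;
    work[nwork++] = load;

    while (nwork > 0 && nldg < MAX_LDG) {
        struct insn *cur = work[--nwork];
        ldg[nldg++] = cur;
        for (int i = 0; i < MAX_OPS; i++) {
            struct insn *p = cur->producer[i];
            if (p == NULL || p->in_ldg)
                continue;
            if (p->is_call)          /* cannot model the call's effect  */
                return nldg;         /* stop the construction early     */
            if (nwork < MAX_LDG) {
                p->in_ldg = true;
                work[nwork++] = p;
            }
        }
    }
    return nldg;                     /* instructions to copy speculatively */
}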
