Adaptive Techniques for Dynamic Processor Optimization: Theory and Practice, by Alice Wang and Samuel Naffziger


Figure 10.4 Pipeline with ‘normally closed’ latches. Open latches are unshaded; closed latches are shaded.

Their outputs therefore change nearly simultaneously, re-aligning the data wave front and reducing the chance of glitching in the subsequent stage. The disadvantage of this approach is that data propagation is slowed waiting for latches, which are not retaining anything useful, to open.

These styles of latch control can be mixed freely. The designer has the option of increased speed or reduced power. If the pipeline is filled to its maximum capacity, the decision is immaterial because the two behaviours can be shown to converge. However, in other circumstances a choice has to be made. This allows some adaptivity to the application at design time, but the principle can be extended so that this choice can be made dynamically according to the system’s loading.


Figure 10.5 Configurable asynchronous latch controller

The two latch controllers can be so similar in design that a single additional input (two or four additional transistors, depending on starting point) can be used to convert one to the other (Figure 10.5). Furthermore, provided the change is made at a ‘safe’ time in the cycle, this input can be switched dynamically. Thus, an asynchronous pipeline can be equipped with both ‘sport’ and ‘economy’ modes of operation using ‘Turbo latches’ [17].
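The mode input described above can be pictured with a small behavioural sketch. This is only an illustration, not the circuit of Figure 10.5: the class and method names are hypothetical, and it assumes that a ‘safe’ time is any moment when the latch is not holding a packet, which the text does not specify.

```python
from enum import Enum


class Mode(Enum):
    NORMALLY_OPEN = "sport"       # idle latch is transparent: faster
    NORMALLY_CLOSED = "economy"   # idle latch is opaque: lower power


class TurboLatchController:
    """Toy model of a latch controller with one extra mode input.

    Assumption (not from the source): the mode may only change while
    the latch is idle, i.e. not holding a data packet.
    """

    def __init__(self, mode=Mode.NORMALLY_OPEN):
        self.mode = mode
        self.pending = None    # requested mode waiting for a safe time
        self.holding = False   # True while a packet is latched

    def request_mode(self, mode):
        # Record the request; apply it at once only if the latch is idle.
        self.pending = mode
        if not self.holding:
            self._apply_pending()

    def _apply_pending(self):
        if self.pending is not None:
            self.mode = self.pending
            self.pending = None

    def capture(self):
        # The latch closes around a packet.
        self.holding = True

    def release(self):
        # The packet has been acknowledged; the latch is idle again,
        # so a deferred mode change can now take effect.
        self.holding = False
        self._apply_pending()

    def idle_transparent(self):
        # Only an idle, normally open latch lets data flow through.
        return (not self.holding) and self.mode is Mode.NORMALLY_OPEN
```

A request made while a packet is latched is deferred until `release()` is called, which is one simple way to model switching only at a safe point in the cycle.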

The effectiveness of using normally closed latches for energy conservation has been investigated in a bundled-data environment; the result depends strongly on both the pipeline occupancy and, as might be expected, the variation in the values of the bits flowing down the datapath.

The least favourable case is when the pipeline is fully occupied, when even a normally open latch will typically not open until about the time that new data is arriving; in this case, there is no energy wastage due to the propagation of earlier values. In the ‘best’ case, with uncorrelated input data and low pipeline occupancy, an energy saving of ~20% can be achieved at a price of ~10% performance, or vice versa.

10.5.2 Controlling the Pipeline Occupancy

In the foregoing, it has tacitly been assumed that processing is handled in pipelines. Some applications, particularly those processing streaming data, naturally map onto deep pipelines. Others, such as processors, are more problematic because a branch instruction may force a pipeline flush and any speculatively fetched instructions will then be discarded, wasting energy. However, it is generally not possible to achieve high performance without employing pipelining.

Figure 10.6 Occupancy throttling using token return mechanism

In a synchronous processor, the speculation depth is effectively set by the microarchitecture. It is possible to leave stages ‘empty’, but there is no great benefit in doing so as the registers are still clocked. In an asynchronous processor, latches with nothing to do are not ‘clocked’, so it is sensibly possible to throttle the input to leave gaps between instruction packets and thus reduce speculation, albeit at a significant performance cost. This can be done, for example, when it is known that a low processing load is required or, alternatively, if it is known that the available energy supply is limited. Various mechanisms are possible: a simple throttle can be implemented by requiring instruction packets to carry a ‘token’ through the pipeline, collecting it at fetch time and recycling it when they are retired (Figure 10.6). For full-speed operation, there must be at least as many tokens as there are pipeline stages so that no instruction has to wait for a token and flow is limited purely by the speed of the processing circuits. However, to limit flow, some of the tokens (in the return pipeline) can be removed, thus imposing an upper limit on pipeline occupancy. This limit can be controlled dynamically, reducing speculation and thereby cutting power as the environment demands.
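The token mechanism can be sketched with a toy simulation. The function name, its parameters, and the use of discrete time steps are assumptions made purely for illustration: a real asynchronous pipeline has no global step, and the token return path would itself have a delay that is ignored here.

```python
from collections import deque


def simulate(stages, tokens, steps):
    """Return (instructions retired, peak occupancy) for a toy pipeline.

    Hypothetical model: every packet advances one stage per step, a
    packet needs a free token to be fetched, and the token is recycled
    in the same step in which the packet retires.
    """
    pipeline = deque([False] * stages)  # True marks a stage holding a packet
    free_tokens = tokens
    retired = 0
    peak = 0
    for _ in range(steps):
        # Retire whatever is in the last stage and recycle its token.
        if pipeline.pop():
            retired += 1
            free_tokens += 1
        # Fetch a new packet only if a token is available;
        # otherwise a gap (an empty stage) enters the pipeline.
        if free_tokens > 0:
            free_tokens -= 1
            pipeline.appendleft(True)
        else:
            pipeline.appendleft(False)
        peak = max(peak, sum(pipeline))
    return retired, peak


full_speed = simulate(stages=4, tokens=4, steps=100)
throttled = simulate(stages=4, tokens=2, steps=100)
```

In this model, removing tokens caps the occupancy at the number of tokens that remain and lowers the retirement rate in proportion, which mirrors the trade-off described above: less speculation in flight at a cost in throughput.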

An added bonus to this scheme is that if speculation is sufficiently limited, other power-hungry circuits such as branch prediction can be disabled without further performance penalty.

10.5.3 Reconfiguring the Microarchitecture

Turbo latches can alter the behaviour of an asynchronous pipeline, but they are still latches and still divide the pipeline up into stages which are fixed in the architecture. However, in an asynchronous system adaptability can be extended further; even the stage sizes can be altered dynamically!

A ‘normally open’ asynchronous stage works in this manner:

1. Wait for the stage to be ready and the arrival of data at the input latch;
2. Close the input latch;
3. Process the data;
4. Close the output latch;
5. Signal acknowledgement;
6. Open the input latch.

Such latching stages operate in sequence, with the whole task being partitioned in an arbitrary manner.

If another latch was present halfway through data processing (step 3, above), this would subdivide the stage and produce the acknowledgement earlier than otherwise. The second half of the processing could then continue in parallel with the recovery of the earlier part of the stage, which would then be able to accept new data sooner. The intermediate latch would reopen again when the downstream acknowledgement (step 5, above) reached it, ready to accept the next packet. This process has subdivided what was one pipeline stage into two, potentially providing a near doubling in throughput at the cost of some extra energy in opening and closing the intermediate latch.
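A back-of-the-envelope calculation shows why the gain is ‘near’ rather than exactly double. The delay model below is an assumption for illustration only: it treats the cycle time of a stage as its processing delay plus a fixed latch and handshake overhead, and it supposes that the intermediate latch splits the processing exactly in half.

```python
def throughput_gain(processing, overhead):
    """Ratio of split-stage to single-stage throughput in a toy delay model.

    processing: delay of the full stage's logic (arbitrary time units)
    overhead:   assumed fixed latch/handshake delay paid once per stage
    """
    single_cycle = processing + overhead     # one stage does all the work
    split_cycle = processing / 2 + overhead  # each half-stage does half
    return single_cycle / split_cycle


gain = throughput_gain(processing=10.0, overhead=1.0)
```

With a processing delay of 10 units and an overhead of 1 unit, the ratio is 11/6, or about 1.83. It approaches 2 only as the overhead becomes negligible compared with the processing delay.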

In an asynchronous pipeline, interactions are always local and it is possible to alter the pipeline depth during operation knowing that the rest of the system will accommodate the change. It is possible to tag each data packet with information to control the latch behaviour. When a packet reaches a latch, it is forced into local synchronisation with that stage. Instead of closing and acknowledging the packet the controller can simply pass it through by keeping the latch transparent and forwarding the control signal. No acknowledgement is generated; this will be passed back when it appears from the subsequent stage. In this manner, a pipeline latch can be removed from the system, altering the microarchitecture in a fundamental way. In Figure 10.7, packet ‘B’ does not close, and therefore ‘eliminates’, the central latch; this and subsequent operations are slower but save on switching the high-capacitance latch enable.

Of course, this change is reversible; a latch which has been deactivated can spot a reactivation command flowing through and close, reinstating the ‘missing’ stage in the pipeline. In Figure 10.8, packet ‘D’ restores the central latch allowing the next packet to begin processing despite the fact that (in this case) packet ‘C’ appears to have stalled.
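The tag-driven behaviour of the central latch in Figures 10.7 and 10.8 can be sketched as a small state machine. The tag values `"keep"`, `"collapse"` and `"restore"` and the class names are hypothetical; the text says only that packets carry information that controls the latch, not how it is encoded.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    name: str
    command: str = "keep"  # hypothetical tags: "keep", "collapse", "restore"


class CollapsibleLatch:
    """Toy model of a pipeline latch that packets can switch off and on."""

    def __init__(self):
        self.active = True  # an active latch closes and acknowledges

    def on_arrival(self, packet):
        # A tagged packet changes the latch state before it is handled.
        if packet.command == "collapse":
            self.active = False
        elif packet.command == "restore":
            self.active = True
        # Active: close on the packet and acknowledge upstream.
        # Inactive: stay transparent and forward the control signal;
        # the acknowledgement will come from the following stage.
        return "latched" if self.active else "passed"


latch = CollapsibleLatch()
sequence = [
    Packet("A"),
    Packet("B", "collapse"),
    Packet("C"),
    Packet("D", "restore"),
]
outcomes = [latch.on_arrival(p) for p in sequence]
```

Here packet ‘B’ removes the latch, ‘C’ passes straight through the merged stage, and ‘D’ reinstates it, so `outcomes` is `["latched", "passed", "passed", "latched"]`.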

Why might this be useful? The technique has been analysed in a processor model using a range of benchmarks [18–20]. As might be expected, collapsing latches and combining pipeline stages (in what was, initially, a reasonably balanced pipeline) reduces overall throughput by, typically, 50–100%. Energy savings are more variable: streaming data applications that contain few branches show no great benefit; more ‘typical’ microprocessor applications with more branches exhibit ~10% energy savings and, as might be expected, the performance penalty is at the lower end of the range. If this technique is to prove useful, it is certainly one which needs to be used carefully and applied dynamically, possibly under software control; however, it can provide benefits and is another tool available to the designer.

Figure 10.7 Pipeline collapsing and losing latch stage

Figure 10.8 Pipeline expanding and reinstating latch stage


10.6 Benefits of Asynchronous Design

Asynchronous operation brings diverse benefits to microprocessors, but these are in general hard to quantify. Unequivocal comparisons with clocked processors are few and far between. Part of the difficulty lies in the fact that there are many ways to build microprocessors without clocks, each offering its own trade-offs in terms of performance, power efficiency, adaptability, and so on. Exploration of asynchronous territory has been far less extensive than that of the clocked domain, so we can at this stage only point to specific exemplars to see how asynchronous design can work out in practice.

The Amulet processor series demonstrated the feasibility, technical merit, and commercial viability of asynchronous processors. These full-custom designs showed that asynchronous processor cores can be competitive with clocked processors in terms of area and performance, with dramatically reduced electromagnetic emissions. They also demonstrated modest power savings under heavy processing loads, with greatly simplified power management and greater power savings under variable event-driven workloads.

The Philips asynchronous 80C51 [7] has enjoyed considerable commercial success, demonstrating good power efficiency and very low electromagnetic emissions. It is a synthesised processor, showing that asynchronous synthesis is a viable route to an effective microprocessor, at least at lower performance levels.

The ARM996HS [8], developed in collaboration between ARM Ltd and Handshake Solutions, is a synthesised asynchronous ARM9 core available as a licensable IP core with better power efficiency (albeit at lower performance) than the clocked ARM9 cores. It demonstrated low current peaks and very low electromagnetic emissions and is robust against current, voltage, and temperature variations due to the intrinsic ability of the asynchronous technology to adapt to changing environmental conditions.

All of the above designs employ conventional instruction set architectures and have implemented these in an asynchronous framework while maintaining a high degree of compatibility with their clocked predecessors. This compatibility makes comparison relatively straightforward, but may constrain the asynchronous design in ways that limit its potential. More radical asynchronous designs have been conceived that owe less to the heritage of clocked processors, such as the Sun FLEET architecture [21], but there is still a long way to go before the comparative merits of these can be assessed quantitatively.

10.7 Conclusion

Although almost all current microprocessor designs are based on the use of a central clock, this is not the only viable approach. Asynchronous design, which dispenses with global timing control in favour of local synchronisation as and when required, introduces several potential degrees of adaptation that are not readily available to the clocked system. Asynchronous circuits intrinsically adapt to variations in supply voltage (making dynamic voltage scaling very straightforward), temperature, process variability, crosstalk, and so on. They can adapt to varying processing requirements, in particular enabling highly efficient event-driven, real-time systems. They can adapt to varying data workloads, allowing hardware resources to be optimised for typical rather than very rare operand values, and they can adapt very flexibly (and continuously, rather than in discrete steps) to variable memory response times. In addition, asynchronous processor microarchitectures can adapt to operating conditions by varying their fundamental pipeline behaviour and effective pipeline depth.

The flexibility and adaptability of asynchronous microprocessors make them highly suited to a future that holds the promise of increasing device variability. There remain issues relating to design tool support for asynchronous design, and a limited resource of engineers skilled in the art, but the option of global synchronisation faces increasing difficulties, at least some of which can be ameliorated through the use of asynchronous design techniques. We live in interesting times for the asynchronous microprocessor; only time will tell how the balance of forces will ultimately resolve.

References

[1] A.J. Martin, S.M. Burns, T.K. Lee, D. Borkovic and P.J. Hazewindus, “The Design of an Asynchronous Microprocessor”, ARVLSI: Decennial Caltech Conference on VLSI, ed. C.L. Seitz, MIT Press, 1989, pp. 351–373.

[2] S.B. Furber, P. Day, J.D. Garside, N.C. Paver and J.V. Woods, “AMULET1: A Micropipelined ARM”, Proceedings of CompCon'94, IEEE Computer Society Press, San Francisco, March 1994, pp. 476–485.

[3] A. Takamura, M. Kuwako, M. Imai, T. Fujii, M. Ozawa, I. Fukasaku, Y. Ueno and T. Nanya, “TITAC-2: A 32-Bit Asynchronous Microprocessor Based on Scalable-Delay-Insensitive Model”, Proceedings of ICCD'97, October 1997, pp. 288–294.

[4] M. Renaudin, P. Vivet and F. Robin, “ASPRO-216: A Standard-Cell Q.D.I. 16-Bit RISC Asynchronous Microprocessor”, Proceedings of Async'98, IEEE Computer Society, 1998, pp. 22–31. ISBN: 0-8186-8392-9.

[5] S.B. Furber, J.D. Garside and D.A. Gilbert, “AMULET3: A High-Performance Self-Timed ARM Microprocessor”, Proceedings of ICCD'98, Austin, TX, 5–7 October 1998, pp. 247–252. ISBN 0-8186-9099-2.

[6] S.B. Furber, A. Efthymiou, J.D. Garside, M.J.G. Lewis, D.W. Lloyd and S. Temple, “Power Management in the AMULET Microprocessors”, IEEE Design and Test of Computers, ed. E. Macii, March–April 2001, Vol. 18, No. 2, pp. 42–52. ISSN: 0740-7475.

[7] H. van Gageldonk, K. van Berkel, A. Peeters, D. Baumann, D. Gloor and G. Stegmann, “An Asynchronous Low-Power 80C51 Microcontroller”, Proceedings of Async'98, IEEE Computer Society, 1998, pp. 96–107. ISBN: 0-8186-8392-9.

[8] A. Bink and R. York, “ARM996HS: The First Licensable, Clockless 32-Bit Processor Core”, IEEE Micro, March 2007, Vol. 27, No. 2, pp. 58–68. ISSN: 0272-1732.

[9] I. Sutherland, “Micropipelines”, Communications of the ACM, June 1989, Vol. 32, No. 6, pp. 720–738. ISSN: 0001-0782.

[10] J. Sparsø and S. Furber (eds.), “Principles of Asynchronous Circuit Design – A Systems Perspective”, Kluwer Academic Publishers, 2002. ISBN-10: 0792376137. ISBN-13: 978-0792376132.

[11] S.B. Furber, D.A. Edwards and J.D. Garside, “AMULET3: A 100 MIPS Asynchronous Embedded Processor”, Proceedings of ICCD'00, 17–20 September 2000.

[12] D. Seal (ed.), “ARM Architecture Reference Manual (Second Edition)”, Addison-Wesley, 2000. ISBN-10: 0201737191. ISBN-13: 978-0201737196.

[13] J.D. Garside, “A CMOS VLSI Implementation of an Asynchronous ALU”, “Asynchronous Design Methodologies”, eds. S.B. Furber and M. Edwards, Elsevier 1993, IFIP Trans. A-28, pp. 181–207.

[14] D. Hormdee and J.D. Garside, “AMULET3i Cache Architecture”, Proceedings of Async'01, IEEE Computer Society Press, March 2001, pp. 152–161. ISSN 1522-8681. ISBN 0-7695-1034-4.

[15] W.A. Clark, “Macromodular Computer Systems”, Proceedings of the Spring Joint Conference, AFIPS, April 1967.

[16] D.M. Chapiro, “Globally-Asynchronous Locally-Synchronous Systems”, Ph.D. thesis, Stanford University, USA, October 1984.

[17] M. Lewis, J.D. Garside and L.E.M. Brackenbury, “Reconfigurable Latch Controllers for Low Power Asynchronous Circuits”, Proceedings of Async'99, IEEE Computer Society Press, April 1999, pp. 27–35.

[18] A. Efthymiou, “Asynchronous Techniques for Power-Adaptive Processing”, Ph.D. thesis, Department of Computer Science, University of Manchester, UK, 2002.

[19] A. Efthymiou and J.D. Garside, “Adaptive Pipeline Depth Control for Processor Power-Management”, Proceedings of ICCD'02, Freiburg, September 2002, pp. 454–457. ISBN 0-7695 1700-5. ISSN 1063-6404.

[20] A. Efthymiou and J.D. Garside, “Adaptive Pipeline Structures for Speculation Control”, Proceedings of Async'03, Vancouver, May 2003, pp. 46–55. ISBN 0-7695-1898-2. ISSN 1522-8681.

[21] W.S. Coates, J.K. Lexau, I.W. Jones, S.M. Fairbanks and I.E. Sutherland, “FLEETzero: An Asynchronous Switching Experiment”, Proceedings of Async'01, IEEE Computer Society, 2001, pp. 173–182. ISBN: 0-7695-1034-5.

John J. Wuu

Advanced Micro Devices, Inc.

11.1 Introduction

The International Technology Roadmap for Semiconductors (ITRS) predicted in 2001 that by 2013, over 90% of SOC die area would be occupied by memory [7]. Such a level of integration poses many challenges, such as power, reliability, and yield. In addition, as transistor dimensions continue to shrink, transistor threshold voltage (VT) variation, which is inversely proportional to the square root of the transistor area, continues to increase. This VT variation, along with other factors contributing to overall variation, is creating difficulties in designing stable SRAM cells that meet product density and voltage requirements.

This chapter examines various dynamic and adaptive techniques for mitigating some of these common challenges in SRAM design. The chapter first introduces innovations at the bitslice level, which includes SRAM cells and immediate peripheral circuitry. These innovations seek to improve bitcell stability and increase the read and write margins, while reducing power. Next, it discusses power reduction techniques at the array level, which generally involve cache sleeping and methods for regulating the sleep voltage, as well as schemes for taking the cache into and out of sleep. Finally, the chapter examines yield and reliability, which are issues that engineers and designers cannot overlook, especially as caches continue to increase in size. To improve reliability, one must account for test escapes, latent defects, and soft errors; thus the chapter concludes with a discussion of error correction and dynamic cache line disable or reconfiguration options.

in SRAM Design

A Wang, S Naffziger (eds.), Adaptive Techniques for Dynamic Processor Optimization,
