Adaptive Techniques for Dynamic Processor Optimization: Theory and Practice


This adaptivity to environmental conditions means that voltage scaling on self-timed circuits is trivially easy to manage. All that is needed is to vary the voltage; the operating speed and power will adapt automatically. Similarly, the circuits will slow down if they become hot, but they will still function correctly. This has been demonstrated repeatedly with experimental asynchronous designs.
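As a rough illustration of why this tracking is automatic, the sketch below (my own, using the standard Sakurai–Newton alpha-power delay model with assumed parameter values, not figures from this chapter) shows how gate delay stretches as the supply drops; a self-timed circuit simply runs at whatever speed the resulting gate delay permits.

```python
# Sketch only: the alpha-power law models gate delay as V / (V - Vt)**alpha.
# Vt, alpha and the 1.2 V reference point are assumed illustrative values.
def gate_delay(v, vt=0.35, alpha=1.3):
    return v / (v - vt) ** alpha

for v in (1.2, 1.0, 0.8, 0.6):
    print(f"{v:.1f} V -> {gate_delay(v) / gate_delay(1.2):.2f}x delay")
```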

A great deal is said about voltage scaling elsewhere in this book, so it is sufficient here to note that most of the complexity of voltage scaling is in the clock control system, which ceases to be an issue when there is no clock to control! Instead, this chapter concentrates on other techniques which are facilitated by the asynchronous style.

10.3 Asynchronous Adaptation to Workload

Power – or, rather, energy – efficiency is important in many processing applications. As described elsewhere, one way of reducing the power consumption of a processor is reducing the clock (or instruction) frequency, and energy efficiency may then also be improved by lowering the supply voltage. Of course, if the processor is doing nothing useful, the energy efficiency is very poor, and in this circumstance, it is best to run as few instructions as possible. In the limit, the clock is stopped and the processor ‘sleeps’, pending a wake-up event such as an interrupt. Synchronous processors sometimes have different sleep modes, including gating the clock off but keeping the PLL running, shutting down the PLL, and turning the power off. The first of these still consumes noticeable power but allows rapid restart; the second is more economical but takes considerable time to restart as the PLL must be allowed to stabilise before the clock is used. This is undesirable if, for example, all that is required is the servicing of interrupts in a real-time system. It is a software decision as to which of these modes to adopt; needless to say, this software also imposes an energy overhead.

An asynchronous processor has fewer modes. If the processor is powered, it is either running as fast as it can under the prevailing environmental conditions or stalled waiting for some input or output. Because there is no external clock, if one subsystem is caused to stall, any stage waiting for its outputs will stall soon afterwards, as will stages trying to send input to it.

In this way, a single gate anywhere in the system can rapidly bring the whole system to a halt. For example, Figure 10.2 shows an asynchronous processor pipeline filling from the prefetch unit; here the system is halted by a ‘HALT’ operation reaching the execution stage, at which point the preceding pipeline fills up and stalls while the subsequent stages stall because they are starved of input. When the halt is rescinded, the system will resume where it left off and come to full speed almost instantaneously. Thus, power management is extremely easy to implement and requires almost no software control.

Figure 10.2 Processor pipeline halting in execution stage
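A toy model makes this halt propagation concrete. The sketch below (my own illustration, not code from the chapter) treats each pipeline stage as a capacity-one latch with the handshake rule ‘fire only if my input is full and my output is empty’; parking one item in the execute stage freezes the whole pipe within a few handshakes.

```python
# Hypothetical 5-stage pipeline: fetch, decode, execute, memory, write.
def tick(pipe, stalled=None):
    if pipe[-1] is not None:
        pipe[-1] = None                  # the final stage retires its packet
    for i in reversed(range(len(pipe) - 1)):
        if i == stalled:
            continue                     # the stalled stage refuses to send
        if pipe[i] is not None and pipe[i + 1] is None:
            pipe[i + 1], pipe[i] = pipe[i], None   # complete one handshake
    if pipe[0] is None:
        pipe[0] = "op"                   # prefetch keeps supplying work

pipe = ["op"] * 5                        # running at full occupancy
for cycle in range(4):
    tick(pipe, stalled=2)                # a 'HALT' is parked in execute
    print(cycle, pipe)
# Stages after the stall drain and then starve; stages before it fill up
# and stop. Clearing `stalled` lets the pipeline resume immediately.
```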

In the Amulet processors, a halt instruction was retrofitted to the ARM instruction set [12] by detecting an instruction which branches to itself. This is a common way to implement an idle task on the ARM and causes the processor to ‘spin’ until it is interrupted, burning power to no effect. In Amulet2 and Amulet3, this instruction causes a local stall which rapidly propagates throughout the system, reducing the dynamic power to zero. An interrupt simply releases the stall condition, causing the processor to resume and recognise the interrupt. This halt implementation is transparent – as the effect of stopping is not distinguishable from the effect of repeating an instruction which does not alter any data – except in the power consumption.
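For illustration, a branch-to-self is easy to spot in the 32-bit ARM encoding: B carries a signed 24-bit word offset relative to PC + 8, so a branch targeting its own address has an offset of −2. The sketch below (mine, not the Amulet hardware’s actual decode logic) shows the check for the unconditional case.

```python
def is_branch_to_self(instr: int) -> bool:
    """Detect the ARM 'B .' idle idiom (unconditional branch, offset -2)."""
    cond   = (instr >> 28) & 0xF          # 0xE = AL (always)
    opcode = (instr >> 24) & 0xF          # 0xA = B (0xB would be BL)
    offset = instr & 0xFFFFFF
    if offset & 0x800000:                 # sign-extend the 24-bit field
        offset -= 1 << 24
    # Branch target is PC + 8 + 4*offset, which equals PC when offset == -2.
    return cond == 0xE and opcode == 0xA and offset == -2

assert is_branch_to_self(0xEAFFFFFE)      # B .   (the classic idle loop)
assert not is_branch_to_self(0xEAFFFFFD)  # B .-4 (a real backward branch)
```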

Perhaps the most useful consequence of asynchronous systems only processing data on demand is that this results in power savings throughout the system. If a multiplier (for example) is not in use, it is not ‘clocked’ and therefore dissipates no dynamic power. This can be true of any subsystem, but it is particularly important in infrequently used blocks.

Of course, it is possible to gate clocks to mimic this effect, but clock gating can easily introduce timing compatibility problems and is certainly something which needs careful attention by the designer. Asynchronous design delivers an optimal ‘clock gating’ system without any additional effort on the part of the designer.

10.4 Data-Dependent Timing

A well-engineered synchronous pipeline will usually be ‘balanced’ so that the critical path in each stage is approximately the same length. This allows the circuit to be clocked at its maximum frequency, without performance being wasted as a result of potentially faster stages being slowed to the common clock rate. Good engineering is not easy, and considerable effort may need to be expended to achieve this.

The same principle holds in a self-timed system, although the design constraints are different. A self-timed pipeline will find its own operating speed in a similar fashion to traffic in a road system; a queue will form upstream of a choke point and be sparser downstream. In a simulation, this makes it clear where further design attention is required; this is usually – but not always – the slowest stage. One reason why a particularly slow stage may not slow the whole system is that it is on a ‘back road’ with very little traffic. There is no requirement to process each operation in a fixed period, so the system may adapt to its operating conditions. Here are some examples:

• In a memory system, some parts may go faster than others; cache memories rely on this property, which can be exploited even in synchronous systems, as a cache miss will stall for multiple clock cycles waiting for an external response. This is the ‘natural’ behaviour of an asynchronous memory, where the response is a single ‘cycle’ but the length of the cycle is varied according to need. An advantage in the asynchronous system is that it is easier to vary more parameters, and these can be altered in more ‘subtle’ ways than simply in discrete multiples of clock cycles.

• It is possible to exploit data dependency at a finer level. Additions are slow because of carry propagation. To speed them up, considerable effort – and hence hardware, and hence energy – is typically expended in fast carry logic of some form. This ensures that the critical path – propagating a carry from the least to the most significant bit position – is as short as possible. However, operations which require long carry propagation distances are comparatively rare; the effort, hardware, and power are expended on something which is rarely used. Given random operands, the longest carry chain in an N-bit adder is O(N), but the average length is O(log2(N)); for a 32-bit adder the longest is about 6× the average. If a variable-length cycle is possible, then a simple, energy-efficient, ripple-carry adder can produce a correct result in a time comparable to a much larger (more expensive, power-consuming) adder. Unfortunately, this is not the whole story because there is an overhead in detecting the carry completion and, in any case, ‘real’ additions do not use purely random operands [13]. Nevertheless, a much cheaper unit can supply respectable performance by adapting its timing to the operands on each cycle. In particular, an incrementer, such as is used for the programme counter, can be built very efficiently using this principle. (A small experiment illustrating the carry-chain statistics appears after this list.)

• Not all operations take the same evaluation time: some operation evaluation is data dependent. A simple example is a processor’s ALU operation, which typically may include options to MOVE, AND, ADD or MULTIPLY operands. A MOVE is a fast operation, and an AND, being a bitwise operation, is a similar speed. ADDs, however, are hampered by the need to propagate carries across the datapath and therefore are considerably slower. Multiplication, comprising repeated addition, is of course slower still. A typical synchronous ALU will probably set its critical path to the ADD operation and accept the inefficiency in the MOVE. Multiplication may then require multiple clock cycles, with a consequent pipeline stall, or be moved to a separate subsystem. An asynchronous ALU can accommodate all of these operations in a single cycle by varying the length of the cycle. This simplifies the higher-level design – any stalls are implicit – and allows faster operations to complete faster. It is sometimes said that self-timed systems can thus deliver ‘average case performance’; in practice, this is not true because it is likely that the operation subsequent to a fast operation like MOVE will not reach the unit immediately it is free, or the fast operation could be stalled waiting for a previous operation to complete. Having a 50:50 mixture of 60 mph cars and 20 mph tractors does not mean the traffic flows at 40 mph! However, if the slow operations are quite rare – such as multiplication in much code – then the traffic can flow at close to full speed most of the time while the overall model remains simple.


• At a higher level, it is possible to run different subsystems deliberately at different rates. As a final example, the top level of the memory system for Amulet3 is – as on many modern processors – split across separate instruction and data buses to allow parallelism of access [14]. Here these buses run to a unified local memory which is internally partitioned into interleaved blocks. Provided two accesses do not ‘collide’, these buses run independently at their own rates, and the bandwidth of the more heavily loaded instruction bus – which is simpler because it can only perform read operations – is somewhat higher than that of the read/write, multi-master data bus. In the event that two accesses collide in a single block, the later-arriving bus cycle is simply stretched to accommodate the extra latency. Adaptability here gives the designer freedom: slowing the instruction bus down to match the two speeds would result in lower performance, as would slowing the data bus to exactly half the instruction bus speed.
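The carry-chain claim above is easy to check empirically. The following sketch (my own illustration; the trial count and run-length definition are my assumptions, not the book’s) measures the longest carry-propagation run of a ripple-carry adder over random 32-bit operands – the quantity that would set the cycle length of a completion-detecting adder.

```python
import random

def longest_carry_run(a: int, b: int, n: int = 32) -> int:
    """Longest run of consecutive carry-outs when ripple-adding a and b."""
    carry = run = longest = 0
    for i in range(n):
        x, y = (a >> i) & 1, (b >> i) & 1
        carry = (x & y) | (carry & (x ^ y))   # full-adder carry-out
        run = run + 1 if carry else 0
        longest = max(longest, run)
    return longest

random.seed(1)
trials = [longest_carry_run(random.getrandbits(32), random.getrandbits(32))
          for _ in range(10_000)]
print(sum(trials) / len(trials))   # typically ~5-6, against a worst case of 32
```

An average near log2(32) ≈ 5 against a worst case of 32 is precisely the gap that a variable-length cycle lets the cheap ripple-carry adder exploit.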

The flexibility of asynchronous systems allows a considerable degree of modularity in those systems’ development. Provided interfaces are compatible, it is possible to assemble systems and be confident that they will not suffer from timing-closure problems – a fact which has been known for some time [15]. It would be nice to say that such systems would always work correctly! Unfortunately, this is not the case: as in any complex asynchronous system, it is possible to engineer in deadlocks; it is only timing incompatibilities which are eliminated. Where this is exploitable is in altering or upgrading systems, where a module – such as a multiplier – can be replaced with a compatible unit with different properties (e.g. higher speed or smaller area) with confidence that the system will not need extensive resimulation and recharacterisation.

Perhaps the most important area to emerge from this is at a higher level, i.e. in Systems-on-Chip (SoCs) using a GALS (Globally Asynchronous, Locally Synchronous) approach [16]. Here conventional clocked IP blocks are connected via an asynchronous fabric, effectively eliminating the timing-closure problems at the chip level – at least from a functional viewpoint. This can represent a considerable time saving for the ASIC designer.


10.5 Architectural Variation in Asynchronous Systems

A pipelined architecture requires a succession of state-holding elements to capture the output from one stage and hold it for the next. In a synchronous architecture, these pipeline registers may be edge-triggered (i.e. D-type flip-flops) for simplicity of design; if this is too expensive, then transparent latches may be used, typically using a two-phase, non-overlapping clock with alternating stages on opposite phases. The use of transparent latches has largely been driven out in recent times by the need to accommodate the limitations of synthesis and static timing analysis tools in high-productivity design flows, so the more expensive and power-hungry edge-triggered registers have come to dominate current design practice.

10.5.1 Adapting the Latch Style

In some self-timed designs (e.g. dual-rail), the latches may be closely associated with the control circuits; however, a bundled-data datapath closely resembles its synchronous counterpart. Because data is not transferred ‘simultaneously’ in all parts of the system, the simplicity (cheapness) of transparent latches is usually the preferred option. Here the ‘downstream’ latch closes and then allows the ‘upstream’ latch to open at any subsequent time. This operation can be seen in Figure 10.3, where transparent latches are unshaded and closed latches shaded.

Here there is a design trade-off between speed and power. Figure 10.3 depicts an asynchronous pipeline in which the latches are ‘normally open’ – i.e. when the pipeline is empty, all its latches are transparent; at the start the system thus looks like a block of combinatorial logic. As data flows through, the latches close behind it to hold it stable (or, put another way, to delay subsequent changes) and then open again when the next stage has captured its inputs. In the figure this is seen as a wave of activity as downstream latches close and, subsequently, the preceding latch opens again. When the pipeline is empty (clear road ahead!), this model allows data to flow at a higher speed than is possible in a synchronous pipeline because the pipeline latency is the sum of the critical paths in the stages rather than the product of the worst-case critical path and the pipeline depth.
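A back-of-envelope comparison shows the size of the effect; the stage delays below are invented purely for illustration.

```python
# Hypothetical critical paths for a 4-stage pipeline, in nanoseconds.
stage_delays_ns = [0.8, 1.0, 0.6, 0.9]

# Clocked pipeline: every stage costs the worst-case stage delay.
clocked_latency = max(stage_delays_ns) * len(stage_delays_ns)   # 4.0 ns

# Empty 'normally open' asynchronous pipe: each stage costs only itself.
async_latency = sum(stage_delays_ns)                            # 3.3 ns

print(clocked_latency, async_latency)
```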

The price of this approach is a potential increase in power consumption. The data ‘wave front’ will tend to skew as it flows through the logic, which can cause the input of a gate to change more times than it would if the wave front were re-aligned at every stage. This introduces glitches into the data which result in wasted energy due to the spurious transitions, which can propagate considerable distances.


Figure 10.3 Pipeline with ‘normally open’ latches. Open latches are unshaded; closed latches are shaded.

To prevent glitch propagation, the pipeline can adopt a ‘normally closed’ architecture (Figure 10.4). In this approach, the latches in an empty pipeline remain closed until the data signals its arrival, at which point they open briefly to ‘snap up’ the inputs. The wave of activity is therefore visible as a succession of briefly transparent latches (unshaded in the figure).


Figure 10.4 Pipeline with ‘normally closed’ latches. Open latches are unshaded; closed latches are shaded.

Their outputs therefore change nearly simultaneously, re-aligning the data wave front and reducing the chance of glitching in the subsequent stage. The disadvantage of this approach is that data propagation is slowed waiting for latches, which are not retaining anything useful, to open.

These styles of latch control can be mixed freely. The designer has the option of increased speed or reduced power. If the pipeline is filled to its maximum capacity, the decision is immaterial because the two behaviours can be shown to converge. However, in other circumstances a choice has to be made. This allows some adaptivity to the application at design time, but the principle can be extended so that this choice can be made dynamically according to the system’s loading.


Figure 10.5 Configurable asynchronous latch controller

The two latch controllers can be very similar in design – so much so that a single additional input (two or four additional transistors, depending on starting point) can be used to convert one to the other (Figure 10.5). Furthermore, provided the change is made at a ‘safe’ time in the cycle, this input can be switched dynamically. Thus, an asynchronous pipeline can be equipped with both ‘sport’ and ‘economy’ modes of operation using ‘Turbo latches’ [17].

The effectiveness of using normally closed latches for energy conservation has been investigated in a bundled-data environment; the result depends strongly on both the pipeline occupancy and, as might be expected, the variation in the values of the bits flowing down the datapath.

The least favourable case is when the pipeline is fully occupied, when even a normally open latch will typically not open until about the time that new data is arriving; in this case, there is no energy wastage due to the propagation of earlier values. In the ‘best’ case, with uncorrelated input data and low pipeline occupancy, an energy saving of ~20% can be achieved at a price of ~10% performance, or vice versa.

10.5.2 Controlling the Pipeline Occupancy

In the foregoing, it has tacitly been assumed that processing is handled in pipelines. Some applications, particularly those processing streaming data, naturally map onto deep pipelines. Others, such as processors, are more problematic because a branch instruction may force a pipeline flush, and any speculatively fetched instructions will then be discarded, wasting energy. However, it is generally not possible to achieve high performance without employing pipelining.


Figure 10.6 Occupancy throttling using token return mechanism

In a synchronous processor, the speculation depth is effectively set by the microarchitecture. It is possible to leave stages ‘empty’, but there is no great benefit in doing so as the registers are still clocked. In an asynchronous processor, latches with nothing to do are not ‘clocked’, so it is sensible to throttle the input to leave gaps between instruction packets and thus reduce speculation, albeit at a significant performance cost. This can be done, for example, when it is known that a low processing load is required or, alternatively, if it is known that the available energy supply is limited. Various mechanisms are possible: a simple throttle can be implemented by requiring instruction packets to carry a ‘token’ through the pipeline, collecting it at fetch time and recycling it when they are retired (Figure 10.6). For full-speed operation, there must be at least as many tokens as there are pipeline stages, so that no instruction has to wait for a token and flow is limited purely by the speed of the processing circuits. However, to limit flow, some of the tokens (in the return pipeline) can be removed, thus imposing an upper limit on pipeline occupancy. This limit can be controlled dynamically, reducing speculation and thereby cutting power as the environment demands.
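In software terms, the token return mechanism behaves like a counting semaphore on pipeline occupancy. The sketch below is a hypothetical model of the idea, not code from [17] or the Amulet designs; the class, its method names, and the token counts are my inventions for illustration.

```python
import threading

class TokenThrottle:
    """Counting-semaphore model of the token return pipeline."""

    def __init__(self, tokens: int):
        self._sem = threading.Semaphore(tokens)

    def fetch(self):
        self._sem.acquire()        # a packet collects a token at fetch time

    def retire(self):
        self._sem.release()        # the token is recycled at retirement

    def remove_tokens(self, n: int):
        for _ in range(n):         # parking tokens lowers the occupancy cap
            self._sem.acquire()

    def return_tokens(self, n: int):
        for _ in range(n):         # restoring tokens raises it again
            self._sem.release()

# With as many tokens as pipeline stages, fetch never blocks and flow is
# limited only by the processing circuits; removing tokens caps occupancy.
throttle = TokenThrottle(tokens=5)
throttle.remove_tokens(3)          # now at most 2 packets are in flight
```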

An added bonus to this scheme is that if speculation is sufficiently limited, other power-hungry circuits, such as branch prediction, can be disabled without further performance penalty.

10.5.3 Reconfiguring the Microarchitecture

Turbo latches can alter the behaviour of an asynchronous pipeline, but they are still latches and still divide the pipeline up into stages which are fixed in the architecture. However, in an asynchronous system, adaptability can be extended further; even the stage sizes can be altered dynamically!
