As a concrete example, assume the processor is running at 1 GHz and VDD = 1.75 V. If half of the cycles are stalls waiting for the bus, as determined by a combination of the total clock count, instructions executed, and data dependency stall or bus request counts, then VDD can be adjusted to 1.2 V (see Figure 6.2) and the core frequency reduced to 500 MHz. Useful work is then performed in a greater fraction of the (fewer overall) core clock cycles. Referring to Figure 6.2, the power savings is nearly 50%, with the same work finished in the same amount of time.
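A first-order estimate makes the magnitude of that savings plausible. Switching energy per operation scales with the square of the supply voltage, independent of the clock frequency, so, ignoring leakage and the reduced activity during stall cycles,

\[
\frac{E_{1.2\,\text{V}}}{E_{1.75\,\text{V}}} = \left(\frac{1.2}{1.75}\right)^{2} \approx 0.47 ,
\]

which is roughly the "nearly 50%" savings quoted above for the same work completed in the same wall-clock time.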
6.2 Dynamic Voltage Scaling on the XScale Microprocessor
This section describes experimental results running DVS on the 180 nm XScale microprocessor. The value of DVS is evident in Figure 6.3. Here, the 80200 microprocessor is shown functioning across a power range from 10 mW in idle mode up to 1.5 W at a 1 GHz clock frequency. The idle mode power is dominated by the PLL and clock generation unit. The processor core includes the capacity to apply reverse-body bias and supply collapse [10, 11] to the core transistors for fully state-retentive power-down. The microprocessor core consumes 100 μW in the low standby "Drowsy" mode [12]. The PLL and clock divider unit must be restarted when leaving Drowsy mode. When running with a clock frequency of 200 MHz, VDD can be reduced to 700 mV, providing power dissipation of less than 45 mW.
Figure 6.3 The value of dynamic voltage scaling is evident from this plot of the 80200 power and VDD voltage over time. The power lags due to the latency of the measurement and time averaging.
6.2.1 Running DVS
To demonstrate DVS on the XScale, a synthetic benchmark, programmed on the LRH demonstration board, is used here. The onboard voltage regulator is bypassed, and a daughter-card using a Lattice GAL22v10 PLD controller and a Maxim MAX1855 DC-DC converter evaluation kit is added. The DC-DC converter output voltage can vary from 0.6 to 1.75 V. The control is memory mapped, allowing software to control the processor core VDD.
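As an illustrative sketch only (the register address, the 32-bit width, and the helper name setvddcode are assumptions for this discussion, not the actual board mapping), the memory-mapped control might be declared and written as follows; the voltagep pointer is the one used by the voltage adjustment routines later in this section.

// Hypothetical mapping of the PLD register that sets the DC-DC converter output.
// The address and the register width are illustrative assumptions only.
#define VDD_CTRL_ADDR 0x0C000000UL
volatile unsigned int *voltagep = (volatile unsigned int *) VDD_CTRL_ADDR;
int voltage;                     /* current PLD voltage code, mirrored in software */

void setvddcode(int code) {
    voltage = code;              /* remember the code so later changes can be relative */
    *voltagep = code;            /* the write moves the core VDD set point */
}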
The synthetic benchmark loops between a basic block of code whose data set fits entirely in the cache (these pages are configured for write-back mode) and one whose data is non-cacheable and non-bufferable. The latter requires many more bus operations, which stall the core since the bus frequency of 100 MHz is lower than the core clock frequency, which must be at least 3× the bus frequency on the demonstration board.
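A minimal sketch of such a benchmark loop is given below. The buffer sizes and iteration structure are illustrative assumptions, and the non-cacheable, non-bufferable mapping of uncached_buf is assumed to be established elsewhere through the page attributes mentioned above.

#define CACHED_WORDS   4096              /* small enough to fit in the data cache (assumed) */
#define UNCACHED_WORDS 4096

static int cached_buf[CACHED_WORDS];     /* resides in write-back cacheable pages */
volatile int *uncached_buf;              /* non-cacheable, non-bufferable region, mapped by setup code (not shown) */

int benchmark_pass(void) {
    int i, sum = 0;
    /* Phase B: the data fits in the cache, so the measured CPI is low. */
    for (i = 0; i < CACHED_WORDS; i++)
        sum += cached_buf[i];
    /* Phase A: every access is a bus operation at 100 MHz, so the CPI is high. */
    for (i = 0; i < UNCACHED_WORDS; i++)
        sum += uncached_buf[i];
    return sum;
}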
The code monitors the actual operational CPI using the processor PMU. The number of executed instructions, as well as the number of clocks since the PMU was initialized and counting began, are monitored. The C code, with inline assembly to perform the low-level functions, is:
unsigned int count0, count1, count2;

int cpi() {
    // read and stop the performance counters
    // (r0 and r1 are assumed to survive between the consecutive asm statements)
    asm("mrc p14, 0, r0, c0, c0, 0" ::: "r0");   // read the PMNC register
    asm("bic r1, r0, #1" ::: "r1");              // clear the enable bit
    asm("mcr p14, 0, r1, c0, c0, 0" ::: "r1");   // clear interrupt flags, disable counting
    // read CCNT (clock count) and the two event counters
    asm("mrc p14, 0, %0, c1, c0, 0" : "=r" (count0));
    asm("mrc p14, 0, %0, c2, c0, 0" : "=r" (count1));
    asm("mrc p14, 0, %0, c3, c0, 0" : "=r" (count2));
    return(count0);
}
void startcounters() {
    unsigned int z;
    // set up and turn on the performance counters
    z = 0x00710707;                              // PMNC initialization value
    asm("mov r1, %0" :: "r" (z) : "r1");         // initialization value in register r1
    asm("mcr p14, 0, r1, c0, c0, 0" ::: "r1");   // write r1 to the PMNC register
}
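Assuming, based on the description above, that count0 returns the clock count and count1 the number of executed instructions (the exact event selection depends on the PMNC initialization value and is not decoded here), a time slice can be measured as in the sketch below; run_time_slice() is a hypothetical placeholder for the workload.

extern void run_time_slice(void);          /* hypothetical placeholder for the workload */

// Sketch: estimate the CPI over one time slice using the routines above.
float measure_cpi(void) {
    startcounters();                       // reset and enable the PMU counters
    run_time_slice();                      // run the work to be characterized
    cpi();                                 // stop counting and latch count0, count1, count2
    return (float) count0 / (float) count1;  // clocks per instruction
}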
Note that the code to utilize the PMU is neither large nor complicated. It is also straightforward to implement the actual VDD and core clock rate changes. To avoid creating a timing violation in the processor logic, the core voltage VDD must always be sufficient to support the core operating frequency. This requires that the voltage be raised before the frequency is increased and, conversely, that the frequency be reduced before the voltage is lowered. The XScale controls the clock divider ratio from the PLL through writes to CP14. The C code to raise the VDD voltage is:
int raisevoltage() {
    // raise the voltage first
    if (voltage <= TOP_V) {                      // already at the highest supported VDD
        printf("V at end of range ");            // leave it alone
    }
    else {
        voltage--;                               // a lower PLD code selects a higher VDD
        *voltagep = voltage;
        // then raise the core clock to the frequency allowed at the new VDD
        if (frequency < TOP_F) {
            frequency = uf[voltage];
            asm("mov r1, %0" :: "r" (frequency) : "r1");
            asm("mcr p14, 0, r1, c6, c0, 0" ::: "r1");  // write the new divider ratio to CP14
        }
    }
    return(voltage);
}
The code to lower the voltage is very similar, with its range check preventing the voltage from being set too low. The supported clock multipliers range from 3 to 11 [9]. The array uf[] is a lookup table of the appropriate frequency for each voltage setting. The PLD is programmed so that the highest voltage of 1.7 V is selected by writing the value 0, and higher values decrease the voltage in 50 mV steps (for the first four entries). No delay is required, since the coprocessor register write forces the core clocks to be inactive for approximately 20 μs while the PLL relocks to the new clock fraction; this is handled automatically by the XScale core hardware. Excellent power supply rejection ratio (PSRR) in the 80200 PLL allows the relock to occur in parallel with the voltage movement. When lowering the voltage, as mentioned, the frequency is reduced before the voltage. Again, the PLL lock time, which is invoked before the MCR P14 instruction can complete, hides the latency of the voltage movement from the software.
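A sketch of that voltage-lowering routine, consistent with the ordering just described, is shown below. BOTTOM_V is an assumed limit constant analogous to TOP_V, and the code assumes, per the PLD description above, that larger register values select lower voltages.

int lowervoltage() {
    if (voltage >= BOTTOM_V) {                     // already at the lowest supported VDD
        printf("V at end of range ");
    }
    else {
        // reduce the core clock first so VDD always supports the operating frequency
        frequency = uf[voltage + 1];               // frequency appropriate at the next lower VDD
        asm("mov r1, %0" :: "r" (frequency) : "r1");
        asm("mcr p14, 0, r1, c6, c0, 0" ::: "r1"); // write the new divider ratio to CP14
        voltage++;                                 // a larger PLD code selects a lower VDD
        *voltagep = voltage;
    }
    return(voltage);
}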
Figure 6.4 Simple DVS control heuristic using an estimate of the CPI as determined by the PMU. The CPI is estimated for each time slice and VDD adjusted if it is outside the dead-band parameters CPI_DB_high and CPI_DB_low. Otherwise, the VDD and clock frequency are unchanged.

Here, for illustration purposes, the control algorithm is very simple, as shown in Figure 6.4. All but the "Execute time slice…" block would be part of the OS. Behavior of the synthetic benchmark, using the code shown above, is shown in Figure 6.5(a). Many more complicated, and hence closer to optimal, VDD control algorithms have been developed, but they are application dependent and beyond the scope of this discussion. The frequency and voltage are increased by one increment if the measured CPI is below the predetermined value CPI_DB_high (the threshold below which the supply is driven high), and they are decreased by one increment if the CPI is above another predetermined value CPI_DB_low (the threshold above which the supply is driven low). They are left the same otherwise, i.e., the control dead-band is defined by the separation of the two values. Figure 6.5(b) shows the intervals more closely. The intervals running the bus-limited data access code are marked by A, and the faster running (cacheable data) code is marked by B. The distinct VDD voltage steps when the frequency and voltage are changed, as the data accesses move from one behavior to the other, are evident.
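The heuristic of Figure 6.4 can be written compactly as the loop sketched below, using the routines defined earlier. run_time_slice() is again a hypothetical placeholder, and, following the convention above, CPI_DB_high is the threshold below which the supply is driven higher and CPI_DB_low the threshold above which it is driven lower.

extern void run_time_slice(void);          /* hypothetical placeholder for one OS time slice */
extern float CPI_DB_high, CPI_DB_low;      /* dead-band thresholds, values chosen by the OS */

// Sketch of the dead-band DVS control heuristic of Figure 6.4.
void dvs_loop(void) {
    float cpi_est;
    for (;;) {
        startcounters();
        run_time_slice();
        cpi();
        cpi_est = (float) count0 / (float) count1;
        if (cpi_est < CPI_DB_high)
            raisevoltage();                // compute bound: raise VDD, then the frequency
        else if (cpi_est > CPI_DB_low)
            lowervoltage();                // bus bound: lower the frequency, then VDD
        // otherwise the CPI is inside the dead-band: leave VDD and frequency unchanged
    }
}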
Figure 6.5 Oscilloscope traces of VDD on the LRH test system. The system is running a synthetic benchmark that modifies VDD based on the CPI as determined by the PMU. (a)–(d) The distinct steps in voltage with each software-controlled clock rate and VDD change are evident. The VDD slew rate is shown in (e), where the supply ripple can also be seen.
Adjusting the size of the control heuristic dead-band to be too small causes the voltage to "hunt" when running the faster code, as evident in Figure 6.5(c), section B, since a stable CPI value between the one that causes an increase and the one that causes a decrease is not found. This hunting behavior is not efficient, since the PLL lock time is wasted for each 50 mV VDD movement. It is therefore important to define a large enough stable region and to make DVS changes (monitor the CPI) infrequently enough to keep the total voltage change time insignificant compared to the total operating time. A further adjustment in the heuristic affects the minimum usable voltage, by allowing still slower operation for the bus-limited code.
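As a rough illustration of that trade-off (the 100 ms evaluation interval is an assumed number, not one taken from the measurements), each voltage step costs roughly one 20 μs PLL relock, so evaluating the CPI once per 100 ms time slice bounds the transition overhead at about

\[
\frac{20\ \mu\text{s}}{100\ \text{ms}} = 2\times10^{-4} ,
\]

or 0.02% of the operating time. Hunting, by contrast, pays this relock penalty on nearly every interval.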
(Figure 6.5 panels: horizontal scales are (a) 1 s/div, (b) 400 ms/div, (c) 400 ms/div, (d) 200 ms/div, and (e) 20 μs/div; the A and B markers indicate the bus-limited and cacheable-data intervals, respectively.)
Figure 6.5(e) shows the maximum slew rate for the large voltage change from 1.0 to 1.7 V, which is the nearly vertical VDD movement near the end of the trace in Figure 6.5(d). The core VDD is slightly over-damped, as evident in Figure 6.5(e).
6.3 Impact of DVS on Memory Blocks
As mentioned in the introduction, some circuits may limit operation at low VDD. Microprocessors and SOC ICs include numerous memories, usually implemented with six-transistor SRAM cells. In future devices, it is expected that memory, and SRAM in particular, will dominate IC area [13]. Unfortunately, SRAM has diminishing read stability [14] as manufacturing processes are scaled down in size and transistor-level variations increase [15]. Lower VDD profoundly reduces SRAM read stability, making it a primary limiting circuit when applying DVS.

When the SRAM is read, the low storage node rises due to the voltage divider comprised of the two series NMOS transistors in the read current path, which includes one of the storage nodes. Monte Carlo simulations of SRAM static noise margin are shown in Figure 6.6. As VDD is decreased, the static noise margin (SNM), as measured by the side of the largest square that fits in the smaller lobe of the static voltage transfer curves (see Figure 6.6(a)), decreases as well. The large transistor mismatch, due to both systematic (die-to-die) and random (within-die) variations, causes asymmetry in the SNM plot, as shown in Figure 6.6(a). An IC contains many SRAM cells, so the combination of worst-case systematic and random variations can cause some cells to fail, significantly impacting the manufacturing yield at low VDD. The simulated behavior of the SRAM SNM vs. voltage, using Monte Carlo device variations to 5σ, is shown in Figure 6.6(b). It is evident that the SRAM read margins are strongly affected by the combination of transistor variation and reduced VDD. Register file memory, which is also ubiquitous in microprocessors and SOC ICs, does not suffer from reduced SNM when reading, since the read current path does not pass through the storage nodes. These memories can scale with the core logic and can in fact operate effectively well into subthreshold, i.e., they allow operation with VDD < Vth [16, 17].
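A simple way to see the read disturbance described above is to treat the two series NMOS devices as a resistive divider, a first-order approximation that ignores their actual operating regions: the precharged bit-line pulls the low storage node up to roughly

\[
V_{\text{read}} \approx V_{DD}\,\frac{R_{\text{pull-down}}}{R_{\text{access}} + R_{\text{pull-down}}} ,
\]

so keeping the node low relies on the cell pull-down being much stronger than the access device. Variation-induced mismatch weakens this ratio in some cells, and at lower VDD the disturbed level sits closer to the opposing inverter's trip point, which is the vanishing SNM seen in Figure 6.6(b).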
6.3.1 Guaranteeing SRAM Stability with DVS
In the 180 nm process used for the XScale, the manufacturing yield is negligibly impacted by SRAM read stability, even at VDD = 0.7 V, when only the two 32 kB caches are considered. However, adding large SOC SRAMs significantly affects the IC manufacturing yield at low VDD. The solution used for the 180 nm "Bulverde" application processor SOC [18] is to scale the XScale cache circuits with the dynamically scaling core and SOC logic supply voltage, while operating the large SOC SRAMs on a fixed supply [19]. The SRAMs and their voltage domains are shown in Figure 6.7. The SOC logic clock rate is 104 MHz or less, depending on the DVS point, while the core clock frequency scales from 104 MHz to over 500 MHz [18, 20].
Figure 6.6 SRAM SNM at various voltages (a). The mean and 5σ SNM from Monte Carlo simulations (b) show vanishing SNM at low voltages. The XScale SOC logic level shifts SRAM input signals and operates the SRAMs at a constant voltage where SNM is maintained.
Figure 6.7 SRAMs and their voltage domains in the XScale core and in the Bulverde application processor [20]. This diagram is greatly simplified to emphasize the DVS vs. constant VDD domains.
A constant 1.1 V SRAM power supply voltage (VDDSRAM) provides adequate access times for the slower SOC logic. In this manner, the SOC and microprocessor core logic VDD employ DVS, but the embedded SOC SRAM supply VDDSRAM is fixed. The fixed, higher minimum VDD for the additional SOC SRAMs assures high manufacturing yield together with a low minimum VDD for DVS. The fixed SRAM supply voltage also facilitates the low standby power Drowsy modes, which have a single optimal VDD that must be sufficient to allow raising the NMOS transistor source nodes toward VDD to apply NMOS body bias [11].

With two differing supply voltages, level shifting is required between the memories and the SOC logic. The added level shifters degrade the maximum performance, since they add delay. This is not an issue for low VDD operation: the higher SRAM VDD makes the memories fast compared to the surrounding logic operating at the lower VDD. The problem is that the level shifters slow the maximum clock rate of the design at high VDD by injecting extra delay into the memory access path.
The Bulverde SOC memory level shifting scheme is shown in Figure 6.8(a). To minimize the number of level shifters and limit the complexity, the address ADD(1:m) and some control signal voltages are translated to the different VDDSRAM power supply domain by the cross-coupled level shifting circuit evident at the decoder inputs. This scheme has the drawback that the word-line enable signal WLE, which is essentially a clock, and the array pre-charge signal PRECHN must be level shifted. The write and read column multiplexer control signals must also be level shifted; for clarity, these circuits are not shown in the figure. The differential sense amplifiers, which operate at the (potentially lower) DVS domain supply voltage, automatically shift the SRAM outputs OUTDATA to the correct voltage range. The sense timing signal SAE is also in the DVS domain.
Figure 6.8 Level shifting paths to allow the SRAM supply voltage VDDSRAM to remain constant while applying DVS to the surrounding logic. In (a) the level shifters are placed at the SRAM block interface, while in (b) the level shifters are at the storage array interface. In both cases, the sense amplifiers shift back to the DVS domain.
Additional power can be saved by the scheme shown in Figure 6.8(b), which shifts the voltage levels at the decoder outputs, i.e., at the SRAM word-line drivers. Here, the decoders reside in the scaled VDD domain and fewer control signals must be level shifted to the VDDSRAM domain.
6.4 PLL and Clock Generation Considerations
In this section, the implications of DVS for microprocessor clocking are considered. In the original 180 nm implementation, a simple approach was taken: there are minimal changes to the PLL and clock generation unit to support DVS. The feedback from the core clock tree to the PLL requires a PLL relock for each clock frequency change. In the 90 nm prototype, the PLL and clock generation unit were explicitly designed to support zero-latency clock frequency changes. Here, the PLL supply is derived from the I/O supply voltage via an internal linear regulator. Hence, the PLL power supply is not dynamically scaled with the processor core.
6.4.1 Clock Generation for DVS on the 180 nm 80200 XScale Microprocessor
The clock generation unit in the 80200 is shown in Figure 6.9. The ½ divider provides a high quality, nearly 50% duty cycle output. The feedback clock is derived from the core clock, to match the core clock (and I/O clock, which is not shown) phase to the reference clock. Experiments with PLL test chips showed that phase and frequency lock can be retained during voltage movements if the PLL power supply rejection ratio is sufficient and the slew rate is well controlled [21, 22]. This allows voltage adjustment while the processor is running, as mentioned. However, a change in the clock frequency changes the divide value N of the 1/N feedback clock divider. This causes an abrupt change in the frequency of the signal Feedback Clk, which necessitates that the PLL relock to the new frequency. The PLL generates a lock signal, derived from the charge pump activity. Depending on the operating voltage, the PLL can achieve lock in as little as a few microseconds. However, a dynamic lock time makes customer specification and testing more difficult; hence, a fixed lock time is used. Another scheme, which allows digital control of the clock divider ratio, was developed for the 90 nm XScale prototype test chip.
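Equivalently, with N denoting the feedback divider value programmed through CP14, the core clock is

\[
f_{\text{core}} = N \cdot f_{\text{ref}} ,
\]

so any write that changes N triggers the fixed relock interval just described before the core resumes at the new frequency. (The reference source is not stated here; the relation is given only to show why a divider change forces a relock.)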