As a concrete example, assume the processor is running at 1 GHz and VDD = 1.75 V. If half of the cycles are stalls waiting for the bus, as determined by a combination of the total clock count, instructions executed, and data dependency stall or bus request counts, then VDD can be adjusted to 1.2 V (see Figure 6.2) and the core frequency reduced to 500 MHz. Useful work is then performed in a greater fraction of the (fewer overall) core clock cycles. Referring to Figure 6.2, the power savings is nearly 50%, with the same work finished in the same amount of time.
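A first-order estimate makes the magnitude of that savings plausible. Switching energy per operation scales with the square of the supply voltage, independent of the clock frequency, so, ignoring leakage and the reduced activity during stall cycles,

\[
\frac{E_{1.2\,\text{V}}}{E_{1.75\,\text{V}}} = \left(\frac{1.2}{1.75}\right)^{2} \approx 0.47 ,
\]

which is roughly the "nearly 50%" savings quoted above for the same work completed in the same wall-clock time.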
6.2 Dynamic Voltage Scaling on the XScale Microprocessor
This section describes experimental results running DVS on the 180 nm XScale microprocessor. The value of DVS is evident in Figure 6.3. Here, the 80200 microprocessor is shown functioning across a power range from 10 mW in idle mode up to 1.5 W at a 1 GHz clock frequency. The idle mode power is dominated by the PLL and clock generation unit. The processor core includes the capacity to apply reverse-body bias and supply collapse [10, 11] to the core transistors for fully state-retentive power-down. The microprocessor core consumes 100 μW in the low standby "Drowsy" mode [12]. The PLL and clock divider unit must be restarted when leaving Drowsy mode. When running with a clock frequency of 200 MHz, VDD can be reduced to 700 mV, providing power dissipation of less than 45 mW.
Figure 6.3 The value of dynamic voltage scaling is evident from this plot of the 80200 power and VDD voltage over time. The power lags due to the latency of the measurement and time averaging.
6.2.1 Running DVS
To demonstrate DVS on the XScale, a synthetic benchmark, programmed on the LRH demonstration board, is used here. The onboard voltage regulator is bypassed, and a daughter-card using a Lattice GAL22v10 PLD controller and a Maxim MAX1855 DC-DC converter evaluation kit is added. The DC-DC converter output voltage can vary from 0.6 to 1.75 V. The control is memory mapped, allowing software to control the processor core VDD.
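As an illustrative sketch only (the register address, the 32-bit width, and the helper name setvddcode are assumptions for this discussion, not the actual board mapping), the memory-mapped control might be declared and written as follows; the voltagep pointer is the one used by the voltage adjustment routines later in this section.

// Hypothetical mapping of the PLD register that sets the DC-DC converter output.
// The address and the register width are illustrative assumptions only.
#define VDD_CTRL_ADDR 0x0C000000UL
volatile unsigned int *voltagep = (volatile unsigned int *) VDD_CTRL_ADDR;
int voltage;                     /* current PLD voltage code, mirrored in software */

void setvddcode(int code) {
    voltage = code;              /* remember the code so later changes can be relative */
    *voltagep = code;            /* the write moves the core VDD set point */
}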
The synthetic benchmark loops between a basic block of code whose data set fits entirely in the cache (these pages are configured for write-back mode) and one whose data is non-cacheable and non-bufferable. The latter requires many more bus operations, which stall the core since the bus frequency of 100 MHz is lower than the core clock frequency, which must be at least 3× the bus frequency on the demonstration board.
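A minimal sketch of such a benchmark loop is given below. The buffer sizes and iteration structure are illustrative assumptions, and the non-cacheable, non-bufferable mapping of uncached_buf is assumed to be established elsewhere through the page attributes mentioned above.

#define CACHED_WORDS   4096              /* small enough to fit in the data cache (assumed) */
#define UNCACHED_WORDS 4096

static int cached_buf[CACHED_WORDS];     /* resides in write-back cacheable pages */
volatile int *uncached_buf;              /* non-cacheable, non-bufferable region, mapped by setup code (not shown) */

int benchmark_pass(void) {
    int i, sum = 0;
    /* Phase B: the data fits in the cache, so the measured CPI is low. */
    for (i = 0; i < CACHED_WORDS; i++)
        sum += cached_buf[i];
    /* Phase A: every access is a bus operation at 100 MHz, so the CPI is high. */
    for (i = 0; i < UNCACHED_WORDS; i++)
        sum += uncached_buf[i];
    return sum;
}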
The code monitors the actual operational CPI using the processor PMU. The number of executed instructions, as well as the number of clocks since the PMU was initialized and counting began, are monitored. The C code, with inline assembly to perform the low-level functions, is:
unsigned int count0, count1, count2;

int cpi() {
    // read and stop the performance counters
    // (r0 and r1 are assumed to survive between the consecutive asm statements)
    asm("mrc p14, 0, r0, c0, c0, 0" ::: "r0");   // read the PMNC register
    asm("bic r1, r0, #1" ::: "r1");              // clear the enable bit
    asm("mcr p14, 0, r1, c0, c0, 0" ::: "r1");   // clear interrupt flags, disable counting
    // read CCNT (clock count) and the two event counters
    asm("mrc p14, 0, %0, c1, c0, 0" : "=r" (count0));
    asm("mrc p14, 0, %0, c2, c0, 0" : "=r" (count1));
    asm("mrc p14, 0, %0, c3, c0, 0" : "=r" (count2));
    return(count0);
}
void startcounters() {
    unsigned int z;
    // set up and turn on the performance counters
    z = 0x00710707;                              // PMNC initialization value
    asm("mov r1, %0" :: "r" (z) : "r1");         // initialization value in register r1
    asm("mcr p14, 0, r1, c0, c0, 0" ::: "r1");   // write r1 to the PMNC register
}
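Assuming, based on the description above, that count0 returns the clock count and count1 the number of executed instructions (the exact event selection depends on the PMNC initialization value and is not decoded here), a time slice can be measured as in the sketch below; run_time_slice() is a hypothetical placeholder for the workload.

extern void run_time_slice(void);          /* hypothetical placeholder for the workload */

// Sketch: estimate the CPI over one time slice using the routines above.
float measure_cpi(void) {
    startcounters();                       // reset and enable the PMU counters
    run_time_slice();                      // run the work to be characterized
    cpi();                                 // stop counting and latch count0, count1, count2
    return (float) count0 / (float) count1;  // clocks per instruction
}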
Note that the code to utilize the PMU is neither large nor complicated. It is also straightforward to implement the actual VDD and core clock rate changes. To avoid creating a timing violation in the processor logic, the core voltage VDD must always be sufficient to support the core operating frequency. This requires that the voltage be raised before the frequency is increased and, conversely, that the frequency be reduced before the voltage is lowered. The XScale controls the clock divider ratio from the PLL through writes to CP14. The C code to raise the VDD voltage is:
int raisevoltage() {
    // raise the voltage first
    if (voltage <= TOP_V) {                      // already at the highest supported VDD
        printf("V at end of range ");            // leave it alone
    }
    else {
        voltage--;                               // a lower PLD code selects a higher VDD
        *voltagep = voltage;
        // then raise the core clock to the frequency allowed at the new VDD
        if (frequency < TOP_F) {
            frequency = uf[voltage];
            asm("mov r1, %0" :: "r" (frequency) : "r1");
            asm("mcr p14, 0, r1, c6, c0, 0" ::: "r1");  // write the new divider ratio to CP14
        }
    }
    return(voltage);
}
The code to lower the voltage is very similar, with its range check preventing the voltage from being set too low. The supported clock multipliers range from 3 to 11 [9]. The array uf[] is a lookup table of the appropriate frequency for each voltage setting. The PLD is programmed so that the highest voltage of 1.7 V is selected by writing the value 0, and higher values decrease the voltage in 50 mV steps (for the first four entries). No delay is required, since the coprocessor register write forces the core clocks to be inactive for approximately 20 μs while the PLL relocks to the new clock fraction; this is handled automatically by the XScale core hardware. Excellent power supply rejection ratio (PSRR) in the 80200 PLL allows the relock to occur in parallel with the voltage movement. When lowering the voltage, as mentioned, the frequency is reduced before the voltage. Again, the PLL lock time, which is invoked before the MCR P14 instruction can complete, hides the latency of the voltage movement from the software.
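A sketch of that voltage-lowering routine, consistent with the ordering just described, is shown below. BOTTOM_V is an assumed limit constant analogous to TOP_V, and the code assumes, per the PLD description above, that larger register values select lower voltages.

int lowervoltage() {
    if (voltage >= BOTTOM_V) {                     // already at the lowest supported VDD
        printf("V at end of range ");
    }
    else {
        // reduce the core clock first so VDD always supports the operating frequency
        frequency = uf[voltage + 1];               // frequency appropriate at the next lower VDD
        asm("mov r1, %0" :: "r" (frequency) : "r1");
        asm("mcr p14, 0, r1, c6, c0, 0" ::: "r1"); // write the new divider ratio to CP14
        voltage++;                                 // a larger PLD code selects a lower VDD
        *voltagep = voltage;
    }
    return(voltage);
}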
Figure 6.4 Simple DVS control heuristic using an estimate of the CPI as determined by the PMU. The CPI is estimated for each time slice and VDD adjusted if it is outside the dead-band parameters CPI_DB_high and CPI_DB_low. Otherwise, the VDD and clock frequency are unchanged.

Here, for illustration purposes, the control algorithm is very simple, as shown in Figure 6.4. All but the "Execute time slice…" block would be part of the OS. Behavior of the synthetic benchmark, using the code shown above, is shown in Figure 6.5(a). Many more complicated, and hence closer to optimal, VDD control algorithms have been developed, but they are application dependent and beyond the scope of this discussion. The frequency and voltage are increased by one increment if the measured CPI is below the predetermined value CPI_DB_high (the threshold below which the supply is driven high), and they are decreased by one increment if the CPI is above another predetermined value CPI_DB_low (the threshold above which the supply is driven low). They are left the same otherwise, i.e., the control dead-band is defined by the separation of the two values. Figure 6.5(b) shows the intervals more closely. The intervals running the bus-limited data access code are marked by A, and the faster running (cacheable data) code is marked by B. The distinct VDD voltage steps when the frequency and voltage are changed, as the data accesses move from one behavior to the other, are evident.
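The heuristic of Figure 6.4 can be written compactly as the loop sketched below, using the routines defined earlier. run_time_slice() is again a hypothetical placeholder, and, following the convention above, CPI_DB_high is the threshold below which the supply is driven higher and CPI_DB_low the threshold above which it is driven lower.

extern void run_time_slice(void);          /* hypothetical placeholder for one OS time slice */
extern float CPI_DB_high, CPI_DB_low;      /* dead-band thresholds, values chosen by the OS */

// Sketch of the dead-band DVS control heuristic of Figure 6.4.
void dvs_loop(void) {
    float cpi_est;
    for (;;) {
        startcounters();
        run_time_slice();
        cpi();
        cpi_est = (float) count0 / (float) count1;
        if (cpi_est < CPI_DB_high)
            raisevoltage();                // compute bound: raise VDD, then the frequency
        else if (cpi_est > CPI_DB_low)
            lowervoltage();                // bus bound: lower the frequency, then VDD
        // otherwise the CPI is inside the dead-band: leave VDD and frequency unchanged
    }
}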
Figure 6.5 Oscilloscope traces of VDD on the LRH test system. The system is running a synthetic benchmark that modifies VDD based on the CPI as determined by the PMU. (a)–(d) The distinct steps in voltage with each software-controlled clock rate and VDD change are evident. The VDD slew rate is shown in (e), where the supply ripple can also be seen.
Adjusting the size of the control heuristic dead-band to be too small causes the voltage to "hunt" when running the faster code, as evident in Figure 6.5(c), section B, since a stable CPI value between the one that causes an increase and the one that causes a decrease is not found. This hunting behavior is not efficient, since the PLL lock time is wasted for each 50 mV VDD movement. It is therefore important to define a large enough stable region and to make DVS changes (monitor the CPI) infrequently enough to keep the total voltage change time insignificant compared to the total operating time. A further adjustment in the heuristic affects the minimum usable voltage, by allowing still slower operation for the bus-limited code.
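As a rough illustration of that trade-off (the 100 ms evaluation interval is an assumed number, not one taken from the measurements), each voltage step costs roughly one 20 μs PLL relock, so evaluating the CPI once per 100 ms time slice bounds the transition overhead at about

\[
\frac{20\ \mu\text{s}}{100\ \text{ms}} = 2\times10^{-4} ,
\]

or 0.02% of the operating time. Hunting, by contrast, pays this relock penalty on nearly every interval.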
(Figure 6.5 panels: horizontal scales are (a) 1 s/div, (b) 400 ms/div, (c) 400 ms/div, (d) 200 ms/div, and (e) 20 μs/div; the A and B markers indicate the bus-limited and cacheable-data intervals, respectively.)
Figure 6.5(e) shows the maximum slew rate for the large voltage change from 1.0 to 1.7 V, which is the nearly vertical VDD movement near the end of the trace in Figure 6.5(d). The core VDD is slightly over-damped, as evident in Figure 6.5(e).
6.3 Impact of DVS on Memory Blocks
As mentioned in the introduction, some circuits may limit operation at low VDD. Microprocessors and SOC ICs include numerous memories, usually implemented with six-transistor SRAM cells. In future devices, it is expected that memory, and SRAM in particular, will dominate IC area [13]. Unfortunately, SRAM has diminishing read stability [14] as manufacturing processes are scaled down in size and transistor-level variations increase [15]. Lower VDD profoundly reduces SRAM read stability, making it a primary limiting circuit when applying DVS.

When the SRAM is read, the low storage node rises due to the voltage divider comprised of the two series NMOS transistors in the read current path, which includes one of the storage nodes. Monte Carlo simulations of SRAM static noise margin are shown in Figure 6.6. As VDD is decreased, the static noise margin (SNM), as measured by the side of the largest square that fits in the smaller lobe of the static voltage transfer curves (see Figure 6.6(a)), decreases as well. The large transistor mismatch, due to both systematic (die-to-die) and random (within-die) variations, causes asymmetry in the SNM plot, as shown in Figure 6.6(a). An IC contains many SRAM cells, so the combination of worst-case systematic and random variations can cause some cells to fail, significantly impacting the manufacturing yield at low VDD. The simulated behavior of the SRAM SNM vs. voltage, using Monte Carlo device variations to 5σ, is shown in Figure 6.6(b). It is evident that the SRAM read margins are strongly affected by the combination of transistor variation and reduced VDD. Register file memory, which is also ubiquitous in microprocessors and SOC ICs, does not suffer from reduced SNM when reading, since the read current path does not pass through the storage nodes. These memories can scale with the core logic and can in fact operate effectively well into subthreshold, i.e., they allow operation with VDD < Vth [16, 17].
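A simple way to see the read disturbance described above is to treat the two series NMOS devices as a resistive divider, a first-order approximation that ignores their actual operating regions: the precharged bit-line pulls the low storage node up to roughly

\[
V_{\text{read}} \approx V_{DD}\,\frac{R_{\text{pull-down}}}{R_{\text{access}} + R_{\text{pull-down}}} ,
\]

so keeping the node low relies on the cell pull-down being much stronger than the access device. Variation-induced mismatch weakens this ratio in some cells, and at lower VDD the disturbed level sits closer to the opposing inverter's trip point, which is the vanishing SNM seen in Figure 6.6(b).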
6.3.1 Guaranteeing SRAM Stability with DVS
In the 180 nm process used for the XScale, the manufacturing yield is negligibly impacted by SRAM read stability, even at VDD = 0.7 V, when only the two 32 kB caches are considered. However, adding large SOC SRAMs significantly affects the IC manufacturing yield at low VDD. The solution used for the 180 nm "Bulverde" application processor SOC [18] is to scale the XScale cache circuits with the dynamically scaling core and SOC logic supply voltage, while operating the large SOC SRAMs on a fixed supply [19]. The SRAMs and their voltage domains are shown in Figure 6.7. The SOC logic clock rate is 104 MHz or less, depending on the DVS point, while the core clock frequency scales from 104 MHz to over 500 MHz [18, 20].
Figure 6.6 SRAM SNM at various voltages (a). The mean and 5σ SNM from Monte Carlo simulations (b) show vanishing SNM at low voltages. The XScale SOC logic level shifts SRAM input signals and operates the SRAMs at a constant voltage where SNM is maintained.
Figure 6.7 SRAMs and their voltage domains in the XScale core and in the Bulverde application processor [20]. This diagram is greatly simplified to emphasize the DVS vs. constant VDD domains.
A constant 1.1 V SRAM power supply voltage (VDDSRAM) provides adequate access times for the slower SOC logic. In this manner, the SOC and microprocessor core logic VDD employ DVS, but the embedded SOC SRAM supply VDDSRAM is fixed. The fixed, higher minimum VDD for the additional SOC SRAMs assures high manufacturing yield together with a low minimum VDD for DVS. The fixed SRAM supply voltage also facilitates the low standby power Drowsy modes, which have a single optimal VDD that must be sufficient to allow raising the NMOS transistor source nodes toward VDD to apply NMOS body bias [11].

With two differing supply voltages, level shifting is required between the memories and the SOC logic. The added level shifters degrade the maximum performance, since they add delay. This is not an issue for low VDD operation: the higher SRAM VDD makes the memories fast compared to the surrounding logic operating at the lower VDD. The problem is that the level shifters slow the maximum clock rate of the design at high VDD by injecting extra delay into the memory access path.
The Bulverde SOC memory level shifting scheme is shown in Figure 6.8(a). To minimize the number of level shifters and limit the complexity, the address ADD(1:m) and some control signal voltages are translated to the different VDDSRAM power supply domain by the cross-coupled level shifting circuit evident at the decoder inputs. This scheme has the drawback that the word-line enable signal WLE, which is essentially a clock, and the array pre-charge signal PRECHN must be level shifted. The write and read column multiplexer control signals must also be level shifted; for clarity, these circuits are not shown in the figure. The differential sense amplifiers, which operate at the (potentially lower) DVS domain supply voltage, automatically shift the SRAM outputs OUTDATA to the correct voltage range. The sense timing signal SAE is also in the DVS domain.
Figure 6.8 Level shifting paths to allow the SRAM supply voltage VDDSRAM to remain constant while applying DVS to the surrounding logic. In (a) the level shifters are placed at the SRAM block interface, while in (b) the level shifters are at the storage array interface. In both cases, the sense amplifiers shift back to the DVS domain.
Additional power can be saved by the scheme shown in Figure 6.8(b), which shifts the voltage levels at the decoder outputs, i.e., at the SRAM word-line drivers. Here, the decoders reside in the scaled VDD domain and fewer control signals must be level shifted to the VDDSRAM domain.
6.4 PLL and Clock Generation Considerations
In this section, the implications of DVS for microprocessor clocking are considered. In the original 180 nm implementation, a simple approach was taken: there are minimal changes to the PLL and clock generation unit to support DVS. The feedback from the core clock tree to the PLL requires a PLL relock for each clock frequency change. In the 90 nm prototype, the PLL and clock generation unit were explicitly designed to support zero-latency clock frequency changes. Here, the PLL supply is derived from the I/O supply voltage via an internal linear regulator. Hence, the PLL power supply is not dynamically scaled with the processor core.
6.4.1 Clock Generation for DVS on the 180 nm 80200 XScale Microprocessor
The clock generation unit in the 80200 is shown in Figure 6.9. The ½ divider provides a high quality, nearly 50% duty cycle output. The feedback clock is derived from the core clock, to match the core clock (and I/O clock, which is not shown) phase to the reference clock. Experiments with PLL test chips showed that phase and frequency lock can be retained during voltage movements if the PLL power supply rejection ratio is sufficient and the slew rate is well controlled [21, 22]. This allows voltage adjustment while the processor is running, as mentioned. However, a change in the clock frequency changes the divide value N of the 1/N feedback clock divider. This causes an abrupt change in the frequency of the signal Feedback Clk, which necessitates that the PLL relock to the new frequency. The PLL generates a lock signal, derived from the charge pump activity. Depending on the operating voltage, the PLL can achieve lock in as little as a few microseconds. However, a dynamic lock time makes customer specification and testing more difficult; hence, a fixed lock time is used. Another scheme, which allows digital control of the clock divider ratio, was developed for the 90 nm XScale prototype test chip.
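Equivalently, with N denoting the feedback divider value programmed through CP14, the core clock is

\[
f_{\text{core}} = N \cdot f_{\text{ref}} ,
\]

so any write that changes N triggers the fixed relock interval just described before the core resumes at the new frequency. (The reference source is not stated here; the relation is given only to show why a divider change forces a relock.)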