A baseline leakage current at 25°C is taken from ITRS and then scaled according to temperature. HotSpot updates chip temperatures every 5 µs, at which point the simulator computes a leakage scaling factor for each block (at the same granularity used by Wattch) and uses it to scale the leakage power computed every cycle until the next temperature update.
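The chapter does not reproduce the leakage-temperature relationship itself at this point. As a rough illustration of the bookkeeping involved, the following Python sketch assumes a simple exponential dependence on temperature (a common architectural approximation in the spirit of the static power model in [5]); the coefficient is a placeholder, not a value from the text.

```python
import math

# Assumed empirical temperature coefficient (1/K); a placeholder, not a
# value from the text or from ITRS.
BETA_PER_K = 0.036

def leakage_scale(temp_c: float, ref_c: float = 25.0) -> float:
    """Scaling factor applied to a block's 25 °C baseline leakage."""
    return math.exp(BETA_PER_K * (temp_c - ref_c))

def scaled_leakage_power(baseline_leak_w: float, temp_c: float) -> float:
    # Recomputed after every HotSpot temperature update (every 5 µs) and
    # applied to the per-cycle leakage power until the next update.
    return baseline_leak_w * leakage_scale(temp_c)
```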
9.4.2 Frequency Island Simulator
This synchronous baseline was the starting point for an FI simulator. It is split into five clock domains: fetch/decode, rename/retire/register read, integer, floating point, and memory. Each domain has a power model for its clock signal that is based on the number of pipeline registers within the domain. Inter-domain communication is accomplished through the use of asynchronous FIFO queues [6], which offer improved throughput over many other synchronization schemes under nominal FIFO operation.

Several versions of the FI simulator were used in the evaluation. The first is the baseline version (FI-B), which splits the core into multiple clock domains but runs each one at the same 4.0 GHz clock speed as the synchronous baseline (SYNCH). This baseline FI processor does not implement any variability-aware frequency scaling; all of the others do. The second FI microarchitecture speeds up each domain as a result of the individual domains having fewer critical paths than the microprocessor as a whole. The speedups are taken from Table 9.3, and this version is called FI-CP. In the interests of reducing simulation time, only the mean speedups were simulated. These represent the average benefit that an FI processor would display over an equivalent synchronous processor on a per-domain basis over the fabrication of a large number of dies.
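The asynchronous FIFO queues described above can also be captured behaviorally. The sketch below is a minimal, illustrative model with simple full/empty checks made on each side's own clock edges; the latency of the synchronizing interface itself is abstracted away, and the class is not drawn from [6] or from the chapter's simulator.

```python
from collections import deque

class AsyncFIFO:
    """Minimal behavioral model of a mixed-clock FIFO between two
    domains.  Producer and consumer test full/empty independently on
    their own clock edges; synchronizer latency is not modeled."""

    def __init__(self, depth: int):
        self.depth = depth
        self.q = deque()

    def can_put(self) -> bool:       # checked on a producer clock edge
        return len(self.q) < self.depth

    def put(self, item) -> None:
        assert self.can_put(), "producer must stall when the FIFO is full"
        self.q.append(item)

    def can_get(self) -> bool:       # checked on a consumer clock edge
        return len(self.q) > 0

    def get(self):
        assert self.can_get(), "consumer must stall when the FIFO is empty"
        return self.q.popleft()
```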
The third version, FI-T, assigns each domain a baseline frequency that is equal to the synchronous baseline's frequency, but then scales each domain's frequency for its temperature according to Equation (9.7) after every chip temperature update (every 20,000 ticks of a 4.0 GHz reference clock). A final version, FI-CP-T, uses the speeds from FI-CP as the baseline domain speeds and then applies thermally aware frequency scaling. Both FI-T and FI-CP-T perform dynamic frequency scaling using an aggressive Intel XScale-style DFS system as in [16].
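Putting the four configurations together, a compact sketch of the per-domain frequency computation might look as follows. Table 9.3's speedups and Equation (9.7) are not reproduced in this excerpt, so CP_SPEEDUP holds placeholder values and eq_9_7_scale is a hypothetical stand-in (it also omits the supply-voltage dependence the real equation includes).

```python
from dataclasses import dataclass

BASE_FREQ_GHZ = 4.0     # SYNCH / FI-B clock speed
UPDATE_TICKS = 20_000   # temperature update interval: 5 us at 4.0 GHz

DOMAINS = ["fetch/decode", "rename/retire/regread", "integer",
           "floating point", "memory"]

# Mean per-domain critical-path speedups come from Table 9.3, which is
# not reproduced here; the 1.0 entries are placeholders.
CP_SPEEDUP = {d: 1.0 for d in DOMAINS}

def eq_9_7_scale(temp_c: float) -> float:
    # Hypothetical stand-in for Equation (9.7): cooler domains may run
    # faster than the worst-case rated frequency.  The linear form and
    # constants are illustrative only.
    T_WORST_C = 85.0
    return 1.0 + 0.002 * (T_WORST_C - temp_c)

@dataclass
class FIConfig:
    name: str
    use_cp: bool        # FI-CP / FI-CP-T: critical-path speedups
    use_thermal: bool   # FI-T / FI-CP-T: thermally aware scaling

    def domain_freq(self, domain: str, temp_c: float) -> float:
        f = BASE_FREQ_GHZ
        if self.use_cp:
            f *= CP_SPEEDUP[domain]
        if self.use_thermal:   # re-evaluated every UPDATE_TICKS
            f *= eq_9_7_scale(temp_c)
        return f

CONFIGS = [FIConfig("FI-B", False, False), FIConfig("FI-CP", True, False),
           FIConfig("FI-T", False, True), FIConfig("FI-CP-T", True, True)]
```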
9.4.3 Benchmarks Simulated
In order to accurately account for the effects of temperature on leakage power and power on temperature, simulations are iterated for each workload and configuration, feeding the output steady-state temperatures of one run back in as the initial temperatures of the next in search of a consistent operating point. This iteration continues until temperature and power values converge, rather than performing a set number of iterations. With this methodology, the initial temperatures of the first run do not affect the final results, only the number of iterations required.
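A minimal sketch of this fixed-point iteration follows. The run_simulation callable, its return shape, and the convergence tolerances are all assumptions; the text specifies only that iteration continues until temperature and power converge.

```python
def iterate_to_steady_state(run_simulation, workload, config, init_temps,
                            tol_temp_c=0.1, tol_power_w=0.01, max_iters=20):
    """Fixed-point iteration over the power/temperature feedback loop.

    run_simulation is a caller-supplied wrapper around the simulator; it
    is assumed to return an object with .temps (per-block steady-state
    temperatures) and .power (average power)."""
    temps, power = dict(init_temps), None
    for _ in range(max_iters):
        result = run_simulation(workload, config, initial_temps=temps)
        temp_delta = max(abs(result.temps[b] - temps[b]) for b in temps)
        if power is not None and temp_delta < tol_temp_c \
                and abs(result.power - power) < tol_power_w:
            return result  # consistent operating point found
        temps, power = dict(result.temps), result.power
    raise RuntimeError("power/temperature iteration did not converge")
```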
The large number of runs required per benchmark prevented simulation of the entire suite of SPEC2000 benchmarks due to time constraints. Simulations were completed for seven of the benchmarks: the 164.gzip, 175.vpr, 197.parser, and 256.bzip2 integer benchmarks and the 177.mesa, 183.equake, and 188.ammp floating point benchmarks.

The simulation methodology addresses time variability by simulating three points within each benchmark, starting at 500, 750, and 1,000 million instructions and gathering statistics for 100 million more. The one exception was 188.ammp, which finished too early; instead, it was fast-forwarded 200 million instructions and then run to completion. Because the FI microprocessor is globally asynchronous, space variability is also an issue (e.g., the exact order in which clock domains tick could have a significant effect on branch prediction performance, as the arrival time of prediction feedback will be altered). The simulator randomly assigns phases to the domain clocks, which introduces slight perturbations into the ordering of events and so averages out possible extreme cases over three runs per simulation point per benchmark. Both types of variability were thus addressed using the approaches suggested by Alameldeen and Wood [1].
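The phase-randomization step might be sketched as follows; the uniform phase distribution over one clock period and the event-list construction are illustrative assumptions, not details from the text.

```python
import random

def random_clock_phases(domain_freqs_ghz, seed=None):
    """Assign each domain clock a random initial phase within one period,
    perturbing the global ordering of domain ticks between runs."""
    rng = random.Random(seed)
    return {d: rng.uniform(0.0, 1.0 / f_ghz)   # phase offset in ns
            for d, f_ghz in domain_freqs_ghz.items()}

def tick_schedule(domain_freqs_ghz, phases, horizon_ns):
    """Global order of (time_ns, domain) clock edges up to horizon_ns."""
    events = []
    for d, f_ghz in domain_freqs_ghz.items():
        period_ns, t = 1.0 / f_ghz, phases[d]
        while t < horizon_ns:
            events.append((t, d))
            t += period_ns
    return sorted(events)

# Three runs per simulation point would use three different seeds,
# averaging out extreme orderings of, e.g., branch-prediction feedback.
freqs = {"fetch/decode": 4.0, "integer": 4.0, "memory": 4.0}
print(tick_schedule(freqs, random_clock_phases(freqs, seed=0), 1.0)[:5])
```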
9.5 Results
The FI configurations are compared on execution time, average power, total energy, and energy-delay² in Figure 9.5.
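The derived metrics follow directly from execution time and average power; the sketch below also reproduces the FI-B arithmetic reported in Section 9.5.1 as a check.

```python
def metrics(exec_time: float, avg_power: float):
    """Energy and energy-delay^2 from execution time and average power."""
    energy = avg_power * exec_time        # E = P * t
    ed2 = energy * exec_time ** 2         # ED^2 = E * t^2
    return {"energy": energy, "ed2": ed2}

# Values normalized to the synchronous baseline: a 7.5% longer runtime at
# 10.7% lower power (the FI-B result reported below) gives:
base = metrics(1.0, 1.0)
fi_b = metrics(1.075, 0.893)
print(fi_b["energy"] / base["energy"])  # ~0.96: about 4% less energy
print(fi_b["ed2"] / base["ed2"])        # ~1.11: about 11% higher ED^2
```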
9.5.1 Frequency Island Baseline
Moving from a fully synchronous design to a frequency island one (FI-B) incurs an average 7.5% penalty in execution time. There is a fair amount of variation between benchmarks in the significance of the performance degradation. Both 164.gzip and 197.parser run about 11% slower, while 177.mesa and 183.equake only suffer a slowdown of around 2%. Broadly,
floating point applications are impacted less than integer ones since many
of their operations inherently have longer latencies, reducing the
significance of the latency added due to crossing FI boundaries (188.ammp seems to be an exception). Workloads which exhibit large numbers of stalls due to waiting on memory or other resources are those which observe the smallest performance penalties, since the extra latency due to FI is almost completely hidden behind these stalls.
Due to the use of small local clock networks and the stretching of execution time, the FI processor draws 10.7% less power per cycle, resulting in the consumption of 4.0% less energy than the synchronous baseline over the execution of the same instructions. Energy-delay² is increased by 11.3% in making the move to the baseline FI architecture, making it uncompetitive for all but the most power-limited applications (in which case the 10% reduction in power draw might not be large enough to be significant).
Figure 9.5 Simulation results relative to the synchronous baseline.
9.5.2 Frequency Island with Critical Path Information
FI-CP adds the speedups calculated from the critical path information in Section 9.2.4 to the FI baseline. Despite the average per-domain speedup in FI-CP being 3.1%, execution time decreases by only 1.4% because of the mismatch between speedups. The fetch and memory domains are barely sped up at all as a result of the large number of critical paths in the first-level caches, in keeping with the findings of Humenay et al. that the L1 caches are limiters of clock frequency in modern microprocessors [9]. This decreases the average number of executed instructions per clock tick for each back-end domain because of two factors. First, instructions are entering their window at a relatively reduced rate due to the low instruction cache speedup. Second, load-dependent instructions must wait relatively longer for operands due to the low data cache speedup. Thus, although the computation domains cycle more often, there is a much smaller increase in the amount of work that can actually be done in a fixed amount of time.
Benchmarks which are computation limited see the largest improvements, while those that are memory limited gain little. 183.equake actually appears to suffer an increase in execution time, which is likely due to simulation variability. As a result of the faster clocking of domains, the average power drawn per cycle increases very slightly when enabling these speedups (by about 1.4%). However, the faster execution leads to essentially no change in energy usage and an overall energy-delay² reduction of 2.9%. These small improvements alone do not create a design which is competitive with the fully synchronous baseline.
FI-CP suffers from the domain partitioning used, which is performed based on the actual functionality of blocks without taking into account the number of critical paths that they contain. A better partitioning might use some metric that relates the number of critical paths in a block to its criticality to performance. However, "criticality to performance" can be difficult to quantify, since the critical path through the core may be different for different applications.
Moreover, there is overhead associated with every domain and domain boundary crossing. Combining domains can reduce the required number of domain boundary crossings as well as design complexity, but will also reduce the power savings introduced by the FI clocking scheme (since it merges some small clock networks to create a single larger one). Furthermore, it reduces the flexibility of the FI microarchitecture and might impact opportunities for VAFS or dynamic voltage/frequency scaling. On the other hand, splitting a clock domain into multiple smaller domains requires the opposite set of trade-offs to be evaluated.
9.5.3 Frequency Island with Thermally Aware Frequency Scaling
FI-T applies thermally aware frequency scaling to the FI baseline, running each domain as fast as possible given its current temperature rather than always assuming the worst-case temperature. FI-T offers significantly better performance than FI-B or FI-CP. In fact, accounting for dynamic thermal variation results in an average execution time reduction of 8.7% when compared to the fully synchronous baseline. As expected, the performance improvement enabled by thermally aware frequency scaling
is highly dependent on the behavior (both thermal and otherwise) of the
workload under consideration. 188.ammp runs cool and so sees a large performance boost, finishing in 14.4% less time on the FI processor with thermally aware frequency scaling than on the synchronous baseline. However, many other benchmarks see similar frequency increases, but do not display as large an execution time reduction. For example, 183.equake is memory-bound and so gains relatively little (only a 1% speedup relative to the synchronous baseline), despite the large increases in its clock domain frequencies.
Two things are required to see a significant performance gain from thermally aware frequency scaling: sufficient thermal headroom and application behavior which can take advantage of the resulting frequency increases in the core. This translates into a large amount of variation in the change in energy efficiency brought about by enabling thermally aware frequency scaling.
Since FI-T runs the domain clocks somewhat faster than the baseline speed, a significant average power penalty of 13.3% relative to the synchronous baseline is observed. This corresponds to a 3.4% increase in the amount of energy used to perform the same amount of computation. However, the larger reduction in the amount of time required to perform the computation leads to an average energy-delay² 13.4% lower than the synchronous baseline's.
FI-T suffers somewhat from naïvely speeding up domains regardless of whether this improves performance or not. The most egregious example is the speeding up of the floating point domain in the integer benchmarks. This may even adversely affect performance because each clock tick dissipates some power, regardless of whether there are any instructions in the domain or not. This results in higher local temperatures, which may spill over into a neighboring domain which is critical to performance, causing it to be clocked at a lower speed.
One solution is to use some control scheme similar to those used for DVFS to decide whether a domain should actually be sped up and by how much. Equation (9.7), which describes the scaling of frequency with temperature, also includes the dependence of frequency on supply voltage, so DVFS could possibly be integrated with VAFS. An integrated control system would be required to prevent the two schemes from pulling clock frequency in opposite directions. This area requires further research.
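One plausible shape for such a gating policy, sketched below, applies the thermally derived speedup only to domains that are busy enough to benefit. The utilization metric and threshold are assumptions, not a scheme proposed in the text.

```python
def gated_thermal_speedup(domain_util, thermal_scale, util_threshold=0.25):
    """Apply the thermally derived speedup only to busy domains, leaving
    idle ones (e.g., the floating point domain in an integer benchmark)
    at base speed to avoid needless heating of their neighbors.

    domain_util:   fraction of recent cycles the domain held instructions
    thermal_scale: frequency scaling factor allowed by Equation (9.7)
    """
    return {d: thermal_scale[d] if util >= util_threshold else 1.0
            for d, util in domain_util.items()}

# Example: the FP domain is idle in an integer workload, so it stays at 1.0x.
print(gated_thermal_speedup(
    {"integer": 0.9, "floating point": 0.02},
    {"integer": 1.08, "floating point": 1.15}))
# {'integer': 1.08, 'floating point': 1.0}
```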
Like FI-CP, FI-T could also benefit from a more intelligent domain partitioning. Since each domain's speed is limited by its hottest block, it might make sense to group blocks into domains based on whether they tend to run cool, hot, or in between. However, while there are some functional blocks which can be identified as generally being hotspots (e.g., the integer register file and scheduling logic), the temperature at which other blocks run is highly workload-dependent (e.g., the entire floating point unit).
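A thermally guided partitioning pass might start from per-block steady-state temperatures, as in the sketch below. The hot/cool thresholds are placeholders, and, per the caveat above, any such static grouping would misclassify workload-dependent blocks like the floating point unit.

```python
def partition_by_temperature(block_temps_c, cool_max=55.0, hot_min=75.0):
    """Group blocks into cool / warm / hot candidate domains based on
    observed steady-state temperatures (thresholds are assumptions)."""
    domains = {"cool": [], "warm": [], "hot": []}
    for block, t in block_temps_c.items():
        if t <= cool_max:
            domains["cool"].append(block)
        elif t >= hot_min:
            domains["hot"].append(block)
        else:
            domains["warm"].append(block)
    return domains

# Example with illustrative temperatures (not measurements from the text):
print(partition_by_temperature(
    {"int regfile": 82.0, "scheduler": 79.0, "L1 dcache": 52.0, "fpu": 65.0}))
```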
9.5.4 Frequency Island with Critical Path Information and Thermally Aware Frequency Scaling
The results for FI-CP-T, which applies both variability-aware frequency scaling schemes, show that the two are largely additive. A 10.0% reduction in execution time is achieved at the cost of 14.6% higher average power; the total energy penalty is 3.0%. This actually represents a reduction in the amount of energy consumed relative to FI-T. The reduction in execution time from FI-T to FI-CP-T is 1.4%, the same as that observed in moving from FI-B to FI-CP. An initial fear when combining FI-CP and FI-T was that the higher baseline speeds resulting from the FI-CP speedups would cause a sufficient increase in temperature to reduce the FI-T speedups by an equal amount, resulting in a scheme that offered no better performance than FI-T and was more complex. However, these results show that the speedups applied by FI-CP and FI-T are largely independent. The final energy-delay² reduction offered by full VAFS is 16.1%.
The synergy between the two schemes is due to the fact that the caches tend to run cool. As a result, thermally aware frequency scaling speeds up the clock domains containing the L1 caches slightly more than the others, which helps to mitigate the lack of speedup for the caches in FI-CP. Thus, the speedups of the computation domains due to considering critical path information can be better taken advantage of.
9.6 Conclusion
Variability is one of the major concerns that microprocessor designers will have to face as technology scaling continues. It is potentially easier for a frequency island design to address variability as a result of the processor being partitioned into multiple clock domains, which allows the negative effects of variability on maximum frequency to be localized to the domain in which they occur. Such variability-aware frequency scaling can be used to address both process and thermal variabilities.
The effects of random within-die process variability will be difficult to mitigate using a simple FI partitioning of the core. The large number of critical paths in a modern processor means that even decoupling groups of functional blocks with relatively low critical path counts from those with higher ones does not yield a large improvement in their mean frequencies.
On the other hand, exploiting the thermal slack between current operating temperatures and the maximum operating temperature by speeding up cooler clock domains proves to have significant performance and energy-efficiency benefits. An FI processor with such thermally aware frequency scaling is capable of overcoming the performance disadvantages inherent to the FI design style to achieve better performance than a similarly organized fully synchronous microprocessor.
As technology continues to scale, the magnitude of process variations will increase due to the need to print ever-smaller features, while thermal variation also worsens due to greater transistor density causing a higher difference in power densities across the chip. It will soon be the case that such variations can no longer be handled below the microarchitecture level and abstracted away, and the benefits from creating a variability-tolerant or variability-aware microarchitecture will outweigh the increased work and design complexity involved.
Acknowledgments
The authors thank Siddharth Garg for his assistance with generating the critical path model results.
References
[1] A. Alameldeen and D. Wood, "Variability in Architectural Simulations of Multi-threaded Workloads", HPCA'03: Proceedings of the 9th International Symposium on High-Performance Computer Architecture, 2003, pp. 7–18.
[2] K. Bowman, S. Duvall and J. Meindl, "Impact of Die-to-die and Within-die Parameter Fluctuations on the Maximum Clock Frequency Distribution for Gigascale Integration", IEEE Journal of Solid-State Circuits, February 2002, Vol. 37, No. 2, pp. 183–190.
[3] K. Bowman, S. Samaan and N. Hakim, "Maximum Clock Frequency Distribution with Practical VLSI Design Considerations", ICICDT'04: Proceedings of the International Conference on Integrated Circuit Design and Technology, 2004, pp. 183–191.
[4] D. Brooks, V. Tiwari and M. Martonosi, "Wattch: A Framework for Architectural-level Power Analysis and Optimizations", ISCA'00: Proceedings of the 27th International Symposium on Computer Architecture, 2000, pp. 83–94.
[5] J. Butts and G. Sohi, "A Static Power Model for Architects", MICRO 33: Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, 2000, pp. 191–201.
[6] T. Chelcea and S. Nowick, "Robust Interfaces for Mixed-Timing Systems with Application to Latency-insensitive Protocols", DAC'01: Proceedings of the 38th Annual Design Automation Conference, 2001, pp. 21–26.
[7] S. Herbert, S. Garg and D. Marculescu, "Reclaiming Performance and Energy Efficiency from Variability", PAC2'06: Proceedings of the 3rd Watson Conference on Interaction Between Architecture, Circuits, and Compilers, 2006.
[8] H. Hua, C. Mineo, K. Schoenfliess, A. Sule, S. Melamed and W. Davis, "Performance Trend in Three-dimensional Integrated Circuits", IITC'06: Proceedings of the 2006 International Interconnect Technology Conference, 2006, pp. 45–47.
[9] E. Humenay, D. Tarjan and K. Skadron, "Impact of Parameter Variations on Multi-core Chips", ASGI'06: Proceedings of the 2006 Workshop on Architectural Support for Gigascale Integration, 2006.
[10] A. Iyer and D. Marculescu, "Power and Performance Evaluation of Globally Asynchronous Locally Synchronous Processors", ISCA'02: Proceedings of the 29th International Symposium on Computer Architecture, 2002, pp. 158–168.
[11] X. Liang and D. Brooks, "Mitigating the Impact of Process Variations on Processor Register Files and Execution Units", MICRO 39: Proceedings of the 39th Annual ACM/IEEE International Symposium on Microarchitecture, 2006, pp. 504–514.
[12] D. Marculescu and E. Talpes, "Variability and Energy Awareness: A Microarchitecture-level Perspective", DAC'05: Proceedings of the 42nd Annual Design Automation Conference, 2005, pp. 11–16.
[13] M. Orshansky, C. Spanos and C. Hu, "Circuit Performance Variability Decomposition", IWSM'99: Proceedings of the 4th International Workshop on Statistical Metrology, 1999, pp. 10–13.
[14] G. Semeraro, G. Magklis, R. Balasubramonian, D. Albonesi, S. Dwarkadas and M. Scott, "Energy-efficient Processor Design Using Multiple Clock Domains with Dynamic Voltage and Frequency Scaling", HPCA'02: Proceedings of the 8th International Symposium on High-Performance Computer Architecture, 2002, pp. 29–42.
[15] K. Skadron, M. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan and D. Tarjan, "Temperature-aware Microarchitecture", ISCA'03: Proceedings of the 30th International Symposium on Computer Architecture, 2003, pp. 2–13.
[16] Q. Wu, P. Juang, M. Martonosi and W. Clark, "Formal Online Methods for Voltage/Frequency Control in Multiple Clock Domain Microprocessors", ASPLOS-XI: Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, 2004, pp. 248–259.
[17] W. Zhao and Y. Cao, "New Generation of Predictive Technology Model for Sub-45nm Design Exploration", ISQED'06: Proceedings of the 7th International Symposium on Quality Electronic Design, 2006, pp. 585–590.
Asynchronicity in Processor Design
Steve Furber, Jim Garside
The University of Manchester, UK
10.1 Introduction
Throughout most of the history of the microprocessor, designers have employed an approach based on the use of a central clock to control functional units within the processor. While there are situations – such as the musicians in a symphony orchestra or the crew of a rowing boat – where global synchrony is a vital aspect of the overall functionality, a microprocessor is not such a system. Here the clock is merely a design convenience, a constraint on how the system's components operate that simplifies some design issues and allows the use of a well-developed VLSI design flow where the designer can analyse the entire system state at any instant and use this to influence the transition to the next state. The clock has become so dominant in modern processor design that few designers ever stop to consider dispensing with it; however, it is not necessary – synchronisation may be restricted to places where it is essential to function.
Although a tremendous aid in simplifying a complex design task, the globally clocked model does have its drawbacks. In engineering terms, perhaps the greatest problem is the difficulty of sustaining the fundamental assumption of the model, which is that the clock arrives simultaneously at every latch in the system. This not only is a considerable headache in its own right but also results directly in undesirable side effects such as power wastage and high levels of electromagnetic emission. However, here the primary concern is adaptivity and, in this too, the synchronous model is an obstacle.