Handbook of Algorithms for Physical Design Automation, Part 92



FIGURE 42.11 Clock hazards and timing constraints. (Figure labels: $FF_i$, logic, $FF_j$, $D_{ij}$, clk2q.)

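$FF_i$ and $FF_j$ as shown in Figure 42.11. Let $t_i$ and $t_j$ be the clock delays from the clock source to $FF_i$ and $FF_j$, respectively. Let $\mathrm{MAX}(D_{ij})$ and $\mathrm{MIN}(D_{ij})$ be the maximum and minimum delays of the combinational logic from $FF_i$ to $FF_j$, and let $t_i^{clk2q}$ be the clock-to-Q delay for $FF_i$. Let $t_j^{setup}$ and $t_j^{hold}$ be the setup time and hold time for $FF_j$, respectively. Let $P$ be the clock period. The setup time and hold time constraints can be expressed as

$$t_i + t_i^{clk2q} + \mathrm{MAX}(D_{ij}) + t_j^{setup} \le t_j + P \tag{42.2}$$

$$t_i + t_i^{clk2q} + \mathrm{MIN}(D_{ij}) \ge t_j + t_j^{hold} \tag{42.3}$$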

A clock schedule is a set of delays from the clock source to all registers in the synchronous system. The clock scheduling problem is to find a clock schedule $\{t_1, \ldots, t_N\}$ for all registers $FF_1, \ldots, FF_N$ that minimizes the clock period $P$ while satisfying the constraints in Equations 42.2 and 42.3. This problem can be formulated as a linear program as follows [20]:

LP_SPEED: Minimize $P$

subject to

$$t_j - t_i \ge t_j^{setup} + t_i^{clk2q} + \mathrm{MAX}(D_{ij}) - P \quad \text{for } i, j = 1, \ldots, N$$

$$t_i - t_j \ge t_j^{hold} - t_i^{clk2q} - \mathrm{MIN}(D_{ij}) \quad \text{for } i, j = 1, \ldots, N$$

$$t_i \ge \mathrm{MIN\_DELAY} \quad \text{for } i = 1, \ldots, N$$

Alternatively, we can find a clock schedule to maximize the minimum safety margin $M$ for a given clock period $P$. This problem can be formulated as a linear program as follows:

LP_SAFETY: Maximize $M$

subject to

$$t_j - t_i \ge t_j^{setup} + t_i^{clk2q} + \mathrm{MAX}(D_{ij}) - P + M \quad \text{for } i, j = 1, \ldots, N$$

$$t_i - t_j \ge t_j^{hold} - t_i^{clk2q} - \mathrm{MIN}(D_{ij}) + M \quad \text{for } i, j = 1, \ldots, N$$

In both formulations, $\mathrm{MAX}(D_{ij}) = -\infty$ and $\mathrm{MIN}(D_{ij}) = \infty$ if there is no combinational path from $FF_i$ to $FF_j$.
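For small examples, the LP_SPEED problem can be explored without an LP solver. The sketch below is a substitute technique, not the method of Ref. [20]: for a fixed $P$, the setup and hold constraints are difference constraints on the clock delays, so feasibility can be tested with Bellman-Ford negative-cycle detection, and the smallest feasible $P$ is then located by binary search. The function names, the layout of the `paths` dictionary, the search bound `p_max`, and the tolerances are all assumptions made for illustration. The MIN_DELAY constraint is omitted because a feasible schedule can be shifted by a constant without changing any difference $t_i - t_j$.

```python
def feasible_schedule(n, paths, P, setup, hold, clk2q):
    """Return clock delays [t_0, ..., t_{n-1}] meeting the setup and hold
    constraints for period P, or None if no such schedule exists.

    paths maps (i, j) -> (min_delay, max_delay) for each register pair
    joined by combinational logic; pairs without a path are simply absent.
    """
    # A constraint t_u - t_v <= w is stored as the edge (v, u, w).
    edges = []
    for (i, j), (d_min, d_max) in paths.items():
        # Setup: t_i - t_j <= P - setup_j - clk2q_i - MAX(D_ij)
        edges.append((j, i, P - setup[j] - clk2q[i] - d_max))
        # Hold: t_j - t_i <= clk2q_i + MIN(D_ij) - hold_j
        edges.append((i, j, clk2q[i] + d_min - hold[j]))

    # Starting every t at 0 is equivalent to a virtual source with
    # zero-weight edges to all registers.
    t = [0.0] * n
    for _ in range(n):
        changed = False
        for v, u, w in edges:
            if t[v] + w < t[u] - 1e-12:
                t[u] = t[v] + w
                changed = True
        if not changed:
            return t
    return None  # still relaxing after n passes: negative cycle


def min_period(n, paths, setup, hold, clk2q, p_max=1e3, tol=1e-6):
    """Binary-search the smallest feasible period in [0, p_max]."""
    if feasible_schedule(n, paths, p_max, setup, hold, clk2q) is None:
        return None  # infeasible even at p_max (e.g., conflicting hold constraints)
    lo, hi = 0.0, p_max
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if feasible_schedule(n, paths, mid, setup, hold, clk2q) is None:
            lo = mid
        else:
            hi = mid
    return hi, feasible_schedule(n, paths, hi, setup, hold, clk2q)


# Two registers in a loop: a slow path 0 -> 1 and a fast path 1 -> 0.
# Setup, hold, and clock-to-Q times are set to zero to keep the arithmetic simple.
paths = {(0, 1): (1.0, 4.0), (1, 0): (1.0, 2.0)}
zeros = [0.0, 0.0]
period, schedule = min_period(2, paths, zeros, zeros, zeros)
```

In this toy loop a zero-skew clock needs $P \ge 4$ because of the slow path, whereas the search converges to a period close to 3 with $t_1 - t_0$ close to 1: delaying the clock of register 1 lends time to the slow path until the hold constraint on that path becomes tight.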

After the clock schedule $S = \{t_1, \ldots, t_N\}$ is computed, the next step is to construct a clock network that realizes the obtained schedule. The DME algorithm in Section 42.2.4 can be easily extended to handle this problem: in the bottom-up phase, we only need to construct the merging segments to achieve the given skews instead of zero skews. However, the solutions of the linear programs may not be unique, so each clock delay $t_i$ could be a range rather than a fixed value. In this case, the clock routing problem becomes the bounded-skew routing tree (BST) problem. In Ref. [21], Cong et al. proposed two algorithms, BME (boundary merging and embedding) and IME (interior merging and embedding), to handle this problem. These two algorithms extend the DME algorithm by finding a polygonal region based on the skew bounds, rather than a merging segment, to represent all possible locations for each tapping point.


Apart from the original formulations of clock scheduling, there are several extensions. Neves and Friedman [22] formulated the process-variation-tolerant optimal clock skew scheduling problem. To better control the effects of process variations, they find the permissible range (i.e., the range of the clock skew without timing violation) for each local path, select a clock skew value that allows the maximum variation of skew within the permissible range, and finally determine the clock delay to each register. More recently, Ravindran et al. [23] discussed the multidomain clock skew scheduling problem. For a given number of clocking domains $n$ and a maximum permissible within-domain latency $\delta$, the multidomain range constraints require that all clock latencies fit into $n$ value ranges $(l(d_i), l(d_i) + \delta)$ for $i = 1, \ldots, n$. The objective of multidomain clock skew scheduling is to determine domain phase shifts $l(d_i)$ and register latencies that satisfy the clock domain constraints and minimize the clock period.
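The permissible range of a single local path follows directly from the setup and hold constraints: with skew $s = t_i - t_j$, the hold constraint gives the lower bound and the setup constraint gives the upper bound. The sketch below computes that range and picks its midpoint, which is one simple reading of "a skew value that allows the maximum variation". It is not the procedure of Ref. [22], which must reconcile the ranges of all paths at once. The function names and the numbers in the example are hypothetical.

```python
def permissible_range(P, clk2q_i, setup_j, hold_j, d_min, d_max):
    """Bounds (lo, hi) on the skew s = t_i - t_j for one local path."""
    lo = hold_j - clk2q_i - d_min        # from the hold constraint
    hi = P - setup_j - clk2q_i - d_max   # from the setup constraint
    return lo, hi


def centered_skew(lo, hi):
    """Midpoint of the range and the skew variation tolerated on each side."""
    if lo > hi:
        raise ValueError("empty permissible range: no skew meets both constraints")
    return (lo + hi) / 2.0, (hi - lo) / 2.0


# Illustrative numbers only (arbitrary time units).
lo, hi = permissible_range(P=10.0, clk2q_i=1.0, setup_j=0.5, hold_j=0.5,
                           d_min=2.0, d_max=7.0)
skew, tolerance = centered_skew(lo, hi)
```

With these numbers the range is $[-2.5, 1.5]$, so the centered skew is $-0.5$ and the path tolerates a skew variation of up to 2.0 in either direction.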

Finally, we briefly discuss two similar sequential optimization techniques, clock scheduling and retiming. They are, respectively, continuous and discrete optimizations with the same effect of minimizing the clock period [20]. The equivalence of the two techniques was studied in Ref. [24], where it is proved that there exists a retiming $R$ that achieves clock period $P$ if and only if there exists a clock schedule $S$ with the same clock period. However, the practical use of retiming is limited for two reasons. First, retiming has an adverse impact on the verification methodology. Second, using retiming for maximum performance often causes a steep increase in the number of registers. Clock scheduling does not have these two limitations. Another advantage of clock scheduling is that, because retiming can only move registers across discrete amounts of logic delay, the resulting system after retiming can still benefit from clock scheduling.

42.5 HANDLING VARIABILITY

In minimizing skew sensitivity to process variations, two guiding principles are that the network should be as symmetrical and as fast as possible. In a clock network designed and laid out symmetrically, chipwide process (or environmental) variations should affect all clock paths identically. An additional advantage is that any systematic skew caused by modeling errors is eliminated by symmetry. In a fast network, as the clock phase delay is small, any fractional variations in delay lead only to a modest amount of skew. In addition, a clock network with optimal delay is the most tolerant to process variations. At the optimal delay point with respect to a certain parameter, the delay sensitivity over that parameter (i.e., the slope of the delay function) should be zero. However, it is not trivial to apply these two principles in practice. Because of uneven load distribution and routing/buffer obstacles, it is usually impossible to construct a completely symmetrical network. Moreover, minimizing the network delay may conflict with the optimization of other metrics (e.g., skew, area, and power). Several important works on reliable clock network design under process variations are discussed below.

The concept of delay sensitivity is very useful in considering process variations. Pullela et al. [25] first made use of delay sensitivity with respect to wire width variations to improve the delay, skew, and skew sensitivity of a given clock tree by wire width optimization. The Elmore delay model and the L-type RC model for each branch are used in the paper, but the concept can be generalized to other models. Let $R_j$ be the resistance, $C_j$ be the capacitance, and $C_{dj}$ be the downstream capacitance of branch $j$. Let $U(i)$ be the set of all branches on the path from sink $i$ to the root. Then the Elmore delay from the root to sink $i$ is $T_{di} = \sum_{j \in U(i)} R_j C_{dj}$. Therefore, the sensitivities of the Elmore delay of sink $i$ with respect to circuit parameters $C_j$ and $R_j$ are

$$\frac{\partial T_{di}}{\partial C_j} = R_{cij} \tag{42.4}$$

$$\frac{\partial T_{di}}{\partial R_j} = \begin{cases} C_{dj} & \text{if } j \in U(i) \\ 0 & \text{otherwise} \end{cases} \tag{42.5}$$

where $R_{cij}$ is the total resistance along the common path from sink $i$ to the root and branch $j$ to the root. $R_j$ and $C_j$ can be expressed as functions of the width $w_j$ of branch $j$ as $R_j = R_0 L_j / w_j$ and $C_j = C_a L_j w_j + C_f L_j$, where $R_0$, $C_a$, and $C_f$ are technology parameters independent of $w_j$, and $L_j$ is the length of branch $j$. Therefore, the delay sensitivity of sink $i$ to width $w_j$ is

$$\frac{\partial T_{di}}{\partial w_j} = \frac{\partial T_{di}}{\partial C_j}\frac{\partial C_j}{\partial w_j} + \frac{\partial T_{di}}{\partial R_j}\frac{\partial R_j}{\partial w_j} = \frac{\partial T_{di}}{\partial C_j} C_a L_j - \frac{\partial T_{di}}{\partial R_j}\frac{R_0 L_j}{w_j^2}$$

By incremental computation as described in Ref. [26], Equations 42.4 and 42.5 for all $i$ and $j$ can be computed in $O(n^2)$ time for a tree with $n$ sinks.∗ Hence, the delay sensitivities for all sinks to all branch widths can also be found in $O(n^2)$ time.
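The sensitivity expressions can be checked numerically on a small tree. The sketch below is a direct, non-incremental evaluation; it does not reproduce the $O(n^2)$ scheme of Ref. [26] or the adjoint analysis of Ref. [25]. It assumes the L-type branch model (the capacitance of branch $j$ sits at its downstream end and therefore counts toward $C_{dj}$), a `parent` array in which every branch has a larger index than its parent, and sinks identified by the index of their final branch. All function names and the example values are hypothetical.

```python
def path_to_root(parent, j):
    """Branches on the path from branch j up to the root, j included."""
    out = []
    while j >= 0:
        out.append(j)
        j = parent[j]
    return out


def branch_rc(L, w, R0, Ca, Cf):
    """Per-branch resistance R_j = R0*L_j/w_j and capacitance C_j = Ca*L_j*w_j + Cf*L_j."""
    R = [R0 * Lj / wj for Lj, wj in zip(L, w)]
    C = [Ca * Lj * wj + Cf * Lj for Lj, wj in zip(L, w)]
    return R, C


def downstream_caps(parent, C, load):
    """C_dj: capacitance of branch j, its sink load, and everything below it."""
    Cd = [C[j] + load[j] for j in range(len(parent))]
    for j in reversed(range(len(parent))):  # children are visited before parents
        if parent[j] >= 0:
            Cd[parent[j]] += Cd[j]
    return Cd


def elmore_delay(i, parent, L, w, load, R0, Ca, Cf):
    """Elmore delay from the root to sink i: sum of R_k * C_dk over U(i)."""
    R, C = branch_rc(L, w, R0, Ca, Cf)
    Cd = downstream_caps(parent, C, load)
    return sum(R[k] * Cd[k] for k in path_to_root(parent, i))


def width_sensitivity(i, j, parent, L, w, load, R0, Ca, Cf):
    """Sensitivity of the delay of sink i to the width of branch j."""
    R, C = branch_rc(L, w, R0, Ca, Cf)
    Cd = downstream_caps(parent, C, load)
    U_i = set(path_to_root(parent, i))
    U_j = set(path_to_root(parent, j))
    dT_dC = sum(R[k] for k in U_i & U_j)   # common-path resistance R_cij
    dT_dR = Cd[j] if j in U_i else 0.0     # C_dj if j is on the path, else 0
    return dT_dC * Ca * L[j] - dT_dR * R0 * L[j] / w[j] ** 2


# A trunk (branch 0) feeding two sinks (branches 1 and 2); values are illustrative.
parent = [-1, 0, 0]
L = [100.0, 50.0, 80.0]
w = [2.0, 1.0, 1.0]
load = [0.0, 10.0, 15.0]
R0, Ca, Cf = 0.1, 0.2, 0.05
trunk = width_sensitivity(1, 0, parent, L, w, load, R0, Ca, Cf)
sibling = width_sensitivity(1, 2, parent, L, w, load, R0, Ca, Cf)
```

For this example, `trunk` is negative (widening the shared trunk speeds up sink 1) while `sibling` is positive (widening the other sink's branch only adds capacitance seen through the trunk). These signs are the kind of information a width-selection heuristic can use.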

In Ref. [25], a greedy heuristic is proposed to iteratively increase the wire widths to improve delay, skew, and skew sensitivity. The selection of the branch to widen in each step is based on the delay sensitivities, which give the delay change of each sink when a branch is widened. In particular, the authors argued that wire widening is a better method for delay balancing than wire elongation, as widening generally reduces skew sensitivity whereas elongation increases it.

Lu et al. [27] formulated the minimizing skew violation (MinSV) problem to construct a clock tree considering wire width variation due to process variations. Given the range of permissible skew for each pair of clock sinks, they tried to find a clock routing tree such that the maximum skew violation among all pairs of sinks is minimized under wire width variation. Their tree construction follows the framework of the DME algorithm. Because of wire width variation, the skew between a sink pair becomes a range rather than a unique value. To maximize the safety margin against process variations, in the bottom-up stage they chose the merging segment for the tapping point such that the center of the skew range of the most critical sink pair coincides with the center of the permissible range for this sink pair. Besides improving process variation tolerance, they also proposed an algorithm to minimize wirelength when there is no skew violation under wire width variation.

Recently, Rajaram et al. [28] proposed inserting cross links in a given clock tree to improve its skew sensitivity. Like the grid and the spine structures, the cross links equalize the delays of different points by connecting them together. Such an approach can tolerate both process and environmental variations. Moreover, because the cross links are selectively inserted based on the trade-off between skew sensitivity reduction and extra wire usage, this approach can achieve significant skew sensitivity reduction with little increase in wirelength. The link insertion algorithm is improved in Ref. [29].

REFERENCES

1. J. M. Rabaey, A. Chandrakasan, and B. Nikolić. Digital Integrated Circuits: A Design Perspective, 2nd edn. Prentice Hall, 2003.

2. H. Veendrick. Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits. IEEE Journal of Solid-State Circuits, SC-19:468–473, August 1984.

3. P. J. Restle, T. G. McNamara, D. A. Webber, P. J. Camporese, K. F. Eng, K. A. Jenkins, D. H. Allen, M. J. Rohn, M. P. Quaranta, D. W. Boerstler, C. J. Alpert, C. A. Carter, R. N. Bailey, J. G. Petronick, B. L. Krauter, and B. D. McCredie. A clock distribution network for microprocessors. IEEE Journal of Solid-State Circuits, 36(5):792–799, May 2001.

∗ In Ref. [25], an $O(n^2 \log n)$ algorithm by adjoint analysis is proposed.


4. M. Gowan, L. Biro, and D. Jackson. Power considerations in the design of the Alpha 21264 microprocessor. In Proceedings of the ACM/IEEE Design Automation Conference, San Francisco, CA, pp. 433–439, 1998.

5. D. R. Gonzales. Micro-RISC architecture for the wireless market. IEEE Micro, 19(4):30–37, 1999.

6. D. E. Duarte, N. Vijaykrishnan, and M. J. Irwin. A clock power model to evaluate impact of architectural and technology optimizations. IEEE Transactions on Very Large Scale Integration Systems, 10(6):844–855, December 2002.

7. M. A. B. Jackson, A. Srinivasan, and E. S. Kuh. Clock routing for high-performance ICs. In Proceedings of the ACM/IEEE Design Automation Conference, Orlando, FL, pp. 573–579, 1990.

8. J. Cong, A. B. Kahng, and G. Robins. Matching-based methods for high-performance clock routing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 12(8):1157–1169, August 1993 (DAC 1991).

9. R.-S. Tsay. An exact zero-skew clock routing algorithm. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 12(2):242–249, February 1993 (ICCAD 1991).

10. W. C. Elmore. The transient response of damped linear network with particular regard to wideband amplifiers. Journal of Applied Physics, 19:55–63, 1948.

11. M. Edahiro. Minimum skew and minimum path length routing in VLSI layout design. NEC Research and Development, 32(4):569–575, 1991.

12. T.-H. Chao, Y.-C. Hsu, and J.-M. Ho. Zero skew clock net routing. In Proceedings of the ACM/IEEE Design Automation Conference, Anaheim, CA, pp. 518–523, 1992.

13. K. D. Boese and A. B. Kahng. Zero-skew clock routing trees with minimum wirelength. In Proceedings of the IEEE International ASIC Conference, Rochester, NY, pp. 1.1.1–1.1.5, September 1992.

14. J. G. Xi and W. W.-M. Dai. Buffer insertion and sizing under process variations for low power clock distribution. In Proceedings of the ACM/IEEE Design Automation Conference, San Francisco, CA, pp. 491–496, 1995.

15. A. Vittal and M. Marek-Sadowska. Low-power buffered clock tree design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 965–975, September 1997 (DAC 1995).

16. S. Pullela, N. Menezes, J. Omar, and L. T. Pillage. Skew and delay optimization for reliable buffered clock trees. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 556–562, 1993.

17. B. J. Benschneider, A. J. Black, W. J. Bowhill, S. M. Britton, D. E. Dever, D. R. Donchin, R. J. Dupcak, R. M. Fromm, M. K. Gowan, P. E. Gronowski, M. Kantrowitz, M. E. Lamere, S. Mehta, J. E. Meyer, R. O. Mueller, A. Olesin, R. P. Preston, D. A. Priore, S. Santhanam, M. J. Smith, and G. M. Wolrich. A 300-MHz 64-b quad-issue CMOS RISC microprocessor. IEEE Journal of Solid-State Circuits, 30(11):1203–1214, November 1995 (ISSCC 1995).

18. S. Lin and C. K. Wong. Process-variation-tolerant clock skew minimization. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 284–288, 1994.

19. H. Su and S. S. Sapatnekar. Hybrid structured clock network construction. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 333–336, 2001.

20. J. P. Fishburn. Clock skew optimization. IEEE Transactions on Computers, 39(7):945–951, July 1990.

21. J. Cong, A. B. Kahng, C.-K. Koh, and C.-W. A. Tsao. Bounded-skew clock and Steiner routing. ACM Transactions on Design Automation of Electronic Systems, 3(3):341–388, 1998 (ICCAD 1995).

22. J. L. Neves and E. G. Friedman. Optimal clock skew scheduling tolerant to process variations. In Proceedings of the ACM/IEEE Design Automation Conference, Las Vegas, NV, pp. 623–628, 1996.

23. K. Ravindran, A. Kuehlmann, and E. Sentovich. Multi-domain clock skew scheduling. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 801–808, 2003.

24. L.-F. Chao and E. H.-M. Sha. Retiming and clock skew for synchronous systems. In Proceedings of the IEEE International Symposium on Circuits and Systems, London, England, pp. 283–286, 1994.

25. S. Pullela, N. Menezes, and L. T. Pillage. Reliable non-zero skew clock trees using wire width optimization. In Proceedings of the ACM/IEEE Design Automation Conference, Dallas, TX, pp. 165–170, 1993.

26. C.-P. Chen and D. F. Wong. A fast algorithm for optimal wire-sizing under Elmore delay model. In Proceedings of the IEEE International Symposium on Circuits and Systems, vol. 4, Atlanta, GA, pp. 412–415, 1996.


27. B. Lu, J. Hu, G. Ellis, and H. Su. Process variation aware clock tree routing. In Proceedings of the International Symposium on Physical Design, Monterey, CA, pp. 174–181, 2003.

28. A. Rajaram, J. Hu, and R. Mahapatra. Reducing clock skew variability via cross links. In Proceedings of the ACM/IEEE Design Automation Conference, Anaheim, CA, pp. 18–23, 2004.

29. A. Rajaram, D. Z. Pan, and J. Hu. Improved algorithms for link-based non-tree clock networks for skew variability reduction. In Proceedings of the International Symposium on Physical Design, San Francisco, CA, pp. 55–62, 2005.


43 Practical Issues in Clock Network Design

Chris Chu and Min Pan

CONTENTS

43.1 IBM S/390
43.2 IBM Power4
43.3 Alpha 21264
43.4 Intel Pentium II
43.5 Intel Pentium III
43.6 Intel Pentium 4
43.7 Intel Itanium
43.8 Intel Itanium 2
References

In this chapter, we present the clock network designs of several high-performance microprocessors to illustrate how the basic techniques presented in Chapter 42 are applied in practice. We focus on the clock network design of high-performance microprocessors as the stringent slew requirements make the design most challenging. Some useful discussions on practical issues in clock network design can also be found in Bindal and Friedman [1], Zhu [2], and Rusu [3].

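[Table: summary of the processors discussed in this chapter. The table layout did not survive extraction; the remaining entries are "single grid", "grids", "Active 28", "trees", "Fuse based 24", "trees", and "Active 10".]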


The processors discussed in this chapter are summarized in the table above. (Some entries are left blank because the corresponding information cannot be found.)

43.1 IBM S/390

The design of a 400-MHz microprocessor for the IBM S/390 Enterprise Server Generation-4 system is described in Ref. [4]. The chip is fabricated in a 0.2-µm $L_{\mathrm{eff}}$ CMOS technology with five layers of metal and tungsten local interconnect. The power supply is 2.5 V. The chip size is 17.35 mm × 17.30 mm with about 7.8 million transistors. The clock distribution network uses a balanced tree design, which is suitable for the relatively low clock frequency. A single-phase clock is distributed from a phase-locked loop (PLL)/central clock buffer located near the center of the chip to all the latches inside the macros in three levels of hierarchy.

The first two levels of clock distribution are in the form of balanced H-like trees, using primarily the top two metal layers. The first-level tree routes the global clock from the central clock buffer to nine sector buffers, as shown in Figure 43.1. The sector buffers repower the clock to all macros inside the sectors. There are 580 macro clock pins in the whole design.

FIGURE 43.1 First-level tree of the IBM S/390 clock distribution network. (From Webb, C.F. et al., J. Solid-State Circuits, 32, 1665, 1997. With permission.) (Floorplan labels: RU, 32 KB cache, Directory, TLB_ABS, 32 KB cache, CLKD, Address flow, Instruction flow. Legend: clock sector buffer; clock waveform measurement point.)


The clock propagation delay along the tree is balanced against macro input capacitance and the RLC characteristics of the tree wires. Horizontal wiring of each tree is in low-resistance Metal 5 (M5) (with 4.8-µm pitch). At various places along the tree, inductive coupling is reduced and the return path is improved by using power wires for shielding. Decoupling capacitors are incorporated into the central and sector buffers to reduce delta-I noise. A clock wiring methodology was developed with custom routing and timing computer-aided design (CAD) tools. The detailed routing as well as the widths of all clock wires were optimized to minimize skew, mean delay, power, wiring tracks, and sensitivity to process variations. Three-dimensional (3D) modeling was performed using a full-wave electromagnetic field solver [15], and distributed RLC modeling was used for virtually every wire in all the trees during the design and tuning/optimization process [16]. A number of cases were analyzed, and the results were used to generate a combination of analytic models and lookup tables containing distributed RLC parameters for all clock geometries used. Each wire segment was represented by an equivalent circuit consisting of up to six RLC π-segments. Extensive simulations and wire width tuning [17] were done to guarantee low clock skew at macro pins. The typical simulated RLC delay of the first-level tree is 300 ps with 20 ps skew at the sector buffers. The sector buffer delay is 230 ps. The typical simulated RLC delay within sectors is 210 ps with 30 ps skew at the macros.

The last level of clock distribution is local to each macro. Figure 43.2 shows the clocking scheme within macros. From the macro pin, the clocks are wired to clock blocks. The overall target skew for this wire is under 20 ps. For large-area macros, multiple clock pins were used to reduce the wirelength to clock blocks. The clock block generates local clocks that drive latches. The target skew for local clocks is under 50 ps.

All macrolevel wiring is done by hand for custom macros or with a place and route tool for synthesized macros. For synthesized macros that had many latches, and therefore multiple clock blocks, a clock optimization tool was used that reassigned latches to clock blocks based on cell placement. This resulted in clock blocks driving the latches that were placed closest to them. Macro layouts were extracted for R and C parasitics, and the extracted netlists were used to time the macros. This means that any skew in the last level of clock distribution was captured in that macro's timing abstraction.

FIGURE 43.2 Last/macrolevel clock distribution of IBM S/390. (From Webb, C.F. et al., J. Solid-State Circuits, 32, 1665, 1997. With permission.) (Figure labels: CLKG, clock chopper, clock splitter, combinational logic, CLKL, C2, C1, L2.)

FIGURE 43.3 Electron beam measured clock waveforms at macro pin locations marked on Figure 43.1. (From Webb, C.F. et al., J. Solid-State Circuits, 32, 1665, 1997. With permission.) (Traces: central clock buffer output; clock at ten of 580 macro pins in second-level clock tree.)

Figure 43.3 shows the measured waveforms of the central clock buffer output and the clocks at ten of the 580 macro pin locations (marked on Figure 43.1) driven by the second-level clock tree. The measurement was performed using a novel electron-beam prober with a 20-ps time resolution on the top wiring layer. Because the chip was powered using a standard cantilever probe card in the electron-beam prober, the chip clock was run at low frequency to reduce power supply noise. Power supply noise during these measurements was measured to be less than 100 mV. The results indicate a mean delay of 740 ps and less than 30 ps skew from the central clock buffer to the macro pins.
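As a quick consistency check on the figures quoted in this section (the sum is our own arithmetic, not a statement taken from Ref. [4]), the typical simulated delays of the first-level tree, the sector buffer, and the wiring within a sector add up to the measured mean delay from the central clock buffer to the macro pins:

$$300\ \text{ps} + 230\ \text{ps} + 210\ \text{ps} = 740\ \text{ps}$$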

43.2 IBM POWER4

The clock distribution of a 1.3-GHz Power4 microprocessor is described in Refs. [5,18]. The chip is fabricated in the IBM 0.18-µm CMOS 8S3 SOI (silicon-on-insulator) technology with seven levels of copper wiring. It has 174 million transistors. The power supply is 1.6 V.

The microprocessor uses a single chip-wide clock domain, with no active or programmable skew-reduction circuitry. Having multiple domains would allow active/programmable deskewing and coarse clock gating, and could result in lower skew within each small domain. Inevitably, however, with multiple domains there is increased skew and uncertainty between domains. In addition, multiple clock domains complicate early- and late-mode timings, and degrade critical paths that cross multiple domain boundaries. Extensive simulations of the Power4 chip and test-chip hardware measurements support the simplifying decision to maintain a single-domain global clock grid for the entire chip, with no programmable or active deskewing.

The global clock distribution strategy is based on a topology using a number of tuned trees driving a single full-chip clock grid [19]. This strategy was developed with the goal of being applicable to a variety of high-performance server microprocessors. It has previously been used in three S/390 chips and three PowerPC chips [19]. The trees-driving-grid topology combines many of the advantages of both trees and grids. Trees have low latency, low power, minimal wiring track usage, and the potential for very low skew. However, without the grid, trees must often be rerouted whenever the locations of clock pins change, or when the load capacitance values change significantly. The grid provides a constant structure so that the trees and the grids they are driving can be designed early to distribute the clock near every location where it may be needed. The regular grid also allows simple regular tree structures. This is important as it facilitates the construction of carefully designed transmission line structures with well-controlled capacitance and inductance. The grid reduces local skew by connecting nearby points directly. The tree wires are then tuned to minimize skew over longer distances.

The global clock distribution network of the 1.3-GHz Power4 chip is illustrated in Figure 43.4 using a 3D visualization showing all wire and buffer delays. In the network, a PLL near the center of the chip drives buffered H-trees, which are designed as symmetrically as possible. The H-trees drive the final set of 64 carefully placed sector buffers. Each sector buffer drives a tunable sector tree network, designed for minimum delay without length matching. These sector trees are tuned primarily by wire-width tuning. Then they all drive a single full-chip clock grid at 1024 evenly spaced points. From the global clock grid, a hierarchy of short clock routes completes the connection from the grid down to the individual local clock buffer inputs in the macros. There are 15,200 global clock pins.

FIGURE 43.4 3D visualization of the Power4 global clock distribution. (From Restle, P.J. et al., Proc. IEEE Intl. Solid-State Circuits Conf., 2002, pp. 144–145. With permission.) (Figure labels: grid, tuned sector trees, sector buffers level 4, buffer level 3, buffer level 2, buffer level 1; horizontal axes X and Y; vertical axis ticks from 100 to 800.)

It is reported in Ref. [5] that the maximum skew measured at 19 places with picoprobes is 25 ps, and the maximum skew by picosecond imaging for circuit analysis (PICA) measurements from nine sector buffers is less than 18 ps.

43.3 ALPHA 21264

The clocking design of a 600-MHz Alpha 21264 microprocessor is presented in Ref. [6]. The chip is fabricated in a 0.35-µm CMOS process with six metal layers. Four metal layers (called M1 to M4) are for signals, one (between M2 and M3) is for a VSS reference plane, and one (above M4) is for a VDD reference plane. It has 15.2 million transistors. This microprocessor employs a hierarchical clock distribution scheme as illustrated in Figure 43.5. At the top level, there is a global clock grid called GCLK, which covers the entire die. Next, there are six major clock grids over certain execution units. At the bottom level, local clocks are generated as needed from any clock (global clock, major clocks, or other local clocks). Previous Alpha microprocessors used a single grid to distribute the global clock signal [20,21]. The hierarchical scheme is chosen for this microprocessor because of tighter skew constraints, the importance of clock power minimization, and the need for a flexible clocking methodology to solve local timing problems. The drawback is that skew management becomes much more complicated. State elements and clocking points exist from 0 to 8 stages past GCLK. The clock distribution network needs to be carefully designed based on rigorous and thorough timing verification.

The GCLK grid is driven by a global clock distribution network as shown in Figure 43.6. The network connects a PLL located in a corner of the chip to 16 distributed global clock drivers. The arrangement of global clock drivers, which resembles four windowpanes, achieves low skew by dividing the chip into regions, thus reducing the maximum distance from the drivers to the farthest loads. A windowpane arrangement also reduces sensitivity to process variation because each grid pane is redundantly driven from four sides. In general, distributing the drivers widely across the chip
