FIGURE 42.11 Clock hazards and timing constraints.
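Consider a pair of registers $\mathrm{FF}_i$ and $\mathrm{FF}_j$, as shown in Figure 42.11. Let $t_i$ and $t_j$ be the clock delays from the clock source to $\mathrm{FF}_i$ and $\mathrm{FF}_j$, respectively. Let $t_i^{\mathrm{clk2q}}$ be the clock-to-Q delay for $\mathrm{FF}_i$. Let $\mathrm{MAX}(D_{ij})$ and $\mathrm{MIN}(D_{ij})$ be the maximum and minimum delays of the combinational logic from $\mathrm{FF}_i$ to $\mathrm{FF}_j$. Let $t_j^{\mathrm{setup}}$ and $t_j^{\mathrm{hold}}$ be the setup time and hold time for $\mathrm{FF}_j$, respectively. Let $P$ be the clock period. The setup time and hold time constraints can be expressed as

$$t_i + t_i^{\mathrm{clk2q}} + \mathrm{MAX}(D_{ij}) + t_j^{\mathrm{setup}} \le t_j + P \tag{42.2}$$

$$t_i + t_i^{\mathrm{clk2q}} + \mathrm{MIN}(D_{ij}) \ge t_j + t_j^{\mathrm{hold}} \tag{42.3}$$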
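The two constraints can be checked directly for a candidate schedule. The following is a minimal sketch, assuming a hypothetical `Path` record and made-up delay values; a non-negative slack means the corresponding constraint (Equation 42.2 or 42.3) holds.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One local data path FF_i -> FF_j (hypothetical record; all times in one unit)."""
    i: int          # index of the launching register FF_i
    j: int          # index of the capturing register FF_j
    clk2q: float    # clock-to-Q delay of FF_i
    max_d: float    # MAX(D_ij)
    min_d: float    # MIN(D_ij)
    setup: float    # setup time of FF_j
    hold: float     # hold time of FF_j


def setup_slack(t, p, period):
    # Equation 42.2 rearranged: (t_j + P) - (t_i + clk2q + MAX(D_ij) + setup) >= 0
    return t[p.j] + period - (t[p.i] + p.clk2q + p.max_d + p.setup)


def hold_slack(t, p):
    # Equation 42.3 rearranged: (t_i + clk2q + MIN(D_ij)) - (t_j + hold) >= 0
    return t[p.i] + p.clk2q + p.min_d - (t[p.j] + p.hold)


# Illustrative numbers only.
path = Path(i=0, j=1, clk2q=1.0, max_d=6.0, min_d=2.0, setup=1.0, hold=0.5)
zero_skew = [0.0, 0.0]   # both registers clocked at the same time
skewed = [0.0, 1.0]      # FF_j clocked one time unit later than FF_i

for name, t in (("zero skew", zero_skew), ("skewed", skewed)):
    print(name, setup_slack(t, path, period=7.0), hold_slack(t, path))
```

With these numbers, delaying the clock of the capturing register removes the setup violation at the cost of some hold slack, which is the trade-off the scheduling formulations exploit.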
A clock schedule is a set of delays from the clock source to all registers in the synchronous system. The clock scheduling problem is to find a clock schedule $\{t_1, \ldots, t_N\}$ for all registers $\mathrm{FF}_1, \ldots, \mathrm{FF}_N$ to minimize the clock period $P$ while satisfying the constraints in Equations 42.2 and 42.3. This problem can be formulated as a linear program as follows [20]:
LP_SPEED: Minimize $P$

subject to

$$t_j - t_i \ge t_j^{\mathrm{setup}} + t_i^{\mathrm{clk2q}} + \mathrm{MAX}(D_{ij}) - P \quad \text{for } i, j = 1, \ldots, N$$

$$t_i - t_j \ge t_j^{\mathrm{hold}} - t_i^{\mathrm{clk2q}} - \mathrm{MIN}(D_{ij}) \quad \text{for } i, j = 1, \ldots, N$$

$$t_i \ge \mathrm{MIN\_DELAY} \quad \text{for } i = 1, \ldots, N$$

Alternatively, we can find a clock schedule to maximize the minimum safety margin $M$ for a given clock period $P$. This problem can be formulated as a linear program as follows:
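LP_SAFETY: Maximize $M$

subject to

$$t_j - t_i \ge t_j^{\mathrm{setup}} + t_i^{\mathrm{clk2q}} + \mathrm{MAX}(D_{ij}) - P + M \quad \text{for } i, j = 1, \ldots, N$$

$$t_i - t_j \ge t_j^{\mathrm{hold}} - t_i^{\mathrm{clk2q}} - \mathrm{MIN}(D_{ij}) + M \quad \text{for } i, j = 1, \ldots, N$$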
In both formulations, $\mathrm{MAX}(D_{ij}) = -\infty$ and $\mathrm{MIN}(D_{ij}) = \infty$ if there is no combinational path from $\mathrm{FF}_i$ to $\mathrm{FF}_j$.
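As an illustration of LP_SPEED, the sketch below avoids a general LP solver. Because every constraint bounds a difference of two clock delays, a given period $P$ is feasible exactly when the corresponding constraint graph has no negative cycle, so one can binary search on $P$ and test each candidate with Bellman-Ford. This is a swapped-in technique chosen for self-containment, not necessarily the method of Ref. [20]; the function names, the tuple layout of `paths`, and the numbers are assumptions. Only register pairs that actually have a combinational path are listed, which matches the $\pm\infty$ convention above.

```python
def feasible_schedule(n, paths, period, min_delay=0.0):
    """Return clock delays t[0..n-1] meeting setup/hold at `period`, or None.

    paths: tuples (i, j, clk2q_i, max_d, min_d, setup_j, hold_j).
    An edge (u, v, w) encodes the difference constraint t[v] - t[u] <= w.
    """
    edges = []
    for i, j, clk2q, max_d, min_d, setup, hold in paths:
        # Setup: t_j - t_i >= setup + clk2q + MAX(D) - P
        edges.append((j, i, period - (setup + clk2q + max_d)))
        # Hold:  t_i - t_j >= hold - clk2q - MIN(D)
        edges.append((i, j, clk2q + min_d - hold))
    dist = [0.0] * n                      # implicit source connected to every node
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            # Difference constraints are shift-invariant, so t_i >= MIN_DELAY
            # is met by shifting the whole schedule.
            shift = min_delay - min(dist)
            return [d + shift for d in dist]
    return None                           # still relaxing: negative cycle


def minimize_period(n, paths, min_delay=0.0, iters=60):
    hi = 1.0
    while feasible_schedule(n, paths, hi, min_delay) is None:
        hi *= 2.0
        if hi > 1e12:
            raise ValueError("hold constraints are infeasible for every period")
    lo = 0.0
    for _ in range(iters):                # bisection: feasibility is monotone in P
        mid = 0.5 * (lo + hi)
        if feasible_schedule(n, paths, mid, min_delay) is None:
            lo = mid
        else:
            hi = mid
    return hi, feasible_schedule(n, paths, hi, min_delay)


# Two registers in a loop; clk2q, setup, and hold set to 0 for readability.
paths = [
    (0, 1, 0.0, 10.0, 4.0, 0.0, 0.0),    # slow path FF_0 -> FF_1
    (1, 0, 0.0, 4.0, 2.0, 0.0, 0.0),     # fast path FF_1 -> FF_0
]
best_period, schedule = minimize_period(2, paths)
print(best_period, schedule)
```

In this made-up example a zero-skew design needs a period of 10 (the longer path), while the two setup constraints around the loop only require $2P \ge 10 + 4$, so the search settles near $P = 7$ with the second register clocked about 3 units later than the first.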
After the clock schedule $S = \{t_1, \ldots, t_N\}$ is computed, the next step is to construct a clock network to realize the obtained schedule. The DME algorithm in Section 42.2.4 can be easily extended to handle this problem. We only need to construct the merging segments to achieve the given skews instead of zero skews in the bottom-up phase of the DME algorithm. However, the solutions of the linear programs may not be unique. Each clock delay $t_i$ could be a range rather than a fixed value. In this case, the clock routing problem becomes the bounded-skew routing tree (BST) problem. In Ref. [21], Cong et al. proposed two algorithms, BME (boundary merging and embedding) and IME (interior merging and embedding), to handle this problem. These two algorithms extend the DME algorithm by finding a polygonal region based on the skew bounds rather than a merging segment to represent all possible locations for each tapping point.
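To make the "given skews instead of zero skews" step concrete, the sketch below computes the tapping point on a wire joining two subtrees so that the Elmore delays to the two sides differ by a prescribed amount. It assumes a uniform wire with per-unit resistance `r` and capacitance `c`, treats each subtree as a lumped delay and load capacitance, and does not handle the wire-elongation case; all names and numbers are illustrative, and setting `skew=0` reduces to the usual zero-skew balance condition.

```python
def branch_delay(wire_len, sub_delay, sub_cap, r, c):
    """Elmore delay through a wire of length wire_len into a subtree."""
    return r * wire_len * (c * wire_len / 2.0 + sub_cap) + sub_delay


def tapping_point(t_a, cap_a, t_b, cap_b, length, r, c, skew=0.0):
    """Return x in [0, 1]; the tapping point lies x*length away from subtree a.

    skew is the required (delay to subtree a) - (delay to subtree b).
    Returns None when no point on the wire works (elongation would be needed).
    """
    x = (skew + t_b - t_a + r * length * (cap_b + c * length / 2.0)) / (
        r * length * (c * length + cap_a + cap_b))
    if not 0.0 <= x <= 1.0:
        return None
    return x


# Two identical subtrees, 10 units apart; ask for subtree a to be 12 units later.
L, r, c = 10.0, 1.0, 1.0
x_zero = tapping_point(0.0, 1.0, 0.0, 1.0, L, r, c)
x_skew = tapping_point(0.0, 1.0, 0.0, 1.0, L, r, c, skew=12.0)
d_a = branch_delay(x_skew * L, 0.0, 1.0, r, c)
d_b = branch_delay((1.0 - x_skew) * L, 0.0, 1.0, r, c)
print(x_zero, x_skew, d_a - d_b)
```

Moving the tapping point away from subtree a lengthens the wire on that side, which is how the prescribed positive skew is realized without changing total wirelength.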
Apart from the original formulations of clock scheduling, there are some other extensions. Neves and Friedman [22] formulated the process variation tolerant optimal clock skew scheduling problem. To better control the effects of process variations, they find the permissible range (i.e., the range of the clock skew without timing violation) for each local path, select a clock skew value that allows a maximum variation of skew within the permissible range, and finally determine the clock delay to each register. Recently, Ravindran et al. [23] discussed the multidomain clock skew scheduling problem. For a given number of clocking domains $n$ and a maximum permissible within-domain latency $\delta$, the multidomain range constraints require that all clock latencies must fit into $n$ value ranges $(l(d_i), l(d_i) + \delta)$ for $i = 1, \ldots, n$. The objective of multidomain clock skew scheduling is to determine domain phase shifts $l(d_i)$ and register latencies that satisfy the clock domain constraints and minimize the clock period.
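The permissible range of a single local path follows directly from Equations 42.2 and 42.3. The sketch below assumes the skew of a path is defined as $t_i - t_j$ and picks the midpoint of the range as the most variation-tolerant value for an isolated path; this is a simplification for illustration, since the skews of different paths are coupled through the shared register delays, and it is not claimed to be the selection procedure of Ref. [22]. Names and numbers are hypothetical.

```python
def permissible_range(period, clk2q, max_d, min_d, setup, hold):
    """Return (lo, hi) bounds on the skew t_i - t_j of one local path."""
    lo = hold - clk2q - min_d              # hold constraint (Equation 42.3)
    hi = period - setup - clk2q - max_d    # setup constraint (Equation 42.2)
    return lo, hi


def most_tolerant_skew(lo, hi):
    """Midpoint of the range and the skew variation it tolerates on each side."""
    if lo > hi:
        raise ValueError("empty permissible range: the period is too small")
    return 0.5 * (lo + hi), 0.5 * (hi - lo)


lo, hi = permissible_range(period=10.0, clk2q=1.0, max_d=6.0,
                           min_d=2.0, setup=1.0, hold=0.5)
skew, margin = most_tolerant_skew(lo, hi)
print(lo, hi, skew, margin)
```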
Finally, we briefly discuss two similar sequential optimization techniques, clock scheduling and retiming. They are, respectively, continuous and discrete optimizations with the same effect on minimizing the clock period [20]. The equivalence of the two techniques was studied in Ref. [24]. It is proved that there exists a retiming $R$ to achieve clock period $P$ if and only if there exists a clock schedule $S$ with the same clock period. However, the practical use of retiming is limited for two reasons. First, retiming has an adverse impact on the verification methodology. Second, using retiming for maximum performance often causes a steep increase in the number of registers. Clock scheduling does not have these two limitations. Another advantage of clock scheduling is that because retiming can only move registers across discrete amounts of logic delay, the resulting system after retiming can still benefit from clock scheduling.
42.5 HANDLING VARIABILITY
In minimizing skew sensitivity to process variations, two guiding principles are that the network should be as symmetrical and as fast as possible. In a clock network designed and laid out symmetrically, chipwide process (or environmental) variations should affect all clock paths identically. An additional advantage is that any systematic skew caused by modeling errors is eliminated by symmetry. In a fast network, as the clock phase delay is small, any fractional variations in delay lead only to a modest amount of skew. In addition, a clock network with optimal delay is the most tolerant to process variations. At the optimal delay point with respect to a certain parameter, the delay sensitivity over that parameter (i.e., the slope of the delay function) should be zero. However, it is not trivial to apply these two principles in practice. Because of uneven load distribution and routing/buffer obstacles, it is usually impossible to construct a completely symmetrical network. Moreover, minimizing the network delay may conflict with the optimization of some other metrics (e.g., skew, area, and power). Several important works on reliable clock network design under process variations are discussed below.
The concept of delay sensitivity is very useful in considering process variations. Pullela et al. [25] first made use of delay sensitivity with respect to wire width variations to improve the delay, skew, and skew sensitivity of a given clock tree by wire width optimization. The Elmore delay model and the L-type RC model for each branch are used in the paper, but the concept can be generalized to other models. Let $R_j$ be the resistance, $C_j$ be the capacitance, and $C_{dj}$ be the downstream capacitance of branch $j$. Let $U(i)$ be the set of all branches on the path from sink $i$ to the root. Then the Elmore delay from the root to sink $i$ is $T_{di} = \sum_{j \in U(i)} R_j C_{dj}$. Therefore, the sensitivities of the Elmore delay of sink $i$ with respect to circuit parameters $C_j$ and $R_j$ are

$$\frac{\partial T_{di}}{\partial C_j} = R_{cij} \tag{42.4}$$

$$\frac{\partial T_{di}}{\partial R_j} = \begin{cases} C_{dj} & \text{if } j \in U(i) \\ 0 & \text{otherwise} \end{cases} \tag{42.5}$$
where $R_{cij}$ is the total resistance along the common path from sink $i$ to the root and branch $j$ to the root. $R_j$ and $C_j$ can be expressed as functions of the width $w_j$ of branch $j$ as $R_j = R_0 L_j / w_j$ and $C_j = C_a L_j w_j + C_f L_j$, where $R_0$, $C_a$, and $C_f$ are technology parameters independent of $w_j$, and $L_j$ is the length of branch $j$. Therefore, the delay sensitivity of sink $i$ to width $w_j$ is

$$\frac{\partial T_{di}}{\partial w_j} = \frac{\partial T_{di}}{\partial C_j}\frac{\partial C_j}{\partial w_j} + \frac{\partial T_{di}}{\partial R_j}\frac{\partial R_j}{\partial w_j} = \frac{\partial T_{di}}{\partial C_j}\, C_a L_j - \frac{\partial T_{di}}{\partial R_j}\, \frac{R_0 L_j}{w_j^2}$$
By incremental computation as described in Ref. [26], Equations 42.4 and 42.5 for all $i$ and $j$ can be computed in $O(n^2)$ time for a tree with $n$ sinks.∗ Hence, the delay sensitivities for all sinks to all branch widths can also be found in $O(n^2)$ time.
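The sensitivity formulas can be exercised on a small tree. The sketch below is a direct, unoptimized evaluation (not the incremental $O(n^2)$ method of Ref. [26]); the three-branch tree, the technology constants, and the function names are made up for illustration. It assumes the L-type model places each branch's capacitance at its downstream end, so that $C_j$ is counted in $C_{dj}$ and the common path in $R_{cij}$ includes branch $j$ itself when $j \in U(i)$.

```python
R0, CA, CF = 0.1, 0.2, 0.05     # hypothetical technology parameters

# Branch j connects node j to its parent; branch 0 hangs off the root (-1).
parent = [-1, 0, 0]
length = [10.0, 5.0, 8.0]
load = [0.0, 1.0, 2.0]          # sink load at the downstream node (0 if internal)


def path_to_root(j):
    """Branches on the path from branch/sink j up to the root, j included."""
    out = []
    while j != -1:
        out.append(j)
        j = parent[j]
    return out


def res(width, j):
    return R0 * length[j] / width[j]


def cap(width, j):
    return CA * length[j] * width[j] + CF * length[j]


def downstream_cap(width, j):
    # Sum over every branch k in the subtree of j (j itself included).
    return sum(cap(width, k) + load[k]
               for k in range(len(parent)) if j in path_to_root(k))


def elmore(width, sink):
    return sum(res(width, j) * downstream_cap(width, j)
               for j in path_to_root(sink))


def width_sensitivity(width, sink, j):
    u_i = path_to_root(sink)
    dT_dC = sum(res(width, k) for k in path_to_root(j) if k in u_i)  # Eq. 42.4
    dT_dR = downstream_cap(width, j) if j in u_i else 0.0            # Eq. 42.5
    return dT_dC * CA * length[j] - dT_dR * R0 * length[j] / width[j] ** 2


widths = [1.0, 1.0, 1.0]
for sink in (1, 2):
    print(sink, elmore(widths, sink),
          [width_sensitivity(widths, sink, j) for j in range(3)])
```

Note that widening a branch off the path to a sink can only increase that sink's delay (the second term vanishes), whereas widening a branch on the path trades added capacitance against reduced resistance.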
In Ref. [25], a greedy heuristic is proposed to iteratively increase the widths to improve delay, skew, and skew sensitivity. The selection of the branch to widen in each step is based on the delay sensitivities, which give the delay change of each sink when widening a branch. In particular, they argued that wire widening is a better method for delay balancing than wire elongating, as widening generally reduces skew sensitivity but elongating increases it.
Lu et al. [27] formulated the minimizing skew violation (MinSV) problem to construct a clock tree considering wire width variation due to process variations. Given the range of permissible skew for each pair of clock sinks, they tried to find a clock routing tree such that the maximum skew violation among all pairs of sinks is minimized under wire width variation. The way they construct the tree follows the framework of the DME algorithm. Because of wire width variation, the skew between a sink pair becomes a range rather than a unique value. To maximize the safety margin due to process variations, in the bottom-up stage, they chose the merging segment for the tapping point such that the center of the skew range of the most critical sink pair coincides with the center of the permissible range for this sink pair. Besides improving process variation tolerance, they also proposed an algorithm to minimize wirelength when there is no skew violation under wire width variation.
Recently, Rajaram et al. [28] proposed to insert cross links in a given clock tree to improve its skew sensitivity. Like the grid and the spine structures, the cross links equalize the delay of different points by connecting them together. Such an approach can tolerate both process and environmental variations. Moreover, because the cross links are selectively inserted based on the trade-off between skew sensitivity reduction and extra wire usage, this approach can achieve significant skew sensitivity reduction with little increase in wirelength. The link insertion algorithm is improved in Ref. [29].
∗ In Ref. [25], an $O(n^2 \log n)$ algorithm by adjoint analysis is proposed.

REFERENCES
1. J. M. Rabaey, A. Chandrakasan, and B. Nikolić. Digital Integrated Circuits: A Design Perspective, 2nd edn. Prentice Hall, 2003.
2. H. Veendrick. Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits. IEEE Journal of Solid-State Circuits, SC-19:468–473, August 1984.
3. P. J. Restle, T. G. McNamara, D. A. Webber, P. J. Camporese, K. F. Eng, K. A. Jenkins, D. H. Allen, M. J. Rohn, M. P. Quaranta, D. W. Boerstler, C. J. Alpert, C. A. Carter, R. N. Bailey, J. G. Petronick, B. L. Krauter, and B. D. McCredie. A clock distribution network for microprocessors. IEEE Journal of Solid-State Circuits, 36(5):792–799, May 2001.
4. M. Gowan, L. Biro, and D. Jackson. Power considerations in the design of the Alpha 21264 microprocessor. In Proceedings of the ACM/IEEE Design Automation Conference, San Francisco, CA, pp. 433–439, 1998.
5. D. R. Gonzales. Micro-RISC architecture for the wireless market. IEEE Micro, 19(4):30–37, 1999.
6. D. E. Duarte, N. Vijaykrishnan, and M. J. Irwin. A clock power model to evaluate impact of architectural and technology optimizations. IEEE Transactions on Very Large Scale Integration Systems, 10(6):844–855, December 2002.
7. M. A. B. Jackson, A. Srinivasan, and E. S. Kuh. Clock routing for high-performance ICs. In Proceedings of the ACM/IEEE Design Automation Conference, Orlando, FL, pp. 573–579, 1990.
8. J. Cong, A. B. Kahng, and G. Robins. Matching-based methods for high-performance clock routing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 12(8):1157–1169, August 1993 (DAC 1991).
9. R.-S. Tsay. An exact zero-skew clock routing algorithm. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 12(2):242–249, February 1993 (ICCAD 1991).
10. W. C. Elmore. The transient response of damped linear networks with particular regard to wideband amplifiers. Journal of Applied Physics, 19:55–63, 1948.
11. M. Edahiro. Minimum skew and minimum path length routing in VLSI layout design. NEC Research and Development, 32(4):569–575, 1991.
12. T.-H. Chao, Y.-C. Hsu, and J.-M. Ho. Zero skew clock net routing. In Proceedings of the ACM/IEEE Design Automation Conference, Anaheim, CA, pp. 518–523, 1992.
13. K. D. Boese and A. B. Kahng. Zero-skew clock routing trees with minimum wirelength. In Proceedings of the IEEE International ASIC Conference, Rochester, NY, pp. 1.1.1–1.1.5, September 1992.
14. J. G. Xi and W. W.-M. Dai. Buffer insertion and sizing under process variations for low power clock distribution. In Proceedings of the ACM/IEEE Design Automation Conference, San Francisco, CA, pp. 491–496, 1995.
15. A. Vittal and M. Marek-Sadowska. Low-power buffered clock tree design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 965–975, September 1997 (DAC 1995).
16. S. Pullela, N. Menezes, J. Omar, and L. T. Pillage. Skew and delay optimization for reliable buffered clock trees. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 556–562, 1993.
17. B. J. Benschneider, A. J. Black, W. J. Bowhill, S. M. Britton, D. E. Dever, D. R. Donchin, R. J. Dupcak, R. M. Fromm, M. K. Gowan, P. E. Gronowski, M. Kantrowitz, M. E. Lamere, S. Mehta, J. E. Meyer, R. O. Mueller, A. Olesin, R. P. Preston, D. A. Priore, S. Santhanam, M. J. Smith, and G. M. Wolrich. A 300-MHz 64-b quad-issue CMOS RISC microprocessor. IEEE Journal of Solid-State Circuits, 30(11):1203–1214, November 1995 (ISSCC 1995).
18. S. Lin and C. K. Wong. Process-variation-tolerant clock skew minimization. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 284–288, 1994.
19. H. Su and S. S. Sapatnekar. Hybrid structured clock network construction. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 333–336, 2001.
20. J. P. Fishburn. Clock skew optimization. IEEE Transactions on Computers, 39(7):945–951, July 1990.
21. J. Cong, A. B. Kahng, C.-K. Koh, and C.-W. A. Tsao. Bounded-skew clock and Steiner routing. ACM Transactions on Design Automation of Electronic Systems, 3(3):341–388, 1998 (ICCAD 1995).
22. J. L. Neves and E. G. Friedman. Optimal clock skew scheduling tolerant to process variations. In Proceedings of the ACM/IEEE Design Automation Conference, Las Vegas, NV, pp. 623–628, 1996.
23. K. Ravindran, A. Kuehlmann, and E. Sentovich. Multi-domain clock skew scheduling. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 801–808, 2003.
24. L.-F. Chao and E. H.-M. Sha. Retiming and clock skew for synchronous systems. In Proceedings of the IEEE International Symposium on Circuits and Systems, London, England, pp. 283–286, 1994.
25. S. Pullela, N. Menezes, and L. T. Pillage. Reliable non-zero skew clock trees using wire width optimization. In Proceedings of the ACM/IEEE Design Automation Conference, Dallas, TX, pp. 165–170, 1993.
26. C.-P. Chen and D. F. Wong. A fast algorithm for optimal wire-sizing under Elmore delay model. In Proceedings of the IEEE International Symposium on Circuits and Systems, vol. 4, Atlanta, GA, pp. 412–415, 1996.
27. B. Lu, J. Hu, G. Ellis, and H. Su. Process variation aware clock tree routing. In Proceedings of the International Symposium on Physical Design, Monterey, CA, pp. 174–181, 2003.
28. A. Rajaram, J. Hu, and R. Mahapatra. Reducing clock skew variability via cross links. In Proceedings of the ACM/IEEE Design Automation Conference, Anaheim, CA, pp. 18–23, 2004.
29. A. Rajaram, D. Z. Pan, and J. Hu. Improved algorithms for link-based non-tree clock networks for skew variability reduction. In Proceedings of the International Symposium on Physical Design, San Francisco, CA, pp. 55–62, 2005.
43 Practical Issues in Clock Network Design

Chris Chu and Min Pan

CONTENTS
43.1 IBM S/390
43.2 IBM Power4
43.3 Alpha 21264
43.4 Intel Pentium II
43.5 Intel Pentium III
43.6 Intel Pentium 4
43.7 Intel Itanium
43.8 Intel Itanium 2
References
In this chapter, we present the clock network designs of several high-performance microprocessors to illustrate how the basic techniques presented in Chapter 42 are applied in practice. We focus on the clock network design of high-performance microprocessors as the stringent slew requirements make the design most challenging. Some useful discussions on practical issues in clock network design can also be found in Bindal and Friedman [1], Zhu [2], and Rusu [3].
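[Table: summary of the processors discussed in this chapter. Only fragments of the table survive: clock topology entries (single grid, grids, trees) and deskewing entries (active, fuse based).]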
The processors discussed in this chapter are summarized in the table above. (Some entries are left blank because the corresponding information cannot be found.)
43.1 IBM S/390
The design of a 400-MHz microprocessor for the IBM S/390 Enterprise Server Generation-4 system is described in Ref. [4]. The chip is fabricated in a 0.2-µm $L_{\mathrm{eff}}$ CMOS technology with five layers of metal and tungsten local interconnect. The power supply is 2.5 V. The chip size is 17.35 mm × 17.30 mm with about 7.8 million transistors. The clock distribution network uses a balanced tree design, which is suitable for the relatively low clock frequency. A single-phase clock is distributed from a phase-locked loop (PLL)/central clock buffer located near the center of the chip to all the latches inside the macros in three levels of hierarchy.
The first two levels of clock distribution are in the form of balanced H-like trees, using primarily the top two metal layers. The first-level tree routes the global clock from the central clock buffer to nine sector buffers, as shown in Figure 43.1. The sector buffers repower the clock to all macros inside the sectors. There are 580 macro clock pins in the whole design.
FIGURE 43.1 First-level tree of the IBM S/390 clock distribution network. (From Webb, C.F. et al., J. Solid-State Circuits, 32, 1665, 1997. With permission.)
The clock propagation delay along the tree is balanced against macro input capacitance and RLC characteristics of the tree wires. Horizontal wiring of each tree is in low-resistance Metal 5 (M5) (with 4.8-µm pitch). At various places along the tree, inductive coupling is reduced and the return path is improved by using power wires for shielding. Decoupling capacitors are incorporated into central and sector buffers to reduce delta-I noise. A clock wiring methodology was developed with custom routing and timing computer-aided design (CAD) tools. The detailed routing as well as the widths of all clock wires were optimized to minimize skew, mean delay, power, wiring tracks, and sensitivity to process variations. Three-dimensional (3D) modeling was performed using a full-wave electromagnetic field solver [15], and distributed RLC modeling was used for virtually every wire in all the trees during the design and tuning/optimization process [16]. A number of cases were analyzed, and the results were used to generate a combination of analytic models and lookup tables containing distributed RLC parameters for all clock geometries used. Each wire segment was represented by an equivalent circuit consisting of up to six RLC π-segments. Extensive simulations and wire width tuning [17] were done to guarantee low clock skew at macro pins. Typical simulated RLC delay of the first-level tree is 300 ps with 20 ps skew at the sector buffers. The sector buffer delay is 230 ps. Typical simulated RLC delay within sectors is 210 ps with 30 ps skew at the macros.
The last level of clock distribution is local to each macro. Figure 43.2 shows the clocking scheme within macros. From the macro pin, the clocks are wired to clock blocks. The overall target skew for this wire is under 20 ps. For large area macros, multiple clock pins were used to reduce wirelength to clock blocks. The clock block generates local clocks that drive latches. The target skew for local clocks is under 50 ps.

All macrolevel wiring is done by hand for custom macros or with a place and route tool for synthesized macros. For synthesized macros that had many latches, and therefore multiple clock blocks, a clock optimization tool was used that reassigned latches to clock blocks based on cell placement. This resulted in clock blocks driving latches that were placed closest to them. Macro layouts were extracted for R and C parasitics, and the extracted netlists were used to time the macros. This means that any skew in the last level of clock distribution was captured in that macro's timing abstraction.
FIGURE 43.2 Last/macrolevel clock distribution of IBM S/390. (From Webb, C.F. et al., J. Solid-State Circuits, 32, 1665, 1997. With permission.)

FIGURE 43.3 Electron beam measured clock waveforms at macro pin locations marked on Figure 43.1. (From Webb, C.F. et al., J. Solid-State Circuits, 32, 1665, 1997. With permission.)

Figure 43.3 shows the measured waveforms of the central clock buffer output and clocks at ten points of the 580 macro pin locations (marked on Figure 43.1) driven by the second level clock tree. The measurement was performed using a novel electron-beam prober with a 20-ps time resolution on the top wiring layer. Because the chip was powered using a standard cantilever probe card in the electron-beam prober, the chip clock was run at low frequency to reduce power supply noise. Power supply noise during these measurements was measured to be less than 100 mV. The results indicate a mean delay of 740 ps and less than 30 ps skew from the central clock buffer to the macro pins.
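As a quick consistency check, and assuming the three stages simply add in series from the central clock buffer to the macro pins, the typical simulated delays quoted earlier for the first-level tree, the sector buffer, and the sector trees sum to the measured mean delay:

$$300\ \mathrm{ps} + 230\ \mathrm{ps} + 210\ \mathrm{ps} = 740\ \mathrm{ps}$$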
43.2 IBM POWER4
The clock distribution of a 1.3-GHz Power4 microprocessor is described in Refs. [5,18]. The chip is fabricated in the IBM 0.18-µm CMOS 8S3 SOI (silicon-on-insulator) technology with seven levels of copper wiring. It has 174 million transistors. The power supply is 1.6 V.
The microprocessor uses a single chip-wide clock domain, with no active or programmable skew-reduction circuitry. Having multiple domains would allow active/programmable deskewing and coarse clock gating, and could result in lower skew within each small domain. Inevitably, however, with multiple domains there is increased skew and uncertainty between domains. In addition, multiple clock domains complicate early- and late-mode timings, and degrade critical paths that cross multiple domain boundaries. Extensive simulations of the Power4 chip and test-chip hardware measurements support the simplifying decision to maintain a single-domain global clock grid for the entire chip, with no programmable or active deskewing.
The global clock distribution strategy is based on a topology using a number of tuned trees driving a single full-chip clock grid [19]. This strategy is developed with the goal of being applicable to a variety of high-performance server microprocessors. It has been previously used in three S/390 chips and three PowerPC chips [19]. The trees-driving-grid topology combines many of the advantages of both trees and grids. Trees have low latency, low power, minimal wiring track usage, and the potential for very low skew. However, without the grid, trees must often be rerouted whenever the locations of clock pins change, or when the load capacitance values change significantly. The grid provides a constant structure so that the trees and the grids they are driving can be designed early to distribute the clock near every location where it may be needed. The regular grid also allows simple regular tree structures. This is important as it facilitates the design of carefully designed transmission line structures with well-controlled capacitance and inductance. The grid reduces local skew by connecting nearby points directly. The tree wires are then tuned to minimize skew over longer distances.
The global clock distribution network of the 1.3-GHz Power4 chip is illustrated in Figure 43.4 using a 3D visualization showing all wire and buffer delays. In the network, a PLL near the center of the chip drives buffered H-trees, which are designed as symmetrically as possible. The H-trees drive the final set of 64 carefully placed sector buffers. Each sector buffer drives a tunable sector tree network, designed for minimum delay without length matching. These sector trees are tuned primarily by wire-width tuning. Then they all drive a single full-chip clock grid at 1024 evenly spaced points. From the global clock grid, a hierarchy of short clock routes completes the connection from the grid down to the individual local clock buffer inputs in the macros. There are 15,200 global clock pins.

FIGURE 43.4 3D visualization of the Power4 global clock distribution. (From Restle, P.J. et al., Proc. IEEE Intl. Solid-State Circuits Conf., 2002, pp. 144–145. With permission.)
It is reported in Ref. [5] that the maximum skew measured at 19 places with picoprobes is 25 ps, and the maximum skew by picosecond imaging for circuit analysis (PICA) measurements from nine sector buffers is less than 18 ps.
43.3 ALPHA 21264
The clocking design of a 600-MHz Alpha 21264 microprocessor is presented in Ref. [6]. The chip is fabricated in a 0.35-µm CMOS process with six metal layers. Four metal layers (called M1 to M4) are for signals, one (between M2 and M3) is for a $V_{SS}$ reference plane, and one (above M4) is for a $V_{DD}$ reference plane. It has 15.2 million transistors. This microprocessor employs a hierarchical clock distribution scheme as illustrated in Figure 43.5. At the top level, there is a global clock grid called GCLK, which covers the entire die. Next, there are six major clock grids over certain execution units. At the bottom level, local clocks are generated as needed from any clock (global clock, major clocks, or other local clocks). Previous Alpha microprocessors use a single grid to distribute the global clock signal [20,21]. The hierarchical scheme is chosen for this microprocessor because of tighter skew constraints, the importance of clock power minimization, and the need for a flexible clocking methodology to solve local timing problems. The drawback is that skew management becomes much more complicated. State elements and clocking points exist from 0 to 8 stages past GCLK. The clock distribution network needs to be carefully designed based on rigorous and thorough timing verification.

The GCLK grid is driven by a global clock distribution network as shown in Figure 43.6. The network connects a PLL located in a corner of the chip to 16 distributed global clock drivers. The arrangement of global clock drivers, which resembles four windowpanes, achieves low skew by dividing the chip into regions, thus reducing the maximum distance from the drivers to the farthest loads. A windowpane arrangement also reduces sensitivity to process variation because each grid pane is redundantly driven from four sides. In general, distributing the drivers widely across the chip