[Figure 43.5 diagram labels: PLL, GCLK grid, major clock grid, local clocks, conditioned local clocks, state elements.]
FIGURE 43.5 Alpha 21264 clock hierarchy. (From Bailey, D.W. and Behschneider, B.J., IEEE J. Solid-State Circuits, 33, 1627, 1998. With permission.)
[Figure 43.6 diagram label: PLL.]
FIGURE 43.6 Global clock distribution network of Alpha 21264. (From Bailey, D.W. and Behschneider, B.J., IEEE J. Solid-State Circuits, 33, 1627, 1998. With permission.)
also helps power-supply and heat-dissipation problems. The GCLK grid is shown in Figure 43.7. It traverses the entire die and uses 3 percent of M3 and M4. All clock interconnect is laterally shielded with either VDD or VSS. All clock wires and all lateral shields are manually placed. The measured GCLK skew is 65 ps running at 0°C ambient and 2.2 V.
The six major clocks are two gain stages past GCLK, with grids juxtaposed with GCLK but shielded from it. The major clock grids are shown in Figure 43.8. Because of the wide variation of clock loads, the grid density varies widely between major clocks, and sometimes even within a single major clock. The densest areas use up to 6 percent of M3 and M4. Major clocks driven by a gridded global clock substantially reduce power because major clock drivers are localized to the clock loads and major clock grids are locally sized to meet the skew targets. A gridded global clock without major clocks would require larger drivers and a denser grid to deliver the same clock skew and edges. Major clocks are designed so that the delay from GCLK is centered at 300 ps. The target specification for skew is ±50 ps, and the target for 10–90 percent rise and fall times is less than 320 ps. All major clocks easily meet both sets of objectives.
FIGURE 43.7 GCLK of Alpha 21264. (From Bailey, D.W. and Behschneider, B.J., IEEE J. Solid-State Circuits, 33, 1627, 1998. With permission.)
[Figure 43.8 diagram labels: PCLK, ECLK, JCLK.]
FIGURE 43.8 Six major clock grids of Alpha 21264. (From Bailey, D.W. and Behschneider, B.J., IEEE J. Solid-State Circuits, 33, 1627, 1998. With permission.)
Local clocks are generally neither gridded nor shielded. There are no strict limits on the number, size, or logic function of local-clock buffers, and there is no duty-cycle requirement, although timing path constraints must always be met. Local clocks have permitted ranges for clock rise and fall times, but with only this restriction there is considerable design freedom. This freedom facilitates clock gating to reduce power and clock skew scheduling to improve performance.
Because rather dense grid structures are required to meet the aggressive skew targets, the clock power consumption is very significant. At 600 MHz and 2.2 V, typical power usage for the processor is 72 W. The complete distribution network that drives GCLK uses 5.8 W, and GCLK uses 10.2 W. The major clocks use 14.0 W. Local unconditional clocks use 7.6 W, and local conditional clocks use a maximum of 15.6 W, assuming they switch every cycle.
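To put the Alpha 21264 power figures above in perspective, the short sketch below simply adds them up. It mixes the maximum figure for conditional clocks with the typical chip power, so the resulting share (about 53.2 W, or roughly 74 percent of 72 W) should be read as an upper estimate rather than a measured value. The dictionary keys and variable names are illustrative only.

```python
# Clock power budget of the Alpha 21264 at 600 MHz and 2.2 V.
# All wattages are the figures quoted in the text.
CLOCK_POWER_W = {
    "distribution network driving GCLK": 5.8,
    "GCLK grid": 10.2,
    "major clocks": 14.0,
    "local unconditional clocks": 7.6,
    "local conditional clocks (maximum)": 15.6,  # assumes switching every cycle
}
TYPICAL_CHIP_POWER_W = 72.0

# Upper estimate: the conditional-clock term is a maximum, not a typical value.
clock_total_w = sum(CLOCK_POWER_W.values())
clock_share = clock_total_w / TYPICAL_CHIP_POWER_W

for name, watts in CLOCK_POWER_W.items():
    print(f"{name:38s} {watts:5.1f} W")
print(f"{'total clock power (upper estimate)':38s} {clock_total_w:5.1f} W")
print(f"share of typical chip power: {100 * clock_share:.1f} percent")
```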
The clock distribution network design for a 1.2-GHz Alpha 21364 microprocessor can be found in Ref. [22]. We choose not to include the details here because Compaq, which acquired DEC in 1998, decided to phase out Alpha in 2001.
43.4 INTEL PENTIUM II
The clock distribution network design for a 300-MHz Intel Pentium II microprocessor is presented in Refs. [7,23]. The chip is fabricated in a 0.35-µm CMOS process with four metal layers. The power supply is 2.8 V. The chip has 7.5 million transistors and the die area is 203 mm². This processor uses a single-spine scheme to distribute the global clock, as shown in Figure 43.9. The spine is driven by a balanced tree with five levels of buffers. The global clock is distributed to all units in M4. The measured skew is also shown in Figure 43.9. The skew across the M4 global distribution is 140 ps. The low skew is achieved by balancing the load of each global clock tap and adjusting the global clock track length.
[Figure 43.9 diagram labels: five-level driver for 500 pF load with M4 metal strapping ring; input point to local buffers with clock gating; measured skews SK = −564, −476, −488, −592, −460, −424, and −548 ps.]
FIGURE 43.9 Global clock distribution network of Pentium II with electron-beam-measured skew. SK is the skew relative to the feedback point from the local buffer. (From Young, I.A., Mar, M.F., and Bushan, B., Proc. IEEE Intl. Solid-State Circuits Conf., pp. 330–331, 1997. With permission.)
43.5 INTEL PENTIUM III
The design of an Intel Pentium III microprocessor is presented in Ref. [8]. This chip has an operating voltage of 1.4–2.2 V and runs at up to 650 MHz. It is fabricated in a 0.25-µm CMOS process with five metal layers. It has 9.5 million transistors and the chip size is 10.17 mm × 12.10 mm. This processor uses a two-spine scheme for global clock distribution. A two-spine clock block diagram is shown in Figure 43.10. The two spines are shielded so that they are not affected by fringing fields from interconnects on either the core side or the I/O side. The two-spine scheme has several benefits over a single-spine approach. First, the serpentine wires can be shortened, and hence power consumption can be reduced. Second, power distribution to the clock subsystem becomes easier because the clock power demand is more spread out. Third, shielding of the clock network is easier because shields are more readily available at the sides than in the center. Fourth, routing congestion is reduced because no spine runs through the center of the chip, which is typically the most congested region.
Skew minimization between the two spines is a major challenge. Because of the lengthy left and right clock spines with multiple tap points, it was very difficult to match the delays accurately. In addition to precision capacitance-matching techniques on the global clock tree, an adaptive digital deskewing technique based on a delay-locked loop (DLL) was employed [24]. The deskewing circuit is composed of a delay line for each spine, a phase detection circuit, and a controller (Figure 43.10). The phase detection circuit determines the phase relationship between the two spines and generates an output accordingly. The controller takes the phase detection information and makes a discrete adjustment to one of the delay lines. Each digital delay line is implemented with two inverters in series. Each inverter has a bank of eight capacitive loads connected to its output. The addition or removal of the capacitive loads is controlled by the delay shift register. This allows 17 monotonic discrete steps of delay. Latency from sampling the clocks to adjusting the delay lines is just over three cycles. Note that this DLL-based deskewing scheme compensates not only for interconnect and device mismatch but also for process, voltage, and temperature variations. Adaptive deskewing helped to reduce the left-to-right clock spine skew from 100 to 15 ps.
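The deskewing loop described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the circuit of Ref. [24]: the per-load delay step, the unloaded delay, the phase-detector dead zone, and the controller policy (speed up the late spine first, then slow down the early one) are all hypothetical. The only values taken from the text are the two inverters per delay line and the eight loads per inverter; 16 switchable loads give settings 0 through 16, which is consistent with the 17 steps quoted.

```python
from dataclasses import dataclass

LOADS_PER_INVERTER = 8      # from the text
INVERTERS_PER_LINE = 2      # from the text
MAX_SETTING = LOADS_PER_INVERTER * INVERTERS_PER_LINE  # settings 0..16

STEP_PS = 6.0               # ASSUMED delay added per capacitive load
BASE_DELAY_PS = 50.0        # ASSUMED delay with no loads switched in
DEAD_ZONE_PS = STEP_PS / 2  # ASSUMED phase-detector dead zone


@dataclass
class DelayLine:
    setting: int = MAX_SETTING // 2  # start mid-range so both directions are available

    def delay_ps(self) -> float:
        return BASE_DELAY_PS + STEP_PS * self.setting


def spine_skew(left_path_ps, right_path_ps, left, right):
    """Left arrival time minus right arrival time, in ps."""
    return (left_path_ps + left.delay_ps()) - (right_path_ps + right.delay_ps())


def phase_detect(skew_ps):
    """Return +1 if the left spine is late, -1 if the right is late, 0 if aligned."""
    if skew_ps > DEAD_ZONE_PS:
        return 1
    if skew_ps < -DEAD_ZONE_PS:
        return -1
    return 0


def deskew(left_path_ps, right_path_ps, max_iters=64):
    """Make one discrete adjustment per iteration until aligned or saturated."""
    left, right = DelayLine(), DelayLine()
    for _ in range(max_iters):
        verdict = phase_detect(spine_skew(left_path_ps, right_path_ps, left, right))
        if verdict == 0:
            break
        late, early = (left, right) if verdict > 0 else (right, left)
        if late.setting > 0:
            late.setting -= 1          # remove a load from the late spine
        elif early.setting < MAX_SETTING:
            early.setting += 1         # add a load to the early spine
        else:
            break                      # both delay lines are saturated
    final = spine_skew(left_path_ps, right_path_ps, left, right)
    return final, left.setting, right.setting


if __name__ == "__main__":
    print(deskew(1040.0, 1000.0))  # 40 ps initial mismatch, within range
    print(deskew(1100.0, 1000.0))  # 100 ps initial mismatch, saturates in this model
```

With these assumed numbers the correction range is limited to 16 steps in total, so a mismatch larger than that range leaves a residual skew. The real design's step size and range are not given in the text.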
43.6 INTEL PENTIUM 4
The clocking scheme of a 2-GHz Intel Pentium 4 microprocessor is presented in Ref. [9]. The chip is fabricated in a 0.18-µm CMOS process with six metal layers. The chip has 42 million transistors and the die area is 217 mm².
[Figure 43.10 diagram labels: core, clk_Gen, two delay lines, two delay SRs, deskew ctl, PD, X clk, FB clk.]
FIGURE 43.10 Block diagram for two-spine global clock distribution of Pentium III. (From Senthinathan, R., Fischer, S., Rangchi, H., and Yazdanmehr, H., IEEE J. Solid-State Circuits, 34, 1454, 1999. With permission.)
FIGURE 43.11 Three spines in a 0.18-µm Pentium 4. (From Kurd, N., Barkatullah, J., and Dizon, R., IEEE J. Solid-State Circuits, 36, 1647, 2001. With permission.)
To cover the large Pentium 4 die, the global clock distribution uses three spines, as shown in Figure 43.11. A modified buffered binary tree distributes the global clock from the clock generator to the spines. The spines then drive 47 domain buffers, producing 47 independent clock domains (Figure 43.12). Domain buffers can be disabled to power down large functional units and save power. The clock distribution network includes a static skew optimization capability to correct systematic skew (caused by asymmetric layout or within-die process variation) as well as to provide intentional skew. Each domain buffer consists of a programmable delay stage controlled by a 5-bit domain deskew register (DDR) that determines the edge timing of the domain clock. The values of the DDRs can be set according to phase information obtained from a network of 46 phase detectors.
[Figure 43.12 diagram label: from PLL.]
FIGURE 43.12 Global clock distribution in Pentium 4. (From Kurd, N., Barkatullah, J., and Dizon, R., IEEE J. Solid-State Circuits, 36, 1647, 2001. With permission.)
[Figure 43.13 diagram labels: clock stripes; 10.7 mm die dimension.]
FIGURE 43.13 Eight stripes (i.e., spines) in a 90-nm Pentium 4. (From Bindal, N. et al., Proc. IEEE Intl. Solid-State Circuits Conf., pp. 346–498, 2003.)
This deskewing scheme can reduce interdomain skew from 64 to about 16 ps. A major component of the clock distribution jitter is due to supply noise from logic switching. To reduce supply-noise-induced jitter, an RC-filtered power supply is used for the global clock drivers.
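The static deskew idea for the 0.18-µm Pentium 4 (phase detectors report relative timing, and 5-bit DDR codes then retime each domain) can be sketched as follows. The text gives only the counts (47 domains, 46 phase detectors) and the register width. Everything else here is an assumption made for illustration: that the detectors connect the domains as a tree, that each DDR step adds a fixed delay `STEP_PS`, and that all domains are aligned to the latest one. The five-domain example data are invented.

```python
from collections import defaultdict, deque

DDR_BITS = 5                      # from the text
DDR_MAX = (1 << DDR_BITS) - 1     # codes 0..31
STEP_PS = 4.0                     # ASSUMED delay per DDR step


def arrival_offsets(num_domains, detectors, ref=0):
    """Propagate pairwise readings (a, b, lag_ps), meaning clock b arrives
    lag_ps after clock a, outward from a reference domain (ASSUMED tree)."""
    adj = defaultdict(list)
    for a, b, lag in detectors:
        adj[a].append((b, lag))
        adj[b].append((a, -lag))
    offsets = {ref: 0.0}
    queue = deque([ref])
    while queue:
        u = queue.popleft()
        for v, lag in adj[u]:
            if v not in offsets:
                offsets[v] = offsets[u] + lag
                queue.append(v)
    if len(offsets) != num_domains:
        raise ValueError("phase detectors do not connect every domain")
    return [offsets[d] for d in range(num_domains)]


def ddr_codes(offsets):
    """Delay every domain toward the latest one, quantized to DDR steps."""
    latest = max(offsets)
    return [min(max(round((latest - off) / STEP_PS), 0), DDR_MAX) for off in offsets]


def residual_skew(offsets, codes):
    arrivals = [off + code * STEP_PS for off, code in zip(offsets, codes)]
    return max(arrivals) - min(arrivals)


# Invented example: 5 domains joined by 4 detectors (n - 1 readings, as in 47 and 46).
readings = [(0, 1, 10), (1, 2, -22), (0, 3, 31), (3, 4, -7)]
offsets = arrival_offsets(5, readings)
codes = ddr_codes(offsets)
print("offsets (ps):", offsets)
print("DDR codes:   ", codes)
print("skew before:", max(offsets) - min(offsets), "ps; after:", residual_skew(offsets, codes), "ps")
```

In this model the residual skew is bounded by the quantization step, which mirrors the qualitative behavior reported above (skew reduced but not eliminated).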
The clock distribution design for a next-generation Pentium 4 microprocessor that scales to 5 GHz is described in Ref. [10]. The chip is implemented in a 1.2-V, 90-nm dual-Vt process with seven metal layers. The die size is 10.2 mm × 10.7 mm.
The clock network consists of a pre-global clock network (PGCN), a global clock grid (GCG), and local clocking. The PGCN comprises 12 inversion stages from the PLL to the die center, and 15 stages to the input of more than 1400 GCG drivers. It has a tree structure with strategic shorting of the inputs of adjacent receivers within a stage to eliminate skew accumulation over multiple stages caused by random variations. Shorting adjacent receivers provides a very gradual clock skew gradient at the inputs of adjacent GCG drivers. The GCG consists of eight spines spaced roughly 1200 µm apart, as shown in Figure 43.13. The local scheme consists of two stages of gated buffering. The first stage is used for reducing power consumption through clock gating. The second stage is reserved for functional gating. The design achieves less than 10 ps of global clock skew. The final grid stage and its drivers dissipate 1.75 W/GHz, in addition to 0.75 W/GHz in the PGCN. Overall die area allocation ranges from 0.25 percent for devices and lower metals to less than 2, 3, and 5 percent for the M5, M6, and M7 layers, respectively.
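The per-gigahertz power figures quoted above for the 90-nm Pentium 4 can be turned into absolute numbers with a one-line calculation. The sketch assumes that global clock power scales linearly with frequency at a fixed supply voltage, which is what a W/GHz figure implies but which the text does not state explicitly. The function name is illustrative.

```python
GRID_W_PER_GHZ = 1.75   # final grid stage and its drivers (from the text)
PGCN_W_PER_GHZ = 0.75   # pre-global clock network (from the text)


def global_clock_power_w(freq_ghz: float) -> float:
    """Estimated global clock power, ASSUMING linear scaling with frequency."""
    return (GRID_W_PER_GHZ + PGCN_W_PER_GHZ) * freq_ghz


for freq_ghz in (3.0, 4.0, 5.0):
    print(f"{freq_ghz:.1f} GHz: {global_clock_power_w(freq_ghz):.2f} W")
```

Under this assumption the global clock network would draw 2.5 W per GHz, or 12.5 W at the 5-GHz scaling target. Local clocking is not included in either figure.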
43.7 INTEL ITANIUM
The clock design of an 800-MHz Itanium microprocessor is presented in Ref. [11]. The microprocessor is the first implementation of Intel's IA-64 architecture. Its core contains 25.4 million transistors and is fabricated in a 0.18-µm, six-layer-metal CMOS process. The high level of integration requires significant silicon real estate and results in high clock loading. The large die size and the small feature size result in prominent within-die process variation. Hence, the Itanium processor uses an active deskewing scheme in conjunction with a combined balanced clock tree and clock grid to distribute the clock over the die. The design also provides enough flexibility for the local clock implementation to support intentional clock skew and time borrowing.
[Figure 43.14 diagram labels: reference clock, main clock, PLL, CLKP, CLKN, VCC/2, GCLK, DSK, RCD, DLCLK, OTB; global distribution, regional distribution, local distribution.]
FIGURE 43.14 Clock distribution topology of the Itanium microprocessor. (From Tam, S. et al., IEEE J. Solid-State Circuits, 35, 1545, 2000. With permission.)
[Figure 43.15 diagram labels: PLL, DSK.]
FIGURE 43.15 Global core H-tree of the Itanium microprocessor. (From Tam, S. et al., IEEE J. Solid-State Circuits, 35, 1545, 2000. With permission.)
The clock system architecture is shown in Figure 43.14. The clock topology is partitioned into global distribution, regional distribution, and local distribution.
In the global distribution, a core clock and a reference clock are routed from a PLL clock generator to eight deskew clusters via two identical and balanced H-trees. A schematic drawing of the global core clock tree is shown in Figure 43.15. The global clock tree is implemented exclusively in the two highest-level metal layers. To reduce capacitive noise coupling and to ensure a good inductive return path, the tree is fully shielded laterally with VDD/VSS. In addition, inductive reflections at the branch points are minimized by properly sizing the metal widths for impedance matching.
The regional clock distribution encompasses the deskew buffer, the regional clock driver (RCD), and the regional clock grid. There are 30 separate clock regions, each consisting of these three elements. The 30 regional clocks are illustrated in Figure 43.16. Each of the eight deskew clusters consists of four distinct deskew buffers. Because 32 deskew buffers are available, two of them are unused. The deskew buffer is connected to the RCDs by a binary distribution network, which uses top-layer metals with complete lateral shielding. The RCDs are located at the top and bottom of the regional clock grid. The grid is implemented using M4 and M5. As with the global clock network, it contains full lateral shielding to ensure low capacitive coupling and good inductive return paths. The regional clock grid utilizes up to 3.5 percent of the available M5 and up to 4.1 percent of the available M4 routing over a region.
[Figure 43.16 diagram labels: DSK, CDC; legend: cluster of four deskew buffers, central deskew controller.]
FIGURE 43.16 Thirty regional clocks of the Itanium microprocessor. (From Tam, S. et al., IEEE J. Solid-State Circuits, 35, 1545, 2000. With permission.)
[Figure 43.17 diagram labels: global clock, ref clock, deskew buffer, delay circuit, local controller, TAP I/F, RCD, regional clock grid.]
FIGURE 43.17 Deskew buffer architecture of the Itanium microprocessor. (From Tam, S. et al., IEEE J. Solid-State Circuits, 35, 1545, 2000. With permission.)
The deskew buffer architecture is shown in Figure 43.17. It is a digitally controlled DLL structure. A phase detector residing within the local controller of the deskew buffer analyzes the phase difference between the reference clock and a local feedback clock sampled from the regional clock grid. The core clock delay is then adjusted through a digitally controlled analog delay line. Experimental skew measurements show that the total skew is 28 ps with deskewing and 110 ps without deskewing.
The local clock distribution consists of local clock buffers (LCBs) and local clock routing embedded within a functional unit. The LCBs receive their input directly from the regional clock grid and drive the clocked sequential elements.
43.8 INTEL ITANIUM 2
The clock distribution of the 1-GHz Itanium 2 processor is described in Refs. [12,25]. The chip is fabricated in a 180-nm CMOS process with six layers of aluminum interconnect. The processor has 25 million logic transistors and 221 million total transistors. The die size is 21.6 mm × 19.5 mm.
[Figure 43.18 diagram labels: A, B, C; core primary driver, repeaters, SLCBs, gaters, L2R, to PLL, FSB clock, from latch pipe.]
FIGURE 43.18 Clock distribution of the Itanium 2 microprocessor. (From Anderson, F.E., et al., Proc. IEEE Intl. Solid-State Circuits Conf., pp. 146–147, 2002. With permission.)
The clock network of this processor is shown in Figure 43.18. Similar to that of the Itanium processor in Section 43.7, it can be partitioned into global distribution (L1R), regional distribution (L2R, driven by second-level clock buffers [SLCBs]), and local distribution (driven by gaters). However, it differs from that of the Itanium in three significant ways. First, the global clock network, which is also implemented as a balanced H-tree, uses differential routing to reduce jitter from supply noise, injected common-mode noise, and signal slew rates. It is also heavily shielded to reduce jitter caused by coupled noise. Second, instead of grids, the regional distribution uses width- and length-balanced, side-shielded H-trees. Third, no deskewing technique is used. The skew is minimized by precisely tuning the delay of the H-trees. The design achieves a skew of 62 ps.
The clock distribution of a more advanced Itanium 2 processor is presented in Ref. [13]. This chip is fabricated in a 130-nm CMOS process with six layers of copper interconnect. It operates at 1.5 GHz at 1.3 V. It has a total of 410 million transistors with a die size of 374 mm².
The main difference from its 180-nm predecessor is that this design implements a fuse-based deskewing technique to address clock skew and to increase the frequency of operation. There are 23 regional clocks in the core. The SLCB associated with each region contains a 5-bit register that stores the deskew setting. The register controls the delay of the SLCB. On-chip electrically programmable fuses are incorporated to set the register values. To reduce the area required for the fuses, only three of the five deskew setting bits can be addressed with fuses. When the device is under test, all five deskew bits can be accessed using SCAN for finer resolution. The fuse-based deskew can remove unintentional clock skew caused by on-die process variations and clock network design mismatches. It can also inject intentional skew to improve critical timing paths. A fuse-based deskew scheme was selected over an active scheme because of the deterministic nature of the fuse-based algorithm and its simple implementation. The intrinsic skew without any deskew technique is 71 ps. The skew is reduced to 24 ps with the 3-bit-resolution fuse-based deskew, and further to 7 ps when the 5-bit-resolution SCAN-based deskew is applied.
The clock distribution of a dual-core Itanium 2 processor, code-named Montecito, is described
in Ref. [14]. The chip is fabricated in a 90-nm CMOS process with seven layers of copper interconnect, and it has 1.72 billion transistors with a die size of 21.5 mm × 27.7 mm [26]. It implements a dynamically variable-frequency clock system to support a power management scheme that maximizes processor performance within a configured power envelope [27]. Its clock distribution delivers a variable-frequency clock from 100 MHz to 2.5 GHz over a clock network more than 28 mm long.
[Figure 43.19 diagram labels: PLL, bus clock, bus logic, IOs, Foxton, Core0, Core1, repeaters, DFD, SLCB, CVD, RAD, gaters, latches; fixed-frequency low-voltage swings; variable-frequency full-rail transitions.]
FIGURE 43.19 Clock distribution of the dual-core Itanium 2 microprocessor. (From Mahoney, P., et al., Proc. IEEE Intl. Solid-State Circuits Conf., pp. 292–293, 2005. With permission.)
The clock network consists of four stages, as shown in Figure 43.19. The first stage is the level-0 (L0) route, which connects the PLL to 14 digital frequency dividers (DFDs). The L0 route is the only stage that does not adjust supplies and frequencies during normal operation. The L0 route is 20 mm long and consists of four 5-mm segments that are 400-mV low-voltage-swing differential routes. Each segment is resistively terminated at the receiver and is tapered to optimize RLC flight time and reduce power consumption. All route segments are matched in composition, in both layer and length. The second stage, the level-1 (L1) route, connects each DFD to 6–10 SLCBs. The DFD output varies in frequency and operates on a varying core supply voltage. A half-frequency distribution using differential 0° and 90° clocks is used. The third stage, the level-2 (L2) route, connects the SLCBs to the LCBs. A typical SLCB drives 400 LCBs at 200 different locations across 3 mm with a skew of less than 6 ps between locations. For this stage, instead of a grid-based network as in many contemporary designs, a skew-matched RLC tree network is employed to reduce metal resources and power. An in-house tool is used to route the trees and to match route RLC delays using width and spacing. The resulting clock route is adaptable to changes in the design and uses far less metal and power than a grid-based design, while achieving skews that are nearly as low as those of grid-based designs. The LCBs, called clock vernier devices (CVDs), can add 70 ps of delay to any clock in 8-ps increments and are controlled via scan operations. They facilitate postsilicon debug and remove skew not found in presilicon analysis. The fourth stage, the postgater route, is in the hands of the individual circuit designers. Clock gaters are designed by the clock team into the library in a variety of sizes. With hundreds of latches per gater, routes up to 2 mm long must be engineered for delay, shielding, and load matching.
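The half-frequency L1 distribution with 0° and 90° clocks mentioned above raises the question of how a full-rate clock is obtained at the receiving end. The text does not say which circuit Montecito uses. One standard way, shown here purely as an illustration, is to XOR the two quadrature half-rate clocks, which yields a 50 percent duty-cycle clock at twice their frequency. The tick-based time model and all names are hypothetical.

```python
HALF_RATE_PERIOD = 16                     # ticks per half-frequency period (arbitrary)
FULL_RATE_PERIOD = HALF_RATE_PERIOD // 2  # the clock we want to recover
QUARTER = HALF_RATE_PERIOD // 4           # 90 degrees of the half-rate clock


def square(t, period, delay=0):
    """Ideal 50 percent duty-cycle clock: 1 during the first half of each period."""
    return 1 if (t - delay) % period < period // 2 else 0


def recovered_clock(t):
    """XOR of the 0-degree and 90-degree half-rate clocks (ASSUMED recombination)."""
    clk0 = square(t, HALF_RATE_PERIOD)
    clk90 = square(t, HALF_RATE_PERIOD, delay=QUARTER)
    return clk0 ^ clk90


ticks = range(2 * HALF_RATE_PERIOD)
print("clk0 :", "".join(str(square(t, HALF_RATE_PERIOD)) for t in ticks))
print("clk90:", "".join(str(square(t, HALF_RATE_PERIOD, delay=QUARTER)) for t in ticks))
print("xor  :", "".join(str(recovered_clock(t)) for t in ticks))
```

Each wire on the long route toggles at half the delivered clock rate, while the XOR output matches an ideal full-rate clock in this idealized model. Any phase error between the two half-rate clocks would appear as duty-cycle distortion at the output, which the sketch does not model.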
Montecito implements an active deskewing system that runs continuously to null out offsets caused by process, temperature, and voltage variations across the die. The system relies on a hierarchical collection of phase comparators between the ends of different L2 routes (i.e., only the first three stages are corrected by deskewing). Each SLCB has a 128-bit delay line with 1-ps resolution. With active deskewing and scan-chain adjustments, the total clock-network skew is reduced to less than 10 ps.
REFERENCES
1. N. Bindal and E. Friedman. Challenges in clock distribution networks. In Proc. Intl. Symp. on Phys. Des., Monterey, CA, p. 2, 1999.
2. Q. K. Zhu. High-Speed Clock Network Design. Kluwer Academic, Boston, 2003.