Handbook of Algorithms for Physical Design Automation



[Figure labels: clock, PLL, GCLK grid, major clock grid, local clocks, conditioned local clocks, state elements]

FIGURE 43.5 Alpha 21264 clock hierarchy. (From Bailey, D.W. and Behschneider, B.J., IEEE J. Solid-State Circuits, 33, 1627, 1998. With permission.)

FIGURE 43.6 Global clock distribution network of Alpha 21264. (From Bailey, D.W. and Behschneider, B.J., IEEE J. Solid-State Circuits, 33, 1627, 1998. With permission.)

also helps power-supply and heat-dissipation problems. The GCLK grid is shown in Figure 43.7. It traverses the entire die and uses 3 percent of M3 and M4. All clock interconnect is laterally shielded with either VDD or VSS. All clock wires and all lateral shields are manually placed. The measured GCLK skew is 65 ps running at 0°C ambient and 2.2 V.

The six major clocks are two gain stages past GCLK, with grids juxtaposed with GCLK but shielded from it. The major clock grids are shown in Figure 43.8. Because of the wide variation of clock loads, the grid density varies widely between major clocks, and sometimes even within a single major clock. The densest areas use up to 6 percent of M3 and M4. Major clocks driven by a gridded global clock substantially reduce power because major clock drivers are localized to the clock loads and major clock grids are locally sized to meet the skew targets. A gridded global clock without major clocks would require larger drivers and a denser grid to deliver the same clock skew and edges. Major clocks are designed so that delay from GCLK is centered at 300 ps. The target specification for skew is ±50 ps. The target specification for 10–90 percent rise and fall times is less than 320 ps. All major clocks easily meet both sets of objectives.

FIGURE 43.7 GCLK of Alpha 21264 (From Bailey, D.W. and Behschneider, B.J., IEEE J. Solid-State

[Figure labels: PCLK, ECLK, JCLK]

FIGURE 43.8 Six major clock grids of Alpha 21264. (From Bailey, D.W. and Behschneider, B.J., IEEE J. Solid-State Circuits, 33, 1627, 1998. With permission.)


Local clocks are generally neither gridded nor shielded. There are no strict limits on the number, size, or logic function of local-clock buffers, and there is no duty-cycle requirement, although timing path constraints must always be met. Local clocks have permitted ranges for clock rise and fall times, but with only this restriction there is considerable design freedom. This freedom facilitates the implementation of clock gating to reduce power and clock skew scheduling to improve performance.

Because rather dense grid structures are required to meet the aggressive skew targets, the clock power consumption is very significant. At 600 MHz and 2.2 V, typical power usage for the processor is 72 W. The complete distribution network that drives GCLK uses 5.8 W, and GCLK uses 10.2 W. The major clocks use 14.0 W. Local unconditional clocks use 7.6 W, and local conditional clocks use a maximum of 15.6 W, assuming they switch every cycle.

The clock distribution network design for a 1.2-GHz Alpha 21364 microprocessor can be found in Ref. [22]. We choose not to include the details here, as Compaq, which acquired DEC in 1998, decided to phase out Alpha in 2001.

43.4 INTEL PENTIUM II

The clock distribution network design for a 300-MHz Intel Pentium II microprocessor is presented in Refs. [7,23]. The chip is fabricated in a 0.35-µm CMOS process with four metal layers. The power supply is 2.8 V. The chip has 7.5 million transistors and the die area is 203 mm². This processor uses a single-spine scheme to distribute the global clock, as shown in Figure 43.9. The spine is driven by a balanced tree with five levels of buffers. The global clock is distributed to all units in M4. The measured skew is also shown in Figure 43.9. The skew across the M4 global distribution is 140 ps. The low skew is achieved by balancing the load of each global clock tapping and adjusting the global clock track length.

[Figure labels: skew annotations SK = −564 ps, −476 ps, −488 ps, −592 ps, −460 ps, −424 ps, −548 ps; input point to local buffers with clock gating; five-level driver for 500 pF load with M4 metal strapping ring]

FIGURE 43.9 Global clock distribution network of Pentium II with electron beam measured skew. SK is the skew relative to feedback point from local buffer. (From Young, I.A., Mar, M.F., and Bushan, B., Proc. IEEE Intl. Solid-State Circuits Conf., pp. 330–331, 1997. With permission.)


43.5 INTEL PENTIUM III

The design of an Intel Pentium III microprocessor is presented in Ref. [8]. This chip has an operating voltage of 1.4–2.2 V and runs at up to 650 MHz. It is fabricated in a 0.25-µm CMOS process with five metal layers. It has 9.5 million transistors and the chip size is 10.17 mm × 12.10 mm. This processor uses a two-spine scheme for global clock distribution. A two-spine clock block diagram is shown in Figure 43.10. The two spines were shielded properly such that they would not be impacted by the fringing fields from any interconnects associated with the core as well as I/O sides. The two-spine scheme has many benefits over a single-spine approach. First, the serpentine wires can be shortened, and hence power consumption can be reduced. Second, power distribution to the clock subsystem becomes easier as the clock power demand is more spread out. Third, shielding of the clock network is also easier as shields are more readily available on the sides than in the center. Fourth, routing congestion can be improved because there is no spine running through the center of the chip, which is typically the most congested part.

Skew minimization between the two spines is a major challenge. Because of the lengthy left and right clock spines with multiple tap points, it was very difficult to match the delays with good accuracy. In addition to precision capacitance matching techniques on the global clock tree, an adaptive digital deskewing technique based on a delay-locked loop (DLL) was employed [24]. The deskewing circuit is composed of delay lines to both spines, a phase detection circuit, and a controller (Figure 43.10). The phase detection circuit determines the phase relationship between the two spines and generates an output accordingly. The controller takes the phase detection information and makes a discrete adjustment to one of the delay lines. The digital delay line is implemented with two inverters in series. Each inverter has a bank of eight capacitive loads connecting to the output. The addition or removal of the capacitive loads is controlled by the delay shift register. This allows 17 monotonic discrete steps of delay. Latency from sampling clocks to making an adjustment to the delay lines is just over three cycles. Note that this DLL-based deskewing scheme compensates not only for interconnect/device mismatch but also for process, voltage, and temperature variations. Adaptive deskewing helped to reduce the left-to-right clock spine skews from 100 to 15 ps.
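The structure of the digital delay line described above can be sketched in a few lines of Python. The counts (two inverters, eight loads each, 17 settings) come from the text. The base delay, the per-load step, and the rule of always adding delay to the early spine are illustrative assumptions, not values or behavior taken from Ref. [24].

```python
from dataclasses import dataclass

LOADS_PER_INVERTER = 8   # from the text: each inverter has a bank of eight loads
NUM_INVERTERS = 2        # from the text: two inverters in series

# Hypothetical numbers, chosen only for illustration.
BASE_DELAY_PS = 100.0    # delay with no capacitive load switched in
STEP_PS = 6.0            # extra delay contributed by each enabled load


@dataclass
class DelayLine:
    """Delay line whose loads are switched in one at a time by a shift register."""
    enabled: int = 0  # number of loads currently switched in (0..16)

    @property
    def max_setting(self):
        return LOADS_PER_INVERTER * NUM_INVERTERS

    def shift(self, direction):
        """Add (+1) or remove (-1) one load, saturating at both ends."""
        self.enabled = min(self.max_setting, max(0, self.enabled + direction))

    def delay_ps(self):
        return BASE_DELAY_PS + STEP_PS * self.enabled


def all_settings():
    """Walk the line from no loads to all loads and record each delay."""
    line = DelayLine()
    delays = [line.delay_ps()]
    for _ in range(line.max_setting):
        line.shift(+1)
        delays.append(line.delay_ps())
    return delays


def controller_step(left, right, left_leads):
    """One discrete adjustment (assumed policy): slow down the early spine."""
    (left if left_leads else right).shift(+1)
```

Sixteen loads give settings 0 through 16, which is where the 17 monotonic steps come from. A real controller would also need a policy for a saturated line, which this sketch leaves out.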

43.6 INTEL PENTIUM 4

The clocking scheme of a 2-GHz Intel Pentium 4 microprocessor is presented in Ref. [9]. The chip is fabricated in a 0.18-µm CMOS process with six metal layers. The chip has 42 million transistors and the die area is 217 mm².

[Figure labels: core, PD, clk_Gen, delay line (two), delay SR (two), deskew ctl, X clk, FB clk]

FIGURE 43.10 Block diagram for two-spine global clock distribution of Pentium III. (From Senthinathan, R., Fischer, S., Rangchi, H., and Yazdanmehr, H., IEEE J. Solid-State Circuits, 3, 1454, 1999. With permission.)


FIGURE 43.11 Three spines in a 0.18-µm Pentium 4 (From Kurd, N., Barkatullah, J., and Dizon, R., IEEE

To cover the large Pentium 4 die, its global clock distribution uses three spines, as shown in Figure 43.11. A modified buffered binary tree is used to distribute the global clock from the clock generator to the spines. Then 47 domain buffers are driven, producing 47 independent clock domains (Figure 43.12). Domain buffers can be disabled to power down large functional units to save power. The clock distribution network includes static skew optimization capability to correct systematic skew (caused by asymmetric layout or within-die process variation) as well as to provide intentional skew. Each domain buffer consists of a programmable delay stage controlled by a 5-bit domain deskew register (DDR) that determines the edge timing of the domain clock. The values of the DDRs can be set according to phase information obtained by a phase-detector network of 46 phase detectors.


FIGURE 43.12 Global clock distribution in Pentium 4 (From Kurd, N., Barkatullah, J., and Dizon, R., IEEE


[Figure labels: 10.7 mm, clock stripes]

FIGURE 43.13 Eight stripes (i.e., spines) in a 90-nm Pentium 4. (From Bindal, N. et al., Proc. IEEE Intl. Solid-State Circuits Conf., pp. 346–498, 2003.)

This deskewing scheme can reduce interdomain skew from 64 to about 16 ps. A major component of the clock distribution jitter is due to supply noise from logic switching. To reduce supply-noise-induced jitter, an RC-filtered power supply is used for global clock drivers.
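The static deskew procedure for the 47 domains can be illustrated with a small sketch. With 47 domains and 46 phase detectors, the detectors can connect all domains in a tree, so every domain's offset relative to one reference domain follows by walking that tree. The function names, the signed picosecond readings, and the 8-ps DDR step are assumptions made for illustration. Only the 5-bit register width comes from the text.

```python
DDR_BITS = 5                     # from the text: 5-bit domain deskew register
DDR_MAX = (1 << DDR_BITS) - 1    # largest register code (31)
STEP_PS = 8.0                    # hypothetical delay added per DDR step


def domain_offsets(num_domains, detectors):
    """Arrival offset of each domain relative to domain 0.

    detectors is a list of (a, b, phase_ps) tuples, meaning that domain b
    arrives phase_ps later than domain a (assumed reading format).
    """
    adjacent = {d: [] for d in range(num_domains)}
    for a, b, phase in detectors:
        adjacent[a].append((b, phase))
        adjacent[b].append((a, -phase))
    offsets = {0: 0.0}
    stack = [0]
    while stack:
        node = stack.pop()
        for neighbor, phase in adjacent[node]:
            if neighbor not in offsets:
                offsets[neighbor] = offsets[node] + phase
                stack.append(neighbor)
    if len(offsets) != num_domains:
        raise ValueError("phase detectors do not connect every domain")
    return [offsets[d] for d in range(num_domains)]


def ddr_codes(offsets):
    """Delay each domain toward the latest one, rounded to whole DDR steps."""
    latest = max(offsets)
    codes = []
    for offset in offsets:
        code = round((latest - offset) / STEP_PS)
        codes.append(min(DDR_MAX, max(0, code)))
    return codes


def residual_skew(offsets, codes):
    """Spread of arrival times after applying the DDR codes."""
    arrivals = [off + STEP_PS * code for off, code in zip(offsets, codes)]
    return max(arrivals) - min(arrivals)
```

For a made-up four-domain case with readings (0, 1, 10), (1, 2, -30), and (0, 3, 21), the offsets span 41 ps before correction and 6 ps after, limited by the assumed step size. The residual after deskewing is set by the delay-stage resolution rather than by the original mismatch.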

The clock distribution design for a next-generation Pentium 4 microprocessor that scales to 5 GHz is described in Ref. [10]. The chip is implemented in a 1.2-V, 90-nm dual-Vt process with seven metal layers. The die size is 10.2 mm × 10.7 mm.

The clock network consists of a pre-global clock network (PGCN), a global clock grid (GCG), and local clocking. The PGCN comprises 12 inversion stages from the PLL to the die center, and 15 stages to the input of more than 1400 GCG drivers. It has a tree structure with strategic shorting of inputs to adjacent receivers within a stage to eliminate skew accumulation over multiple stages because of random variations. Shorting of adjacent receivers provides a very gradual clock skew gradient at the input to adjacent GCG drivers. The GCG consists of eight spines spaced roughly 1200 µm apart, as shown in Figure 43.13. The local scheme consists of two stages of gated buffering. The first stage is used for reducing power consumption through clock gating. The second stage is reserved for functional gating. The design achieves less than 10 ps of global clock skew. The final grid stage and its driver dissipate 1.75 W/GHz, in addition to 0.75 W/GHz in the PGCN. Overall die area allocation ranges from 0.25 percent for devices and lower metals to less than 2, 3, and 5 percent for the M5, M6, and M7 layers, respectively.
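The two per-gigahertz figures quoted above combine into a simple estimate of global clock power at a given frequency. The sketch below assumes that power scales linearly with frequency at a fixed supply voltage and that the two figures do not overlap. It excludes local clocking, for which the text gives no number. The function name is hypothetical.

```python
GRID_W_PER_GHZ = 1.75   # final grid stage and its drivers (from the text)
PGCN_W_PER_GHZ = 0.75   # pre-global clock network (from the text)


def global_clock_power_w(freq_ghz):
    """Estimated PGCN plus grid power in watts, assuming linear scaling."""
    return (GRID_W_PER_GHZ + PGCN_W_PER_GHZ) * freq_ghz
```

Under these assumptions the global network costs 2.5 W for every gigahertz of clock frequency.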

43.7 INTEL ITANIUM

The clock design of an 800-MHz Itanium microprocessor is presented in Ref. [11]. The microprocessor is the first implementation of Intel's IA-64 architecture. Its core contains 25.4 million transistors and is fabricated on a 0.18-µm, six-layer metal CMOS process. The high level of integration requires significant silicon real estate and high clock loading. The large die size and the small feature size result in prominent within-die process variation. Hence, the Itanium processor uses an active deskewing scheme in conjunction with a combined balanced clock tree and clock grid to distribute the clock over the die. The design also provides enough flexibility for the local clock implementation to support intentional clock skew and time borrowing.


[Figure labels: reference clock, main clock, PLL, global distribution, regional distribution, local distribution, GCLK, CLKP, CLKN, VCC/2, DSK, RCD, DLCLK, OTB]

FIGURE 43.14 Clock distribution topology of the Itanium microprocessor (From Tam, S. et al., IEEE

[Figure labels: PLL, DSK]

FIGURE 43.15 Global core H-tree of the Itanium microprocessor (From Tam, S. et al., IEEE J. Solid-State

The clock system architecture is shown in Figure 43.14. The clock topology is partitioned into global distribution, regional distribution, and local distribution.

In the global distribution, a core clock and a reference clock are routed from a PLL clock generator to eight deskew clusters via two identical and balanced H-trees. A schematic drawing of the global core clock tree is shown in Figure 43.15. The global clock tree is implemented exclusively in the two highest-level metal layers. To reduce capacitive noise coupling and to ensure a good inductive return path, the tree is fully shielded laterally with VDD/VSS. In addition, inductive reflections at the branch points are minimized by properly sizing the metal widths for impedance matching.

The regional clock distribution encompasses the deskew buffer, the regional clock driver (RCD), and the regional clock grid. There are 30 separate clock regions, each consisting of the above three elements. The 30 regional clocks are illustrated in Figure 43.16. Each of the eight deskew clusters consists of four distinct deskew buffers. Because 32 deskew buffers are available, two of them are unused. The deskew buffer is connected to the RCDs by a binary distribution network, which uses top-layer metals with complete lateral shielding. The RCDs are located at the top and bottom of the regional clock grid. The grid is implemented using M4 and M5. As with the global clock network, it contains full lateral shielding to ensure low capacitance coupling and good inductive return paths. The regional clock grid utilizes up to 3.5 percent of the available M5 and up to 4.1 percent of the available M4 routing over a region.

The deskew buffer architecture is shown in Figure 43.17. It is a digitally controlled DLL structure. A phase detector residing within the local controller of the deskew buffer analyzes the phase difference between the reference clock and a local feedback clock sampled from the regional clock grid. Then the core clock delay is adjusted through a digitally controlled analogue delay line. Experimental skew measurements show that the total skew is 28 ps with deskewing and 110 ps without deskewing.

The local clock distribution consists of local clock buffers (LCBs) and local clock routings that are embedded within a functional unit. The LCBs receive the input directly from the regional clock grid and then drive the clocked sequential elements.

[Figure labels: DSK = cluster of four deskew buffers, CDC = central deskew controller]

FIGURE 43.16 Thirty regional clocks of the Itanium microprocessor (From Tam, S. et al., IEEE J. Solid-State

[Figure labels: global clock, ref clock, deskew buffer, delay circuit, local controller, TAP I/F, RCD, regional clock grid]

FIGURE 43.17 Deskew buffer architecture of the Itanium microprocessor. (From Tam, S. et al., IEEE J. Solid-State Circuits, 35, 1545, 2000. With permission.)

43.8 INTEL ITANIUM 2

The clock distribution of the 1-GHz Itanium 2 processor is described in Refs. [12,25]. The chip is fabricated on a 180-nm CMOS process with six layers of aluminum interconnects. The processor has 25 million logic transistors and 221 million total transistors. The die size is 21.6 mm × 19.5 mm.


[Figure labels: core primary driver, repeaters, SLCBs, gaters, L2R, to PLL, FSB clock, from latch pipe, A, B, C]

FIGURE 43.18 Clock distribution of the Itanium 2 microprocessor. (From Anderson, F.E., et al., Proc. IEEE Intl. Solid-State Circuits Conf., pp. 146–147, 2002. With permission.)

The clock network of this processor is shown in Figure 43.18. Similar to that of the Itanium processor in Section 43.7, it can be partitioned into global distribution (L1R), regional distribution (L2R, driven by second-level clock buffers [SLCBs]), and local distribution (driven by gaters). However, it has three significant differences from that of Itanium. First, the global clock network, which is also implemented as a balanced H-tree, applies differential routing to reduce jitter from supply noise, injected common-mode noise, and signal slew rates. It is also heavily shielded to reduce jitter caused by coupled noise. Second, instead of grids, the regional distribution makes use of width- and length-balanced side-shielded H-trees. Third, no deskewing technique is used. The skew is minimized by precisely tuning the delay of the H-trees. The design achieves a skew of 62 ps.

The clock distribution of a more advanced Itanium 2 processor is presented in Ref. [13]. This chip is fabricated on a 130-nm CMOS process with six layers of copper interconnects. It operates at 1.5 GHz at 1.3 V. It has a total of 410 million transistors with a die size of 374 mm².

The main difference from its 180-nm predecessor is that this design implements a fuse-based deskewing technique to address the clock skew issue and to increase the frequency of operation. There are 23 regional clocks in the core. The SLCB associated with each region contains a 5-bit register that stores the deskew setting. The register controls the delay of the SLCB. On-chip electrically programmable fuses are incorporated to set the register values. To reduce the area required for the fuses, only three of the five deskew setting bits can be addressed with fuses. When the device is under test, all five deskew bits can be accessed using SCAN for finer resolution. The fuse-based deskew can remove unintentional clock skew caused by on-die process variations and clock network design mismatches. It can also inject intentional skew to improve the critical timing paths. A fuse-based deskew scheme is selected over an active scheme because of the deterministic nature of the fuse-based algorithm and its simple implementation. The intrinsic skew without using any deskew technique is 71 ps. The skew reduces to 24 ps when operating with the 3-bit resolution fuse-based deskew. It further reduces to 7 ps when the 5-bit resolution SCAN-based deskew is applied.

The clock distribution of a dual-core Itanium 2 processor, code-named Montecito, is described in Ref. [14]. The chip is fabricated on a 90-nm CMOS process with seven layers of copper interconnect, and it has 1.72 billion transistors with a die size of 21.5 mm × 27.7 mm [26]. It implements a dynamically variable-frequency clock system to support a power management scheme, which maximizes processor performance within a configured power envelope [27]. Its clock distribution delivers a variable-frequency clock from 100 MHz to 2.5 GHz across a clock network more than 28 mm long. The clock network consists of four stages, as shown in Figure 43.19.

The first stage is the level-0 (L0) route, which connects the PLL to 14 digital frequency dividers (DFDs). The L0 route is the only stage that does not adjust supplies and frequencies during normal operation. The L0 route is 20 mm long, consisting of four 5-mm segments that are 400-mV low-voltage-swing differential routes. Each segment is resistively terminated at the receiver and is tapered to optimize RLC flight time and reduce power consumption. All route segments are matched in composition in both layer and length.

The second stage, the level-1 (L1) route, connects the DFD to 6–10 SLCBs. The DFD output varies in frequency and it operates on a varying core supply voltage. A half-frequency distribution using differential 0° and 90° clocks is used.

The third stage, the level-2 (L2) route, connects the SLCB to LCBs. A typical SLCB drives 400 LCBs at 200 different locations across 3 mm with a skew of less than 6 ps between locations. For this stage, instead of using a grid-based network as in many contemporary designs, a skew-matched RLC tree network technique is employed to reduce metal resources and power. An in-house tool is utilized to route the trees and to match route RLC delays using width and space. The resulting clock route is adaptable to changes in the design, and uses far less metal resources and power than a grid-based design while achieving skews that are nearly as low as in grid-based designs. The LCBs, called clock vernier devices (CVDs), can add 70 ps of delay to any clock in 8-ps increments and are controlled via scan operations. They can facilitate postsilicon debug and remove skew not found in presilicon analysis.

The fourth stage, the postgater route, is in the hands of the individual circuit designers. Clock gaters are designed by the clock team into the library in a variety of sizes. With hundreds of latches per gater, routes up to 2 mm long must be engineered for delay, shielding, and load matching.

[Figure labels: bus clock, PLL, repeaters, Core0, Core1, Foxton, IOs, bus logic, DFD, SLCB, RAD, CVD, gaters, latches; fixed-frequency low-voltage swings; variable-frequency full-rail transitions]

FIGURE 43.19 Clock distribution of the dual-core Itanium 2 microprocessor. (From Mahoney, P., et al., Proc. IEEE Intl. Solid-State Circuits Conf., pp. 292–293, 2005. With permission.)

Montecito implements an active deskewing system that runs continuously to null out offsets caused by process, temperature, and voltage variations across the die. The system relies on a hierarchical collection of phase comparators between the ends of different L2 routes (i.e., only the first three stages are corrected by deskewing). Each SLCB has a 128-bit delay line with 1-ps resolution. With active deskewing and scan-chain adjustments, the total clock-network skew is reduced to less than 10 ps.
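A continuously running, hierarchical deskew loop of this kind can be sketched as follows. The 128 delay settings and the 1-ps resolution come from the text. The parent-child comparison hierarchy, the one-step-per-iteration update, the mid-scale starting code, and the function name are assumptions for illustration and are not taken from Ref. [14].

```python
DELAY_TAPS = 128      # from the text: 128-bit delay line per SLCB
RESOLUTION_PS = 1     # from the text: 1-ps resolution
MID_CODE = DELAY_TAPS // 2


def deskew(intrinsic_ps, parent, iterations=200):
    """Null out skew by stepping each region toward the region it is compared with.

    intrinsic_ps[i] is the arrival time of region i (integer ps) with its delay
    line at mid-scale. parent[i] is the region that region i is compared
    against, or None for the reference region (assumed hierarchy).
    Returns the final delay codes and the remaining skew in ps.
    """
    codes = [MID_CODE] * len(intrinsic_ps)

    def arrival(i):
        return intrinsic_ps[i] + (codes[i] - MID_CODE) * RESOLUTION_PS

    for _ in range(iterations):
        # All comparators sample at the same time, then all regions update.
        sampled = [arrival(i) for i in range(len(codes))]
        for i, p in enumerate(parent):
            if p is None:
                continue
            if sampled[i] > sampled[p]:      # region is late: remove delay
                codes[i] = max(0, codes[i] - 1)
            elif sampled[i] < sampled[p]:    # region is early: add delay
                codes[i] = min(DELAY_TAPS - 1, codes[i] + 1)

    final = [arrival(i) for i in range(len(codes))]
    return codes, max(final) - min(final)
```

In this idealized model with integer offsets, every region settles exactly on its reference. A physical comparator has finite resolution and noise, so a real loop would dither around the lock point rather than hold still.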

REFERENCES

1. N. Bindal and E. Friedman. Challenges in clock distribution networks. In Proc. Intl. Symp. on Phys. Des., Monterey, CA, p. 2, 1999.

2. Q. K. Zhu. High-Speed Clock Network Design. Kluwer Academic, Boston, 2003.
