39 Placement-Driven Synthesis
Design Closure Tool
Charles J. Alpert, Nathaniel Hieter, Arjen Mets,
Ruchir Puri, Lakshmi Reddy, Haoxing Ren,
and Louise Trevillyan
CONTENTS
39.1 Introduction
39.2 Major Phases of Physical Synthesis
39.3 Optimization and Placement Interaction
    39.3.1 Bin-Based Placement Model
    39.3.2 Exact Placement
39.4 Critical Path Optimizations
    39.4.1 Gate Sizing
    39.4.2 Gate Sizing with Multiple-vt Libraries
    39.4.3 Incremental Synthesis
    39.4.4 Advanced Synthesis Techniques
    39.4.5 Fixing Early Paths
    39.4.6 Drivers for Multiple Objectives
39.5 Mechanisms for Recovery
    39.5.1 Area Recovery
    39.5.2 Routing Recovery
    39.5.3 vt Recovery
39.6 Other Considerations
    39.6.1 Hierarchical Design
    39.6.2 High-Performance Clocking
    39.6.3 Power Gating to Reduce Leakage Power
39.7 Into the Future
References
39.1 INTRODUCTION
Much of this book has focused on the components of physical synthesis, such as global placement, detailed placement, buffering, routing, Steiner trees, and congestion estimation. Physical synthesis combines these steps as well as several others to (primarily) perform timing closure. When wire delays were relatively insignificant compared to gate delays, logic synthesis provided a sufficiently accurate picture of the timing of the design. Placement and routing did not need to focus on timing, but were exclusively wirelength driven. Of course, technology trends have transformed physical design: because of wire delays, timing can only be assessed accurately once the design is implemented physically. Physical synthesis is a process that modifies the design so that the impact on timing due to wiring is mitigated. It may move cells, resize logic, buffer nets, and perform local resynthesis.
Besides basic timing closure, there are many newer challenges that the physical synthesis system needs to handle [1]. Some examples include lowering power using a technology library with multiple threshold voltages (vt), fixing noise violations that show up after performing routing, and handling the timing variability and uncertainty introduced by modern design processes.
This chapter surveys IBM’s physical synthesis tool, called placement-driven synthesis (PDS). It builds upon a description of the basics of the tool [2] and also some innovations in turnaround time published in Ref. [3].
39.2 MAJOR PHASES OF PHYSICAL SYNTHESIS
Placement-driven synthesis has hundreds of parameter settings available to the user and can be customized by the designer to run in many ways. For example, there are different degrees of routing congestion mitigation or area recovery available. The user may want to exploit gates with low vt or allow assignment of wires to different routing planes. These choices depend on the nature of the design being closed. Although there is no single PDS algorithm to describe, the following outlines a typical invocation:
1. Netlist preparation. When PDS initializes, of course, the data model needs to be loaded with timing assertions (which encapsulate the timing constraints), user parameters, etc. There also may need to be some scrubbing of the netlist so that optimization is even viable. As examples,
• Gates may need to be sized down so that the total area of the netlist fits within the area of the placeable region.
• Buffers inserted during synthesis may need to be removed so that they do not badly influence placement. A placement algorithm may handle a fanout tree several levels deep differently than the logically equivalent single large net.
• If the clock tree has not yet been built, it may need to be hidden from optimization so that it is not treated as a signal net. Changes to a clocked sequential cell could otherwise cause the timing for every cell in the clock tree to be updated. Before synthesis, an ideal clock with zero skew can be assumed and later replaced with the optimized one.
• Timing information can be extracted from either an unplaced or a previously optimized netlist to generate net weights for the placement step.
2. Global placement. This step is well covered in Chapters 14 through 19. Besides just traditional minimum wirelength optimization, placement needs to address several other types of constraints. For example,
• Density targets direct the placer to not pack cells too tightly in certain areas, so that physical synthesis will have the flexibility to size up cells, insert buffers, etc.
• Designer cell movement constraints are used to enable floorplanning in a flat methodology. By restricting a set of cells to a certain rectangular region, the designer is able to plan that block, while still allowing the tool the flexibility to perform optimizations and placements of the cells within the block.
• Routability directives can be used to improve the routability of the placement, such as artificially inflating the size of cells in routing-congested regions in order to force more spreading [4].
• Clock domain constraints can be considered during placement to reduce clock tree latency and dynamic power consumption. Latches that belong to the same clock domain can be directed to be placed close to each other, either by adding special net weights or by imposing movement constraints on latches.
3. Timing analysis. At every point in the flow, timing analysis is a core component because it provides the evaluation of how well PDS is doing in terms of timing closure. It is run both standalone and incrementally throughout the optimization. For this, IBM’s static timing analysis tool, EinsTimer [5], is used.
4. Electrical correction. After placement, one will certainly find gates that drive loads above the allowed specification and long wires for which the signal exceeds the designer-specified slew rate. A few bad slew rates inevitably cause terrible timing results. At this point, it makes sense to correct the design by fixing local slew and capacitance violations, typically through buffering and gate sizing, thereby getting the design into a reasonably good timing state. One can also employ a logical effort [6] type of approach to improve the global timing characteristics of the design.
5. Placement legalization. Fixing electrical violations may result in thousands of buffers being added to the design, and potentially every gate may be assigned a new gate size, which will create overlaps, causing the placement solution to become illegal. The goal of legalization is to fix these overlaps while providing minimum perturbation to the netlist (Chapter 20).
6. Critical path optimization. Once the design is legal and is in a reasonably good timing state, one can employ all kinds of techniques to try to fix the critical paths. Chapters 26 through 28 discuss powerful buffering techniques. Section 39.5 describes how other optimizations or transforms can also be deployed. A transform is a change to the netlist designed to improve some aspect of the design, for example, breaking apart a complex gate into several smaller, simpler ones. During this phase, incremental timing analysis and legalization may be periodically invoked to keep the design in a legal and consistent state.
7. Compression. Critical path optimization may become stuck at some point, when a certain set of the most critical paths cannot be fixed without manual design intervention (e.g., changes to the floorplan must be made). This is shown in Figure 39.1, where the original timing histogram (Figure 39.1a) is improved by critical path optimizations (Figure 39.1b) until it saturates. However, there still may be thousands of failed timing points that could be fixed with lighter-weight optimizations directed at the not-so-critical regions that still violate timing constraints. The purpose of this phase is to compress the remaining negative portion of the timing histogram to leave as little work as possible for the designer, as shown in Figure 39.1c. As in the critical path optimization phase, incremental timing analysis and legalization must be incorporated where appropriate.
FIGURE 39.1 Timing histogram of (a) an unoptimized design can be improved by (b) critical path optimization and (c) histogram compression.
At this point, a designer could intervene manually or rerun the flow to try to get a better timing-driven placement now that the real timing problems have been identified. One can run a net weighting algorithm (Chapter 21) to drive the next iteration of placement and the entire flow.
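The overall flow above can be summarized in pseudocode. The sketch below is purely illustrative: the function names (prepare_netlist, global_place, and so on) are stand-ins for the phases just described, not actual PDS interfaces.

def physical_synthesis(design, params):
    """Illustrative outline of a typical PDS invocation (not actual PDS code)."""
    prepare_netlist(design, params)        # load assertions, scrub buffers, hide the clock tree
    global_place(design, params)           # wirelength plus density/routability/clock constraints
    timing = analyze_timing(design)        # static timing, rerun incrementally from here on

    electrical_correction(design, timing)  # fix capacitance and slew violations
    legalize(design)                       # remove overlaps with minimum perturbation

    # Critical path optimization: apply transforms until no further progress is made.
    while timing.worst_slack() < 0 and critical_path_optimization(design, timing):
        legalize(design)
        timing.update()

    compress_histogram(design, timing)     # lighter-weight fixes for remaining negative slack
    return design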
In the flow described above, one can make several assumptions to make fast optimization achievable. For example, (1) clocks can be idealized so that one assumes a zero-skew clock will later be inserted, (2) Steiner estimates and one-dimensional extraction can be used for interconnect delay estimation, (3) crosstalk can be ignored, etc. Making these assumptions certainly allows faster runtime than otherwise achievable. In practice, these assumptions are stripped away as the designer makes progress toward timing closure. Once the designer is reasonably happy with the design after running PDS, he or she may then perform clock insertion and a pass of incremental physical synthesis to fix problems resulting from actual clock skews. Similarly, once the design is routed, there could be timing problems caused by scenic routes, along with noise violations from capacitive coupling. The designer can then run incremental physical synthesis in this postrouting environment, using accurate coupling information while also modeling variability for timing.
Because several of the main components of the physical synthesis flow are covered elsewhere in the book, this chapter focuses on aspects that are not covered:
1. Optimization and placement interaction. When optimizations such as buffering or resizing need to make adjustments to the netlist, they cannot happen in a vacuum because they affect the placement. Certain regions may have blockages or be too congested to allow transforms to happen. We explain the communication mechanisms between optimization and placement.
2. Critical path optimizations. Besides buffering, there are numerous techniques one can use to improve the timing along a critical path. Section 39.4 overviews gate sizing and incremental synthesis techniques and the driver/transform model that PDS uses.
3. Recovery mechanisms. During optimization, PDS can cause damage by overfilling local regions, causing routing congestion, etc. Section 39.5 explains how one can apply specialized optimizations for repairing damage so that physical synthesis can continue effectively.
4. Specialized design styles. A typical instance for PDS is a flat ASIC, though customers also utilize it for hierarchical design and for high-performance microprocessors. Section 39.6 explains some of the issues faced by PDS and their solutions for these different types of design styles.
39.3 OPTIMIZATION AND PLACEMENT INTERACTION
During critical-path optimization and compression, optimizations such as buffer insertion, gate sizing, box movement, and logic restructuring may need to add, delete, move, or resize boxes. To estimate the benefit/cost of these transformations accurately, transforms need to generate legal or semilegal locations for these boxes on the fly. Otherwise, boxes may be moved to overcongested locations or even on top of blockages, which later need to be resolved by legalization. Legalization then may move boxes far from their intended locations and undo (at least in part) the benefits of optimization. It could even introduce new problems that need further optimization.
Of course, ideally one would like to compute the exact legal locations for such boxes during optimization, but this often can be too computationally expensive. One strategy PDS uses is to rely on rough legal locations during early optimization (e.g., electrical correction), when substantial changes are made. During later stages of optimization, when smaller or finer changes are made, exact legal locations may be computed. Such a strategy strikes a good balance between quality of results and the runtime of the system.
39.3.1 BIN-BASED PLACEMENT MODEL
PDS uses a synthesis–placement interface (SPI) to manage the estimation or computation of incremental placement. Before optimization, the placement image is divided into a set of regions called bins. Each placeable object in the design is assigned to a bin, and space availability is determined by examining the free space within a bin. The SPI layer manages the interface to an idealized view of the bin structure and provides a rich set of functions to access and manipulate placement data. The SPI layer uses callbacks to keep placement, optimization, and routing data consistent. Instead of computing an exact legal location, newly created or modified logic can be placed in a bin and assigned a coarse-placement location inside the bin. A fast check is performed to make sure that there is enough free space within the bin to accommodate the logic.
The interaction between optimizations and the SPI layer works as follows. Suppose an optimization requests SPI to add or move a box to a specific (x, y) location. SPI gets the bin in which the (x, y) location falls and checks the free space. If there is enough space, then the optimization uses the location specified. If not, the optimization may ask SPI to find the closest bin in which there is space, in which case SPI “spirals” through neighboring bins and returns a valid location, which the optimization can evaluate and choose to use. When a placement is actually assigned, SPI updates bin information to accurately reflect the state of the placement.
Using rough placement may result in boxes placed so that they overlap each other. This is one reason why legalization needs to be called periodically (see, e.g., Ref. [7]). It is important for the optimized design to remain stable, so the legalizer maintains as many preexisting locations as possible and, when a box must move, an attempt is made to disturb the timing of critical paths as little as possible.
As an example, assume the potential area of placed logic inside a bin is 1000 units and that 930 units of cells are already placed within the bin. If one tries to add a new cell of size 90, the SPI interface reports that the bin would become too full (1020) and does not allow the cell to be placed. On the other hand, a cell of size 50 can fit (total area 980), so SPI would permit the transform to place the cell in the bin.
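As a concrete illustration of this bin bookkeeping, the following sketch mimics the free-space test and the “spiral” fallback described above. The class and function names are invented for the example and do not correspond to the actual SPI implementation; the optional virtual-capacity argument anticipates the discussion later in this section.

def ring(bx, by, radius):
    """Bin coordinates at Chebyshev distance `radius` from (bx, by)."""
    if radius == 0:
        return [(bx, by)]
    return [(bx + dx, by + dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)
            if max(abs(dx), abs(dy)) == radius]

class Bin:
    def __init__(self, capacity, used=0.0):
        self.capacity = capacity      # placeable area in the bin, e.g., 1000 units
        self.used = used              # area already occupied, e.g., 930 units

    def can_accept(self, cell_area, virtual_capacity=None):
        limit = virtual_capacity if virtual_capacity is not None else self.capacity
        return self.used + cell_area <= limit

def find_location(bins, bx, by, cell_area, max_radius=5):
    """Return the closest bin (spiraling outward) with room for the cell, or None."""
    for radius in range(max_radius + 1):
        for key in ring(bx, by, radius):
            b = bins.get(key)
            if b is not None and b.can_accept(cell_area):
                b.used += cell_area   # commit the rough placement
                return key
    return None

# With the numbers above: a 90-unit cell is rejected by the requested bin (930 + 90 > 1000)
# and spirals to a neighbor, while a 50-unit cell is accepted where requested.
bins = {(0, 0): Bin(1000, used=930), (0, 1): Bin(1000, used=200)}
print(find_location(bins, 0, 0, 90))   # (0, 1)
print(find_location(bins, 0, 0, 50))   # (0, 0)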
The problem with the bin-based model is that just because the total area allows another cell to be inserted does not mean it actually can be inserted. As a simple example, consider placing three cells of width three into two rows of width five, with a height of one for each cell and each row. The total area of the cells is nine, while the total placeable area is ten, so it would seem that the cells could fit. However, the cells cannot be placed without exceeding the row capacity. In this sense, legalizing cells within a bin so that they all fit is like the NP-complete bin packing problem.
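The row example can be checked mechanically. The tiny snippet below is illustrative only: an area test passes, while a simple first-fit packing attempt (adequate for this small case, though bin packing is hard in general) fails.

def fits_by_area(widths, rows, row_width):
    return sum(widths) <= rows * row_width

def fits_by_packing(widths, rows, row_width):
    # Greedy first-fit into rows; enough to expose the problem in this example.
    space = [row_width] * rows
    for w in sorted(widths, reverse=True):
        for i in range(rows):
            if space[i] >= w:
                space[i] -= w
                break
        else:
            return False
    return True

cells = [3, 3, 3]                                   # three cells of width three
print(fits_by_area(cells, rows=2, row_width=5))     # True: total area 9 <= 10
print(fits_by_packing(cells, rows=2, row_width=5))  # False: each row holds only one such cell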
Consider Figure 39.2, in which Bin A and Bin B have exactly the same set of nine cells, though arranged differently. If one tries to insert a new cell into either bin, SPI would return that there is room in the bin, yet one cannot easily insert it in Bin A while one can in Bin B. It is likely that the fracturing of white space in Bin A will lead to legalization eventually moving a cell into a different bin.
FIGURE 39.2 New cell cannot be inserted into (a) Bin A but can be in (b) Bin B even though both bins contain the same set of cells.
One remedy is to use a virtual bin capacity somewhat smaller than the true capacity, making it more likely that cells will avoid unsolvable bin packing scenarios. In our earlier example, we put the virtual bin capacity at 950. Alternatively, one can allow overfilling of bins (say by 5 percent) to allow transforms to successfully perform optimization, and then rely on powerful legalization techniques like diffusion (Chapter 20) to reduce the likelihood of legalization moving cells far away.
As physical synthesis progresses, the bins are reduced in size. This tends to limit the size of box movements that legalization must do.
39.3.2 EXACT PLACEMENT
The major problem with the bin-based model is that one can never guarantee that the cell really does fit in its bin. One could always construct test cases with cells of strange sizes that break any bin model (or force it to be ultraconservative in preventing cells from being inserted). For example, fixed-area I/Os and decoupling capacitors can contribute to the problem. When it gets too late in the flow, PDS may not be able to recover from big legalization movements that degrade timing. During later stages of the system, when the major optimizations have been completed, finding exact locations for the modified cells provides better overall quality of results with reasonable runtimes.
PDS implements exact legal locations during optimization as follows. The placement subsystem maintains an incremental bit map (imap) to track all location changes and available free space. For example, if a cell is one row high and seven tracks wide, then the seven bits of the imap corresponding to the cell’s location are set to one. If the two tracks next to the cell are empty, their bits are set to zero. When a new or modified box needs to be placed at a desired location, the imap capability essentially works like a hole finder. It tries to locate a hole or an empty slot (within some specified maximum distance from the desired location) large enough to legally place the newly created or modified box.
As with rough locations, the optimization can evaluate and choose to use the exact locations. If this location is used, then the imap data model is updated incrementally. Thus, when timing evaluates the quality of the solution, it knows exactly where the cell will end up. In this model, legalization is not necessary.
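A minimal sketch of such a hole finder over one placement row is given below; the bit-list representation and function names are assumptions for illustration, not the actual imap code.

def find_hole(row_bits, width, desired, max_dist):
    """Return the start track of the free run of `width` zero bits closest to
    `desired` (within `max_dist` tracks), or None if no such hole exists."""
    best = None
    for start in range(len(row_bits) - width + 1):
        if abs(start - desired) > max_dist:
            continue
        if all(b == 0 for b in row_bits[start:start + width]):
            if best is None or abs(start - desired) < abs(best - desired):
                best = start
    return best

def commit(row_bits, start, width):
    """Mark the chosen tracks as occupied."""
    for t in range(start, start + width):
        row_bits[t] = 1

# Example: a row with a seven-track cell on tracks 0-6 and free space after it.
row = [1] * 7 + [0] * 9
slot = find_hole(row, width=7, desired=8, max_dist=4)  # returns 8
if slot is not None:
    commit(row, slot, 7)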
An example of one problem with the imap model occurs when a cell seven tracks wide wants to be placed in a hole that is five tracks wide. To a user, it may be obvious to simply slide the neighboring cell over by two tracks to make room. In general, small local moves like this will have minimal effect on timing and make it more likely that the cells will be placed at their desired locations. In such a case, a list of all the cells that need to be moved to make room for the new/modified box, as well as their new locations, is supplied to the transform. The transform can then evaluate this compound movement of a set of boxes, estimate the benefit/cost, and decide to accept or reject such a movement. The advantages of this approach include more successes in legally placing boxes within some specified maximum distance as well as obtaining legal locations that are generally closer to the desired locations. On the other hand, the transforms may get more complicated as they need to manage and evaluate the movement of multiple, possibly unrelated, cells. It may also cause more churn to the design during the later stages of optimization due to the movement of a significantly larger number of boxes, which may not be directly targeted by the optimizations.
Thus, it is a bit of an art to find the right degree of placement and optimization interaction that trades off accuracy versus runtime. These models are still evolving in PDS today.
39.4 CRITICAL PATH OPTIMIZATIONS
Optimization of critical paths is at the heart of any physical synthesis system. Timing closure is clearly an important goal, but electrical correctness, placement and routing congestion, area, power, wirelength, yield, and signal integrity are also important design characteristics that must be considered and optimized when making incremental changes to the netlist.
Within PDS, there is a large menu of optimizations that can be applied to the design. The sequences of optimizations are packaged for various functions and can be enabled or disabled via system parameters. Optimizations may also be used interactively by designers. The most effective optimizations are generally buffering and gate sizing. As a secondary dimension for optimization, with buffering one can also perform wire sizing, and with gate sizing one can also assign gates to different vt. Because buffering is covered in other chapters, we turn to gate sizing.
39.4.1 GATE SIZING
Gate sizing is responsible for selecting the appropriate drive strength for a logic cell from the functionally equivalent cells available in the technology library. For example, a library may contain a set of ten inverters, each with a characteristic size, power consumption, and drive strength. Upon finding an inverter in the design, it is the task of gate sizing to assign the inverter with the appropriate drive strength to meet design objectives.
When the mapped design comes from logic synthesis, gate sizes have already been assigned based on the best information available at the time. Once the design is placed, Steiner wire estimates can be used to give a more accurate estimation of wire loads, and many of the previous assignments will be found to be suboptimal. Likewise, gate sizes must be reevaluated after global and detailed routing, because wire delays will again have changed.
As discussed earlier, the electrical correction step performs an initial pass over the entire design. Gate sizes are assigned in a table-lookup fashion to fix capacitance and slew violations introduced by the more accurate Steiner wire models. There may be several cells in the library that meet the requirements of a logic cell, so the one with minimal area is chosen. If gate sizing is insufficient to fix the violation, buffering or box movement may be used.
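The minimal-area selection rule can be sketched as follows. The LibCell fields and the crude RC slew estimate are simplifications introduced for the example, not how PDS characterizes library cells.

from dataclasses import dataclass

@dataclass
class LibCell:
    name: str
    area: float
    max_cap: float      # largest load the cell is allowed to drive
    drive_res: float    # effective drive resistance for a crude RC slew estimate

def pick_min_area_fix(load_cap, slew_limit, candidates):
    """Among functionally equivalent cells that satisfy the capacitance limit and the
    slew estimate, return the smallest-area choice; None means gate sizing alone cannot
    fix the violation, so buffering or box movement is needed instead."""
    feasible = [c for c in candidates
                if load_cap <= c.max_cap and c.drive_res * load_cap <= slew_limit]
    return min(feasible, key=lambda c: c.area) if feasible else None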
Later optimizations have the option of modifying these initial gate sizes. If a cell in the design is timing critical, the library cell that results in the best path delay would be chosen, while if the cell already meets its timing requirements, area recovery will pick the cell with the greatest area savings. For critical path optimizations, gate sizing examines a size-sorted window of functional alternatives and evaluates each of them to choose the best library cell. For example, suppose that the current cell is a NAND2_D, and the library has, from smallest to largest, NAND2_A through NAND2_G cells. The program might evaluate the B, C, E, and F levels to see if they are a better fit for the optimization objectives. The size of the window is dynamic and affects both the accuracy of the choice and the runtime of the optimization. Because the design is constantly changing during optimization, it is necessary to periodically revisit the assigned gate sizes and readjust them. This allows revisiting choices, perhaps with different cell windows.
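The windowed search can be sketched as below; the delay evaluation is abstracted into a callback, and the delay numbers in the usage example are made up to match the NAND2 naming above.

def resize_in_window(current, sorted_sizes, window, path_delay):
    """Evaluate library cells within +/- `window` positions of the current size
    and keep whichever gives the best (smallest) path delay."""
    i = sorted_sizes.index(current)
    lo, hi = max(0, i - window), min(len(sorted_sizes), i + window + 1)
    best = min(sorted_sizes[lo:hi], key=path_delay)   # the window includes the current cell
    return best if path_delay(best) < path_delay(current) else current

# Hypothetical library, smallest to largest, with invented path delays (in ns):
sizes = ["NAND2_A", "NAND2_B", "NAND2_C", "NAND2_D", "NAND2_E", "NAND2_F", "NAND2_G"]
delay = {"NAND2_A": 9.1, "NAND2_B": 8.0, "NAND2_C": 7.2, "NAND2_D": 6.9,
         "NAND2_E": 6.7, "NAND2_F": 6.8, "NAND2_G": 7.0}.get
print(resize_in_window("NAND2_D", sizes, window=2, path_delay=delay))  # NAND2_E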
Other algorithms, such as simulated annealing, Lagrangian relaxation (see Chapter 29), or integer programming approaches [8], have been suggested for use in resizing, but they tend to be too slow, given the size of today’s designs and the frequency with which resizing needs to be done. Further, these approaches tend to make gross assumptions about a continuous library, which then needs to be mapped to cells in a discrete library; this mapping may severely distort the quality of the optimization. Also, these methods do not account well for capacitance and slew changes resulting from new power-level assignments, or for the physical placement constraints described above. Gate sizing is important to nearly every facet of optimization. It is used in timing correction, area recovery, electrical correction, yield improvement, and signal-integrity optimization.
39.4.2 GATE SIZING WITH MULTIPLE-VT LIBRARIES
Besides performing timing closure, PDS also manages the total power budget. See Chapter 3 for an overview of the components of power consumption. The contribution of the static power component, or leakage, to the total power number is growing rapidly as geometries shrink.
To account for that, technology foundries have introduced cell libraries with multiple vt. These libraries contain separate cells with the same functionality but with different vt. In the simplest form there are two different thresholds available, commonly called high-vt and low-vt, where vt stands for threshold voltage. Low-vt transistors are faster at the expense of higher leakage power and less noise immunity. In practice, there is a limit to the number of different vt in the library because each vt introduces an additional mask in the fabrication process.
Multi-vt libraries enable synthesis to select not only the appropriate gate size but also the appropriate vt for each cell. Cells on a timing-critical path can be assigned a lower vt to speed up the design. Cells that are not timing critical do not need the performance of a high-leakage cell and can use the slower and less leaky versions. In general, one prefers not to use low-vt cells at all unless they are absolutely necessary to meet high-performance timing constraints.
During vt assignment, PDS simply collects all critical gates and sorts them based on their criticality. vt assignment then proceeds by lowering the vt on the cells, starting with the most critical cell first. The algorithm honors designer-supplied leakage limits by incrementally computing the leakage current in the design.
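A minimal sketch of this criticality-ordered assignment under a leakage budget follows; the per-gate dictionary fields and units are placeholders, not PDS data structures.

def assign_low_vt(gates, leakage_limit):
    """Lower vt on the most critical gates first, stopping short of the
    designer-supplied leakage limit, and return the resulting total leakage."""
    total = sum(g["leakage"] for g in gates)
    for g in sorted(gates, key=lambda g: g["slack"]):   # most negative slack first
        if g["slack"] >= 0:
            break                                       # remaining gates are not critical
        delta = g["low_vt_leakage"] - g["leakage"]      # extra leakage from the low-vt swap
        if total + delta > leakage_limit:
            continue                                    # honor the leakage budget
        g["vt"], total = "low", total + delta
    return total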
In general, multiple-threshold libraries are designed such that the low-vt equivalent of each cell has the same area and cell image as the high-vt cell. This makes the multiple voltage threshold optimization a transformation that does not disturb the placement of a design.
Because the input capacitance of a low-vt cell is slightly higher than that of a corresponding high-vt cell, resizing the cells after threshold optimization can yield further improvement in performance. The impact of multiple-vt optimization on the power/performance trade-off depends on the distribution of slack across the logic. Designs with narrow critical regions can yield significant performance improvements with little effect on leakage power. The performance boost obtained from using low-vt cells is significant, making it one of the more powerful tools PDS has to fix critical paths.
39.4.3 INCREMENTAL SYNTHESIS
Besides buffering and gate sizing, many other techniques can be applied to improve critical paths. Techniques from logic synthesis, modified to take placement and routing into account, can at times be very effective. Even though the design comes from logic synthesis optimized for timing, the changes caused by placement, gate sizing, buffering, etc. may disrupt the original timing and may create an opportunity for these optimizations to be effective in correcting a path that may not be fixable otherwise.
• Cell movement: In general, the next most effective optimization technique is cell movement. One can move cells to not only improve timing, but also minimize wirelength, reduce placement congestion, or balance pipeline stages. For critical path optimization, a simple yet effective approach is to find a box on a critical path and try to move it to a better location that improves timing.
• Cloning: Instead of sizing up a cell to drive a net with a fairly high load, one could copy the cell and partition the sinks of the original output nets among the copies. Figure 39.3 shows an example where four sinks are driven by two identical gates after cloning. Cloning can also improve wirelengths and wiring congestion.
• Pin swapping: Pin swapping takes advantage of cells where the input-to-output timings differ by pin. As an example, for a 4-input NAND with inputs A, B, C, and D and output Z, the delay from A to Z could be less than the delay from D to Z. Some cells may be architected so that this behavior is intentional. By swapping a timing-critical signal at pin D with a noncritical signal at pin A, one can obtain a timing improvement. More generally, when one has a fan-in tree as in Figure 39.4, commutative pins can also be swapped, so that the slowest net can be moved forward in the tree. Like cloning, pin swapping can also be used to improve wirelength and decrease wiring congestion.
FIGURE 39.3 Cloning. (From Trevillyan et al., IEEE Design and Test of Computers, pp. 14–22, 2004. With permission.)
FIGURE 39.4 Pin swapping.
• Inverter processing: In a rich standard-cell library, complement and dual-complement cells are available for many functions. Timing or area can be improved by manipulating inverters. For example, Figure 39.5 shows an example of an INVERT-NAND sequence being replaced with a NOR-INVERT sequence. Other examples include changing an AND-INVERT sequence to a NAND or an AND-AND-INVERT into an OR. Inverter processing may remove an inverter, add an inverter, or require an inverter to be moved to another sink.
• Cell expansion: The cell library may contain “complex” multilevel functions, such as AND-OR, XOR, MUX, or other less-well-defined cells. These cells normally save space, but can be slower than a breakdown into equivalent single-level cells (NAND, NOR, INVERT, etc.). Cell expansion breaks these cells apart into their components; for example, Figure 39.6 shows an XOR gate decomposed into three AND gates and two inverters.
• Off-path resizing: As discussed earlier, gate sizing is a core technique for optimization of gates on a critical path. However, one can also attempt to reduce the load driven by these gates by reducing the size of noncritical sink cells, as shown in Figure 39.7. The smaller