FIGURE 39.5 Inverter processing.
FIGURE 39.6 Cell expansion.
FIGURE 39.7 Off-path resizing.
FIGURE 39.8 Shattering.
cells generally present lower pin capacitances, and so may improve the delay on a timing-critical net, though this could hurt the delay of another path. The timing analyzer and the optimization metric are the arbiters of whether the optimization suggestion is accepted by PDS. When correcting hold violations (short paths), the off-path cells can be powered up to present higher pin capacitance and slow down a path.
• Shattering: Similar to cell expansion, a large fan-in cell can be decomposed into a tree of smaller cells. This may allow the most critical path to move ahead to a faster, smaller cell. Figure 39.8 shows how the delay through pins A and B of a five-input AND gate can be reduced by shattering the gate into three NAND gates, so that A and B only need to propagate through a cell with less complexity. Merging, the opposite of shattering, can also be an effective timing optimization. A rule of thumb is that merging is good when the slacks at the inputs of a tree are similar, and shattering is good when there is a wider distribution of slacks.
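The rule of thumb above can be captured in a few lines (an illustrative Python sketch; the slack units and the spread threshold are hypothetical, not taken from PDS):

```python
def suggest_restructuring(input_slacks, spread_threshold=30.0):
    """Merge when the slacks at the inputs of a tree are similar;
    shatter when there is a wide distribution of slacks."""
    spread = max(input_slacks) - min(input_slacks)
    return "shatter" if spread > spread_threshold else "merge"

# Five-input AND whose pins A and B are far more critical than the rest:
suggest_restructuring([-40.0, -35.0, 10.0, 15.0, 20.0])  # wide spread -> "shatter"
# All inputs arrive with similar slack:
suggest_restructuring([5.0, 7.0, 6.0, 8.0, 5.0])         # -> "merge"
```

In the wide-spread case, the two critical pins can be moved ahead to a faster, smaller cell near the tree output, as in Figure 39.8.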
Note that the optimizations are atomic actions and are synergistic. Optimizations can call other optimizations. For example, a box could be shattered, then pin-swapped, and finally resized, and the new solution could then be accepted or rejected based on the combined effect of these optimizations.
39.4.4 ADVANCED SYNTHESIS TECHNIQUES
The descriptions of some of the above incremental synthesis optimizations are deceptively simple. For example, Figure 39.6 shows an XOR decomposed as two inverters and three NAND gates. It could also be implemented as two inverters, two ANDs, and an OR; two inverters, one OR, and two NANDs; or three inverters, an AND, and two NANDs; etc. An optimization like cell expansion examines several decompositions based on rules of thumb, but does not explore the expansion possibilities in any systematic way.
Another way of accomplishing cell expansion and many of the other optimizations is through logic restructuring [9], which provides a systematic way of looking at functional implementations. In this method, seed cells are chosen and a fan-in- and depth-limited cone is examined for reimplementation to achieve timing or area goals. The seed box and its restricted cone of logic are represented as a binary decision diagram (BDD) [10]. This provides a canonical form from which different logic structures can be implicitly enumerated and evaluated. When a new structure is chosen, it is implemented based on the primitives available in the library, and the new cells are placed and sized. The restructuring process can be thought of as a new technology mapping of the selected cone. Advanced synthesis techniques can be computationally intensive because, for a large cone, the number of potential implementations can be huge. The fan-in and depth constraints must be chosen so as to balance design quality with runtime. However, these techniques are quite effective and are especially useful for high-performance microprocessor blocks, which typically are small yet have very aggressive timing constraints.
39.4.5 FIXING EARLY PATHS
Timing closure consists of correcting both long (late-mode) and short (early-mode) paths. The delay of long paths must be decreased because the signal is arriving at the register a cycle too late, while the delay of short paths must be increased because the signal is arriving a cycle too early. The strategy we use in PDS is to correct the long paths without consideration of the short paths, then do short-path correction as a postprocess in such a way as to lengthen the paths without causing a long path to become more critical. This can be tricky because all the boxes along a short path can be intertwined with a long path.
Doing short-path correction requires that (at least) two timing models be active: early-mode timing tests are done with a slow clock and fast data, while late-mode tests are done with a fast clock and slow data. The presence of two timing models enables correction of the early-mode paths while minimizing any adverse effects on the late-mode paths. In PDS, short-path correction is done very late in the process, after routing and with SPICE extraction.
The premier way of correcting short paths is by adding delay pads (similar to buffers) along the path to slow it down. In some cases, short-path nets can be reconnected to existing buffers (added for electrical violations or long-path correction) to slow down the path. This can correct the path without incurring the area overhead of a new pad. As noted above, resizing to a slower cell or powering up side-path cells can also be used for early-mode correction.
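A minimal sketch of the pad-insertion idea (illustrative Python; in PDS this is done after routing with SPICE-extracted delays, and a pad's early- and late-mode delay contributions would differ in practice):

```python
def pad_early_path(early_slack, late_slack, pad_delay):
    """Add delay pads to a short path until its early-mode slack is
    nonnegative, stopping before the late-mode slack would go negative.
    Assumes each pad adds the same pad_delay in both timing modes."""
    pads = 0
    while early_slack < 0 and late_slack - pad_delay >= 0:
        early_slack += pad_delay
        late_slack -= pad_delay
        pads += 1
    return pads, early_slack, late_slack

pad_early_path(-25.0, 100.0, 10.0)  # -> (3, 5.0, 70.0)
```

Three pads lift the early-mode slack to +5 while leaving a comfortable +70 of late-mode slack, so the long-path timing does not become more critical.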
39.4.6 DRIVERS FOR MULTIPLE OBJECTIVES
The previous discussion treated transforms primarily in the context of improving timing. However, other objectives like wirelength, routing congestion, or placement congestion can be addressed by the same set of optimizations or transforms.
To facilitate the use of transforms for multiple objectives, PDS employs a driver-transform paradigm. Programs are split into local transforms and drivers. The transforms are responsible for the actual manipulation of the logic, for example, adding a buffer, or moving or resizing a cell. The driver is responsible for determining the sections of logic that need to be optimized. If the optimization goal is electrical correction, the driver will pick cells that violate slew or capacitance limits; if the goal is timing, it will pick cells that lie on the critical paths; if the goal is area reduction, the driver will choose cells in the noncritical region, where slack can be sacrificed for area. The transforms understand their goals (e.g., whether they should be trying to save area or time) and adjust their actions accordingly.
The drivers are also responsible for determining which transforms should be applied in what order. Given a list of available optimizations, the driver may ask for evaluations of the characteristics of applying each transform, then choose the order of application based on a cost/benefit analysis (in terms of timing, area, power, etc.). A driver may also leave it to the transform to decide when to apply itself, in which case the order of the transform list given to the driver becomes quite important. There are a variety of drivers available in PDS:
• The most commonly used one is the critical driver, which picks a group of pins with negative slack and sends the pins to the transforms for evaluation. Because transforms can interact, the critical driver iterates both on its current list and, when no more can be done on that list, on successive sets of lists. To conserve runtime, it "remembers" the transforms that have been tried and does not retry failed attempts.
• The correction driver is used to filter nets that violate their capacitance or slew constraints, which can then be paired with a transform designed to fix these violations.
• Levelized drivers present the design in input-to-output or output-to-input order, and are useful in areas like global resizing, where it is desirable to resize all of the sink cells before considering the driving cell.
• There is a randomized driver that provides pins in a random order so that an optimization that relies on the order of pins may discover alternate solutions.
• The histo driver is used in the compression phase to divide all the failing paths into slack ranges and then work iteratively on each range.
• Of special importance is the list driver, which simply provides a predetermined list of cells or nets for the transform to optimize. This enables the designer to select specific pieces of the design for optimization while in an interactive viewing session. The designer's selection is made into a list of objects to be processed by the list driver.
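The driver-transform split can be sketched as follows (an illustrative Python skeleton; the dictionary-based cell and transform interfaces are assumptions made for the sketch, not the PDS API):

```python
def run_driver(cells, transforms, goal):
    """Driver skeleton: choose targets for the goal, rank the
    transforms by estimated benefit, and remember failed attempts."""
    # The driver picks the sections of logic that need optimizing.
    if goal == "electrical":
        targets = [c for c in cells if c["slew_viol"] or c["cap_viol"]]
    elif goal == "timing":
        targets = [c for c in cells if c["slack"] < 0]
    else:  # "area": work in the noncritical region, sacrificing slack
        targets = [c for c in cells if c["slack"] >= 0]
    tried = set()  # failed (cell, transform) pairs are not retried
    for cell in targets:
        # Cost/benefit ordering: evaluate each transform, best first.
        ranked = sorted(transforms,
                        key=lambda t: t["benefit"](cell, goal),
                        reverse=True)
        for t in ranked:
            key = (cell["name"], t["name"])
            if key in tried:
                continue
            if not t["apply"](cell, goal):  # transform rejected the move
                tried.add(key)
    return [c["name"] for c in targets]

# Hypothetical sample data: one timing-critical cell, one slew violator.
cells = [{"name": "u1", "slack": -10.0, "slew_viol": False, "cap_viol": False},
         {"name": "u2", "slack": 5.0, "slew_viol": True, "cap_viol": False}]
transforms = [{"name": "resize", "benefit": lambda c, g: 1.0,
               "apply": lambda c, g: True}]
```

With these inputs, a timing-goal run targets only the negative-slack cell, while an electrical-correction run targets only the violator, mirroring the selection rules described above.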
In summary, PDS contains a large number of atomic transformations and a variety of drivers that can invoke them. This yields a flexible and robust set of optimizations that can be included in a fixed-sequence script or can be used directly by designers.
39.5 MECHANISMS FOR RECOVERY
During the PDS flow, optimizations may occur that damage the design. Local regions could become overfull, legalization could slide critical cells far away, unnecessary wiring could be introduced, etc. This is inevitable in such a complex system. Thus, a key component of PDS is its ability to gracefully recover from such conditions. We now overview a few recovery techniques.
39.5.1 AREA RECOVERY
The total area used by the design is one of the key metrics in physical synthesis. The process of reducing area is known as area recovery. The goal of area recovery is to rework the structure of the design so as to use less area without sacrificing timing quality; this contrasts with other work in area reduction, which makes more far-reaching design changes [11] or changes logic cell or IP designs to be more area efficient [12].
Aside from the obvious benefits of allowing the design to fit on the die, or of actually reducing die size, reduction in area also contributes to better routability, lower power, and better yield. It is especially useful as a recovery mechanism because it can create placement space in congested areas that other optimizations can then exploit. Recall the previously discussed SPI bin model: for a bin of size 1000, if area recovery can reduce the used cell area from 930 to 800, this increases the free space available for other cells (such as buffers) from 70 to 200.
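The bin arithmetic of this example is simply:

```python
def bin_free_space(bin_size, used_area, recovered=0):
    """Free space in an SPI bin, optionally after area recovery."""
    return bin_size - (used_area - recovered)

bin_free_space(1000, 930)       # before recovery: 70
bin_free_space(1000, 930, 130)  # after recovering 130 units of cell area: 200
```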
When a design comes into physical synthesis from logic synthesis, the design has normally been optimized with a simplified timing model (e.g., constant delay or wireload). Once the design is placed, routes can be modeled using Steiner estimates or actual global or detailed routes. As more accurate information is known about the real delays due to wires, logic can be restructured to reduce area without impacting timing closure or other design goals.
For example, for most designs, a plurality of the nets will be two-pin nets. A wireload model will give the same delay for every two-pin net. Obviously, this is a very gross estimate, as some such nets may be only a few tracks long, while others could span the entire chip. Paths with shorter-than-average nets may have seemed critical during logic synthesis but, once the design is placed and routed, are noncritical, while paths with longer-than-average nets may be more critical than predicted during logic synthesis.
PDS timing optimizations can also create a need for area recovery when there are multiple intersecting critical paths. For example, in Figure 39.9, PDS will first optimize B because its slack is more critical than A's. Using gate sizing, PDS may change B to the larger B′ and thereby improve its slack from −15 to +20. Next it will optimize A, improving its slack from −10 to +10 and also taking more area. This may improve the slack at B′ to +30. Area reduction might then be applied to change B′ to B″, reducing both area and slack.
A good strategy is to have area recovery work on the non-timing-critical paths of the design and give up delay to reduce area. Normally, the noncritical regions constitute a huge percentage (80 percent or more) of the gates in the design, so care must be taken to use very efficient algorithms in this domain. In addition to being careful about timing, area reduction optimizations must take care not to disturb other design goals, such as electrical correctness and signal integrity.
FIGURE 39.9 Physical synthesis creates opportunities for area recovery.
By far, the most effective method of reducing area is sizing down cells. Aside from being extremely effective, this method has the advantages of being nondestructive to placement (because the new cell will fit where the old one was) and minimally disruptive to wiring (the pins might move slightly). Care must be taken when reducing the size of drivers of nets so that they do not become crosstalk victims (Chapter 34). When optimizing without coupling information, noise can
be captured to the first order by a slew constraint. As a rule of thumb, Ref. [13] recommends that drivers of nets in the slack range of zero to 15 percent of the clock period not be reduced in size.
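This rule of thumb might be coded as follows (illustrative Python; the threshold follows Ref. [13], while excluding negative-slack nets is an added assumption, since critical nets are not downsizing candidates in any case):

```python
def may_downsize_driver(slack, clock_period):
    """A net's driver may be downsized only if the net's slack is above
    15 percent of the clock period: critical nets (negative slack) and
    nets in the noise-sensitive 0-15 percent band are left alone to
    avoid creating crosstalk victims."""
    return slack > 0.15 * clock_period

may_downsize_driver(200.0, 1000.0)  # well clear of the band -> True
may_downsize_driver(100.0, 1000.0)  # inside the 0-15% band -> False
```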
Because timing optimizations nearly always add area, a good rule of thumb is that area reduction techniques are the reverse of timing optimizations. So area reduction can remove buffers, inverter pairs, or hold-violation pads so long as timing and electrical correctness goals are preserved [14]. It can force the use of multilevel cells (such as XORs and AOs), which are normally smaller and slower than equivalent single-level implementations. If appropriate library functions are available, it can manipulate inverters to, for example, change NANDs to ANDs, ORs, or NORs, if the new configuration is smaller and maintains other design goals.
These types of optimizations may be applied locally in a pattern-matching kind of paradigm. For example, each noncritical buffer in a design could be examined to determine whether it can be removed. Another, more general, approach would be to simultaneously apply area-reduction techniques through re-covering a section of logic to produce a different selection of technology cells. In this context, re-covering involves a new technology mapping for the selected logic with an emphasis on area, and is more frequently used in logic synthesis, as the placement and routing aspects of physical synthesis make this technique extremely complex. Some success at re-covering small, fairly shallow (two to four levels) sections of logic has been reported [15].
A useful adjunct to area reduction is yield optimization. Overall critical-area-analysis (CAA) yield scores [16] can be improved by considering individual CAA scores for the library cells and using these as part of the area reduction scheme. For example, suppose a transform wants to reduce the size of a particular cell. Two functionally identical cells may be of the same size, and either could be used in the context of the cell to be downsized. However, one may have a better CAA score than the other (though with slightly different auxiliary characteristics like delay and input capacitance), so the better scoring cell should be used. Of course, area reduction generally improves CAA scores by reducing the active area of the design.
In some cases, it is desirable to apply area reduction even in the critical-slack range. When a design, or part of a design, is placement congested, it is sometimes a good strategy to sacrifice some negative-slack paths by making them slower but smaller, to create room to improve paths with even worse slack. Again, resizing is a good example. Suppose a path has a slack of −50 and it would be desirable to upsize a cell on the path, but there is no room to do so. Downsizing a cell in the neighborhood, degrading its slack from −2 to −4, may make sense as long as the loss from the downsizing is less than the gain from the upsizing. Typically, this kind of trade-off is made early in the physical synthesis process.
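A sketch of the acceptance test for this kind of room-making swap (illustrative Python; PDS's actual cost function is richer than a worst-slack comparison):

```python
def accept_room_making_swap(critical_before, critical_after,
                            neighbor_before, neighbor_after):
    """Accept downsizing a neighbor to make room for an upsizing on the
    critical path only if the worst slack in the region improves."""
    return (min(critical_after, neighbor_after)
            > min(critical_before, neighbor_before))

# The critical path improves from -50 to -40 while the downsized
# neighbor degrades from -2 to -4: a net win for the worst slack.
accept_room_making_swap(-50.0, -40.0, -2.0, -4.0)  # -> True
```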
The effectiveness of area recovery is very dependent on the characteristics of the design, on the logic synthesis tool used to create it, and on the options used for the tool. Reductions in area of around 5 percent are typical, but reductions in excess of 20 percent have been observed.
39.5.2 ROUTING RECOVERY
Total wirelength and routing congestion can also be recovered. Damage to wirelength can be caused by legalization, buffering, or timing-driven placement. For example, when one first buffers a net, it may use a timing-driven Steiner topology (Chapter 25). Later, when one discovers that this net is not critical and meets its timing constraint, it can be rebuffered with a minimum Steiner tree (Chapter 24) to reduce the overall wirelength.
PDS has a function that rebuilds all trees with positive slack and sufficiently high windage, defined as follows. A net with k − 1 buffers is divided into k trees. Let Tk be the sum of the minimum Steiner wirelengths of these k trees. Let T0 be the wirelength of the minimum Steiner tree with all the buffers removed. Windage is the value of Tk − T0. Nets with high windage are potentially good candidates for wirelength reduction through alternative buffer placement.
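The windage computation can be sketched as follows (illustrative Python; the half-perimeter bounding box stands in for a real minimum Steiner tree estimator):

```python
def hpwl(pins):
    """Half-perimeter wirelength of a pin set: a cheap stand-in for a
    minimum Steiner tree length estimate."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def windage(subtrees, net_pins):
    """Windage of a buffered net: Tk (sum of the k subtree lengths, for
    a net divided into k trees by k - 1 buffers) minus T0 (the length
    of one tree over the net's pins with the buffers removed)."""
    t_k = sum(hpwl(t) for t in subtrees)
    t_0 = hpwl(net_pins)
    return t_k - t_0

# A buffer at (5, 8) detours a two-pin net between (0, 0) and (10, 0):
windage([[(0, 0), (5, 8)], [(5, 8), (10, 0)]], [(0, 0), (10, 0)])  # -> 16
```

The detour makes Tk = 26 against a buffer-free T0 of 10, flagging the net as a good candidate for rebuffering along a shorter topology.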
One can also deploy techniques to mitigate routing congestion. A buffer tree that goes through a routing-congested region likely cannot be rerouted easily unless one also replaces the buffers. Smaller spacing between buffers reduces the flexibility of routing, so these problems must be handled before routing is required. PDS has a function that identifies buffer trees in routing-congested regions and rebuilds them so that they avoid the congested routing resources, using algorithms described in Chapter 28. Routing congestion can also be mitigated by spreading the placement using diffusion [17].
Of course, wiring congestion can occur independently of buffers. As noted earlier, PDS has programs that will reduce wirelength by moving boxes (also using the windage model) and by pin swapping within fan-in trees.
39.5.3 VT RECOVERY
As explained previously, for multi-vt libraries, trade-offs among vt levels can have significant impacts on leakage power and delay. In some instances, low-vt cells may have been used to speed up the design, but subsequent optimization may have made the use of low-vt unnecessary. In terms of Figure 39.9, it could be that the change from B to B′ was actually a vt reassignment, in which B was a high-vt cell while B′ was low-vt. Once A has been changed to further improve timing, it may be possible to change B′ back to a higher-vt cell to reduce power.
In fact, a reasonable strategy for timing closure is to use low-vt cells very aggressively to close on timing, even though this will likely completely explode the power budget. Then, vt recovery techniques can attempt to reduce power as much as possible while maintaining timing closure.
39.6 OTHER CONSIDERATIONS
This chapter focuses primarily on physical synthesis in the context of a typical flat ASIC design style. However, PDS is also used to drive timing closure for hierarchical designs and for designing the sub-blocks of high-performance microprocessors. We now discuss some issues and special handling required to drive physical synthesis in these regimes.
39.6.1 HIERARCHICAL DESIGN
Engineers have been employing hierarchical design since the advent of hardware description languages. What has changed over the years is the degree to which the hierarchy is maintained throughout the design automation flow. The global nature of optimizations like placement, buffering, and timing means it is certainly simpler for PDS to handle a flat design. However, PDS is just one consideration for designers in deciding whether to design flat or hierarchically. Despite the simplicity of flat design, as of this writing, hierarchical design is becoming more prevalent. There are several reasons for this:
• Design size: The available memory in hardware may be insufficient to properly model the entire design. Although hardware performance may also be an issue, it can often be mitigated through various thread-parallel techniques.
• Schedule flexibility: The design begins naturally partitioned along functional boundaries. A large project, employing several engineers, will not be finished all at once. Hierarchical design allows for disparate schedules among the various partitions and design teams. This is especially true for microprocessor designs.
• Managing risk: Engineers cannot afford to generate a great deal of VHDL and then simply walk away. In some cases, the logic design process is highly interactive. The design automation tools must successfully cope with an ever-changing netlist in which logic changes may arrive very late in the schedule. By partitioning the design, it is possible to limit the impact of these changes, protecting the large investment required to reach the current design state.
• Design reuse: It is common to see the same logic function replicated many times across the design. In a fully automated methodology, one can uniquely construct and optimize each instance of this logic. If the uses are expanded uniquely, each use can be optimized in the context in which it is used. If the physical implementation is reused, then the block must be optimized so that it works in all of its contexts simultaneously, which is a more challenging task. However, common practice shows that even the so-called fully automated methodologies require a fair amount of human intervention. Although reuse does present more complexity, there is a point (a number of instances) for every design at which the benefit of implementing the logic just once outweighs the added complexity.
After choosing a hierarchical design automation methodology, the single most important decision impacting physical synthesis is the manner in which the design is partitioned. One may work within the boundaries implied by the logic design, or instead one may completely or partially flatten the design and allow the tools to draw their own boundaries. The first choice, working within the confines of the logic design, is still the most common use of hierarchy.
Hierarchical processing based on logical partitioning involves getting a leaf set of logical partitions (perhaps using node reduction as described below), then using those partitions as physical entities, which are floorplanned. In this sense, the quality of the logical partitioning is defined by the quality of the corresponding physical and timing partitioning, which in turn directly affects the difficulty of the problem presented to PDS.
But this is a source of conflict in developing the design. From a functional point of view, for example, the designer might develop a random-logic block that describes the control flow for some large section of dataflow logic. This is a good functional decomposition and is probably good for simulation, but it may not be good physically, because in reality one would not want the control to be segmented in a predefined area by itself, but would want it to be interspersed among the dataflow blocks.
The distribution of function within the logical hierarchy may make it impossible to successfully execute physical synthesis. Attributes of an optimal physical partitioning include a partition placement and boundary pin assignment that construct relatively short paths between the partitions. Attributes of an optimal timing partitioning include paths that do not go in and out of several partitions before being captured by a sequential element, with the signals being launched or captured logically close to the hierarchical boundaries.
An effective partitioning also includes a distribution of chip resources. The first step is to reduce the number of hierarchical nodes by collapsing the design hierarchy. Collapsing the design hierarchy removes hierarchical boundaries that would constrain PDS. In practice, this node reduction is limited only by the performance of the available tools. It is possible (even probable) that some logic functions get promoted all the way to the top level if they interface with multiple partitions. In our earlier example, the control flow logic partition would be a good candidate to promote to the top level so its logic could be distributed as needed. As noted above, one of the motivating factors for doing hierarchical design is to manage risk by limiting the impact on the design of logic changes to the logical partition. Collapsing nodes can reduce this advantage of hierarchy, so there is again a conflict between obtaining a good physical representation and maintaining the logic hierarchy for engineering changes. The next step, floorplanning, is to assign space on the chip image to each partition while reserving some space for top-level logic. These two steps, although guided by automated analysis, usually require a fair amount of human intervention.
To run PDS on a partition out of the context of the rest of the design hierarchy, sufficient detail regarding the hierarchical boundaries must be provided. The floorplanning steps specify the outline of the hierarchical boundary. What remains is the determination of the location of the pins on the hierarchical boundary and their timing characteristics. These details are best determined by viewing the design hierarchy as "virtually flat" and performing placement and timing analysis.
A virtually flat placement simultaneously places all partitions, allowing hierarchical boundary pins to float, while constraining the contents of each partition to adhere to the floorplan. The hierarchical boundary pins are then placed at the intersection of the hierarchical net route with the outline of the partition. The timing details for hierarchical boundary pins can be calculated by constructing a flat timing graph for the hierarchical design. Once the hierarchical boundary paths have been timed, the arrival and required arrival times should be adjusted by apportioning the slack.
This process of slack apportionment involves examining a timing path that crosses hierarchical boundaries and determining what portion of that path may be improved through physical synthesis. To perfectly solve this problem, the slack apportionment algorithm would have to encompass the entire knowledge base of the optimization suite. Because this is impractical, one must rely upon simple heuristics. The elastic delay of a particular element in a hierarchical path can be modeled as a simple weight applied against the actual delay. If it is known that a portion of the design will not be changing much, one would assert a very low elasticity. In the case of a static random access memory (SRAM) or core, a zero elasticity would be used. Once the elastic delay along the hierarchical path is determined, the slack is apportioned between the partitions based upon the relative amount of elastic delay contained within each partition.
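The apportionment heuristic described above can be sketched as follows (illustrative Python; the partition names, delays, and elasticity weights are hypothetical):

```python
def apportion_slack(path_slack, delays, elasticities):
    """Apportion a cross-hierarchy path's slack among partitions in
    proportion to each partition's elastic delay, where elastic delay =
    actual delay * elasticity weight (zero for an SRAM or hard core
    that cannot be changed)."""
    elastic = {p: delays[p] * elasticities[p] for p in delays}
    total = sum(elastic.values())
    if total == 0:
        return {p: 0.0 for p in delays}
    return {p: path_slack * e / total for p, e in elastic.items()}

# Two soft partitions contribute 200 and 100 units of delay; the SRAM
# gets zero elasticity, so the -30 slack is split 2:1 between them.
apportion_slack(-30.0,
                {"blk1": 200.0, "blk2": 100.0, "sram": 400.0},
                {"blk1": 1.0, "blk2": 1.0, "sram": 0.0})
```

Each partition then carries its share of the negative slack as an assertion on its boundary pins for out-of-context optimization.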
In addition to timing, capacitance and slew values are apportioned to the hierarchical pins. The resulting hierarchical boundary pin placements and timing assertions allow physical synthesis to be executed on each partition individually.
Once all of the blocks have been processed out of context, all of the sequentially terminated paths within a block have been fully optimized, but there still may be some improvement needed on cross-hierarchy paths.
In Figure 39.10, consider the path between sequential elements S1 and S2. Two cells on the path are in block 1 and three cells are in block 2. There is a global net between them going from block pin 1 (BP1) to block pin 2 (BP2). There are timing and other assertions on BP1 and BP2 that were developed during the apportionment phase. Out-of-context optimization on block 1 and block 2 may have made these assertions incorrect. At this point, one wants to reoptimize this path in a virtually flat way by traversing the path hierarchically and applying optimization along it with accurate (nonapportioned) timing.
Note that no additional optimization needs to be done on the logic cloud between sequentials
S0 and S1 because there was no timing approximation needed during out-of-context optimization.
FIGURE 39.10 Hierarchical design example.
Further, when the hierarchical optimization is done on the S1-to-S2 path, no timing information is needed for the logic between S0 and S1. Eliding the timing on such paths reduces CPU time and the memory footprint needed for hierarchical processing.
Again referring to Figure 39.10, top-level optimization may be performed to buffer the net from BP1 to BP2.
39.6.2 HIGH-PERFORMANCE CLOCKING
In microprocessor designs, clock frequencies are significantly higher than for ASICs, and the transistor counts are large as well. Thus, the global clock distribution can contribute up to 50 percent of the total active power in high-performance multigigahertz designs. In a well-designed balanced clock tree, most of the power is consumed at the last level of the tree, that is, the final stage of the tree that drives the latches.
The overall clock power can be significantly reduced by constraining each latch to be as physically close as possible to the local clock buffer (LCB) that drives it. Figure 39.11 shows this clustering of latches around the LCB. One might think that constraining latches in this manner could hurt performance because latches may not be ideally placed. However, generally, there is an LCB fairly close to a latch's ideal location, which means the latch does not have to be moved far to be placed next to an LCB. Further, there can be a positive timing effect, because skew is reduced when all the latches are clustered around local clock buffers (as shown in Figure 39.12).
Savings in power are obtained as a result of the reduction in the wire load being driven by the clock buffer. We have found empirically that clustering latches in this manner reduces the capacitive load on the LCB by up to 40 percent, compared to unconstrained latch placement; this directly translates into power savings for the local clock buffer.
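A greedy sketch of the clustering step (illustrative Python; the per-LCB capacity limit and the Manhattan-distance metric are assumptions made for the sketch):

```python
def cluster_latches(latches, lcbs, capacity):
    """Assign each latch to the nearest LCB that still has capacity, so
    every latch sits physically close to its clock buffer and the wire
    load on each LCB stays small."""
    load = {name: 0 for name, _ in lcbs}
    assignment = {}
    for latch, (lx, ly) in latches:
        # Try LCBs in order of increasing Manhattan distance.
        for lcb, _ in sorted(
                lcbs, key=lambda b: abs(b[1][0] - lx) + abs(b[1][1] - ly)):
            if load[lcb] < capacity:
                assignment[latch] = lcb
                load[lcb] += 1
                break
    return assignment

cluster_latches([("L1", (0, 0)), ("L2", (1, 0)), ("L3", (9, 9))],
                [("B1", (0, 0)), ("B2", (10, 10))], capacity=2)
```

Here L1 and L2 cluster around B1 while L3 goes to the nearer B2, mirroring the layout sketched in Figure 39.12.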
39.6.3 POWER GATING TO REDUCE LEAKAGE POWER
The exponential increase in leakage power has been one of the most challenging issues in sub-90 nm CMOS technologies. Power gating is one of the most effective techniques to reduce both subthreshold leakage and gate leakage, as it cuts off the path to the supply [18,19]. Conceptually, it is a straightforward technique; however, the implementation can be quite tricky in high-performance designs, where the performance trade-off is constrained to less than 2 percent frequency loss due to power gate (footer/header switch) insertion.
FIGURE 39.11 Latch clustering around LCBs in a high-performance block.
FIGURE 39.12 Cluster of latches around a single LCB.
Figure 39.13 shows a simple schematic of a logic block that has been power gated by a header switch (PFET) or a footer switch (NFET). Obviously, footer switches are preferred due to the better drive capability of NFETs. Operationally, if the logic block is not active, the SLEEP signal can turn off the NFET (footer switch) and the virtual ground (drain of
NFET) will float toward Vdd (the supply voltage), thereby reducing the leakage by orders of magnitude. Introducing a series transistor (footer/header) in the logic path results in a performance penalty. This performance penalty can be mitigated by making the footer/header larger so as to reduce the series resistance. However, the leakage benefit diminishes with increasing size of the power gate. Practically, in low-power applications, over 2000 times leakage saving can be obtained at the expense of an 8-10 percent reduction in performance. However, in high-performance designs, this is a relatively large performance penalty, so larger power gate sizes are chosen (approximately 6-8 percent of logic area) to achieve less than 2 percent performance penalty with over 20 times leakage reduction.
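The sizing trade-off can be sketched as a selection over characterized options (illustrative Python; the characterization numbers below are hypothetical, chosen to be consistent with the ranges quoted above):

```python
def pick_power_gate(options, max_delay_penalty_pct):
    """Pick the power-gate option with the best leakage reduction whose
    delay penalty fits the budget; smaller gates save more leakage but
    add more series resistance and hence more delay.
    options: list of (area_pct, delay_penalty_pct, leakage_reduction_x)."""
    feasible = [o for o in options if o[1] <= max_delay_penalty_pct]
    return max(feasible, key=lambda o: o[2]) if feasible else None

# Hypothetical characterization: a high-performance design with a
# 2 percent delay budget ends up with the largest (7% area) gate.
pick_power_gate([(2, 6.0, 500), (4, 3.5, 80), (7, 1.8, 22)], 2.0)
```

A low-power design with a looser budget would instead select a small gate with a much larger leakage saving.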
In general, power gating can be physically implemented using block-based coarse-grained power gating or intrablock fine-grained power gating (similar to multiple-supply-voltage design). In a block-based implementation, the footer (or header) switches surround the boundary of the block, as shown in Figure 39.14. This physical implementation is easier because it does not disturb the internal layout of the block. However, it has a potential drawback in terms of larger IR drop on the virtual ground supply. For IP blocks, this is the preferred implementation technique for power gating.
FIGURE 39.13 Power gating using header/footer switches.