Handbook of Algorithms for Physical Design Automation (Part 99)


FIGURE 46.4 Multilevel clustering and placement. (Based on Sankar, Y. and Rose, J., Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 157–166, 1999. With permission.)

The flow maintains a list of clusters of macros and a set of buckets that correspond to regions on the FPGA that have to house one cluster each. The buckets are all of the same shape, but unlike the work by Sankar and Rose [12], their shapes are not predetermined by the algorithm. Instead, the width (height) of the buckets is initially determined by the maximum width (height) of all macros, and as clustering progresses, the width might be increased so that it can fit larger clusters. For example, the algorithm could start with buckets of size 3 × 2, and after a clustering step merge them in pairs to get buckets of size 6 × 2 (it is not clear from Ref. [12] whether the bucket sizes in this example could also be set to 3 × 4, but the initial bucket shape is determined by the maximum macro width and height). The iterative process of clustering macros and increasing the size of the buckets is repeated until the number of clusters becomes less than or equal to the number of buckets. Because the width and height requirements of clusters are calculated using an upper-bound method, the buckets are guaranteed to have room for all clusters once there are at least as many buckets as there are clusters. Once the clustering phase is done, a tabu-search∗ cluster placement step follows. In this step, neighboring clusters are swapped using force-directed moves, as in Chapter 18. Once a cluster is moved, it is locked and will not be swapped until a prespecified number of other moves are attempted. The force-directed moves use connections between clusters as forces pulling highly connected clusters closer together. Toward the end of the intercluster placement phase, critical edges are assigned higher weights in the force calculations, and candidate clusters for swapping are chosen based on their timing criticality rank. Hence, the intercluster placement step starts by minimizing average wirelength and in its second phase minimizes timing-critical edges.
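
The cluster-and-grow loop described above can be sketched as follows; the Macro objects with w and h fields, the pairwise merging, and the upper-bound width estimate are illustrative assumptions, not details taken from Ref. [12].

```python
# Sketch of the iterative clustering / bucket-growing loop described above.
# Macro objects with .w/.h, pairwise merging, and the upper-bound estimate are
# illustrative stand-ins, not details from Ref. [12].

def cluster_until_fit(macros, fpga_w, fpga_h):
    bw = max(m.w for m in macros)          # initial bucket width  = widest macro
    bh = max(m.h for m in macros)          # initial bucket height = tallest macro
    clusters = [[m] for m in macros]       # start with one cluster per macro

    def buckets_available(w, h):
        return (fpga_w // w) * (fpga_h // h)

    def upper_bound_width(cluster):
        # Conservative width estimate: assume the macros sit side by side.
        return sum(m.w for m in cluster)

    while len(clusters) > buckets_available(bw, bh):
        # Merge clusters in pairs (a real flow would merge highly connected pairs).
        merged = [clusters[i] + clusters[i + 1] for i in range(0, len(clusters) - 1, 2)]
        if len(clusters) % 2:
            merged.append(clusters[-1])
        clusters = merged
        # Grow the bucket width only as much as the largest cluster requires.
        bw = max(bw, max(upper_bound_width(c) for c in clusters))
        if bw > fpga_w or buckets_available(bw, bh) == 0:
            raise ValueError("clusters no longer fit on the fabric")
    return clusters, (bw, bh)
```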

The intercluster placement is done in three phases: first, the hard macros are placed next to each other (same Y-coordinate); then, soft macros are assigned coordinates; and finally, soft macro shapes might be changed to fit all macros within the bucket. Figure 46.5 shows an example intercluster placement within a bucket of size 12 × 9, where modules m16, m19, m27, and m41 are hard macros, and the rest of the modules are soft macros. Note that the feasibility checks during clustering and bucket resizing guarantee that hard macros can be placed within their assigned buckets.

During hard macro placement, the center of gravity of the x-coordinates of all modules connected to a hard macro is calculated. Then hard macros are sorted based on the x-coordinate of the center of gravity and placed from the right edge of the bucket to the left in decreasing order of the center of

∗ Tabu search refers to a heuristic search algorithm in which certain moves are tried and a lock is placed on a move after it is tried so that it cannot be attempted again before a certain number of other moves are applied first. It is a fast solution-space exploration method that tries to avoid getting stuck in local minima by locking moves.



FIGURE 46.5 Intercluster placement example. (Based on Emmert, J. M. and Bhatia, D., Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 47–56, 1999. With permission.)

gravity coordinates. Soft macros are placed from left to right, filling logic block locations as shown by the arrow in Figure 46.5. A greedy method then moves logic blocks to minimize wirelength.
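
The right-to-left ordering of hard macros by connectivity center of gravity can be sketched as follows; the Macro objects, the connected_x map, and the single shared y-coordinate are simplifying assumptions rather than details from the original flow.

```python
# Sketch of ordering hard macros by the x center of gravity of their connected
# modules and packing them against the right edge of the bucket (illustrative).

def place_hard_macros(hard_macros, connected_x, bucket_w):
    """connected_x[m] lists the x-coordinates of modules connected to macro m."""
    def cog_x(m):
        xs = connected_x[m]
        return sum(xs) / len(xs) if xs else 0.0

    placements, right = {}, bucket_w
    # Largest center of gravity first, filled from the right edge toward the left.
    for m in sorted(hard_macros, key=cog_x, reverse=True):
        right -= m.w
        placements[m] = (right, 0)     # all hard macros share the same y-coordinate
    return placements
```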

46.3.2 FLOORPLANNING ON FPGAS WITH HETEROGENEOUS RESOURCES

Cheng and Wong [11] consider the floorplanning problem on FPGAs with heterogeneous resources such as memory blocks and embedded multipliers, described in Sections 45.5 and 45.6, respectively. The input to the problem is a set of modules, each with a vector of resource requirements, for example,

φ_i = (c_i, r_i, m_i), where φ_i is the resource requirement vector of module i, and c_i, r_i, and m_i are the number of logic, RAM block, and embedded multiplier units that the module needs. The floorplanning problem can be formulated as assigning nonoverlapping regions to the modules such that each region satisfies the resource requirements of the module assigned to it, all modules are assigned regions on the chip, and a given cost function such as wirelength is minimized.

Cheng and Wong [11] use slicing floorplans to explore the search space, ensuring the resource requirements are met when assigning locations and sizing the modules, as in Section 9.4.1. A postprocessing step follows that compacts modules by changing their shape to reduce the area of the floorplan. An example floorplan generated by this method is shown in Figure 46.6.

To ensure resource requirements are met, the authors define the irreducible realization list (IRL) for each module at any location (x, y) as a list L(θ, x, y) = {r | r ∈ θ, x(r) = x ∧ y(r) = y}, where r is defined as a rectangle r = (x, y, w, h) with bottom-left coordinates (x, y) and dimensions (w, h) such that it satisfies the resource requirements of the module. Another condition for the IRL of a module is that it should be the set of nondominated rectangles that satisfy the resource requirements of the module (i.e., no other rectangle at location (x, y) can be found that has a smaller width and a smaller height and satisfies the resource requirements of the module). Figure 46.7 shows IRLs at coordinates (4, 1) and (10, 0) for a module with resource requirement vector φ = (12, 1, 1). In Figure 46.7, dark modules are RAM blocks, and long white modules are multipliers.
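
A brute-force sketch of computing an IRL at a fixed location is given below; the satisfies(x, y, w, h, phi) predicate that checks the resources covered by a candidate rectangle against the requirement vector is an assumed helper, and the enumeration and dominance filter are ours, not the algorithm of Ref. [11].

```python
# Sketch: enumerate, for each width, the smallest satisfying height at (x, y),
# then keep only nondominated rectangles. satisfies() is assumed to check the
# logic/RAM/multiplier counts covered by the rectangle against phi.

def irl_at(x, y, phi, chip_w, chip_h, satisfies):
    candidates = []
    for w in range(1, chip_w - x + 1):
        for h in range(1, chip_h - y + 1):
            if satisfies(x, y, w, h, phi):
                candidates.append((w, h))      # minimal height for this width
                break
    # A rectangle is dominated if another candidate is no wider and no taller.
    return [(w, h) for (w, h) in candidates
            if not any(w2 <= w and h2 <= h and (w2, h2) != (w, h)
                       for (w2, h2) in candidates)]
```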

The heterogeneous floorplanning problem discussed in Ref. [11] differs from the traditional slicing floorplanning problem described in Chapter 9, because in the FPGA problem the shapes a module can take depend on the location at which it is placed (see the example of Figure 46.7), whereas

in the traditional problem formulation, the list of shapes a module can take is prespecified. This difference causes challenges when combining two subfloorplans. Care must be taken to ensure that the shape assigned to a subfloorplan during the bottom-up sizing process satisfies the resource


FIGURE 46.6 Floorplanning example. (Based on Cheng, L. and Wong, M. D. F., Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 292–299, 2004. With permission.)

requirements of the modules in the subfloorplans. The authors in Ref. [11] prove that generating the combined shape of two subfloorplans can be done in O(l log l) time, where l = max(W, H), in which W and H are the chip width and height, respectively.

A modified slicing-tree, annealing-based floorplanning algorithm is used to generate floorplans. The sizing process takes care of resource requirements as discussed above, and a cost function that includes floorplan area and wirelength as well as the sum of module aspect ratios is used. Because the FPGA fabric is tile-based, the authors can do a significant amount of preprocessing on each module, finding its IRL based on the (x, y) coordinate within a tile, and then using the data during floorplanning. A postprocessing step follows that compacts the floorplan (Figure 46.7). Interested readers are referred to Ref. [11] for details on the compaction process.

46.3.3 DYNAMIC FLOORPLANNING

Bazargan et al. introduced a floorplanning method for dynamically reconfigurable systems in Ref. [14]. Such systems allow modules to be loaded and unloaded on-the-fly to cater to applications'


FIGURE 46.7 Example of an IRL. (Based on Cheng, L. and Wong, M. D. F., Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 292–299, 2004. With permission.)


different needs at various points during runtime. A module is unloaded to free up space for future modules to be loaded onto the system. A module corresponds to a set of datapath operations, such as adders and multipliers, that perform computations in a program's basic block (refer to Section 46.4.4 for a discussion of basic block modules). Limited versions of such systems have been implemented in the past [15,16].

The method in Ref. [14] divides the chip into an explicit list of rectangular empty regions, called the list of maximal empty rectangles, and when a new module is to be placed on the chip, its dimensions are compared against the empty rectangles to see if it fits in any of them. Interconnections between large modules are ignored in this work, which means the floorplanning problem can be reduced to a two-dimensional bin-packing problem.∗ The number of empty rectangles in an arbitrary floorplan with the ability to remove as well as add modules is, in the worst case, quadratic in the number of active modules present on the chip [14]. As a result, the authors propose to keep a linear-size list of empty regions to speed up the floorplanning process at the cost of quality. If a suboptimal list is used, then there might be cases in which an arriving module can fit in an empty region, but the empty region is not present in the maintained list of empty rectangles. A number of heuristics are also provided that try to choose an empty rectangle to house a new module such that the chances that large enough empty regions remain available for future modules are maximized.
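
A small sketch of the rectangle-selection step follows, assuming the empty-rectangle list is kept as (x, y, w, h) tuples; the best-fit rule shown here is one of several heuristics one could plug in and is not claimed to be the one used in Ref. [14].

```python
# Sketch: pick an empty rectangle that can house an incoming module, preferring
# the one with the smallest leftover area so that large rectangles stay free
# for future modules.

def choose_rectangle(empty_rects, mod_w, mod_h):
    fits = [r for r in empty_rects if r[2] >= mod_w and r[3] >= mod_h]
    if not fits:
        return None                      # module must wait or be rejected
    return min(fits, key=lambda r: r[2] * r[3] - mod_w * mod_h)
```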

Handa and Vemuri [17] observe that even though the number of maximal empty rectangles could be quadratic in theory, in practice the number is more likely to be linear in the number of active modules on the chip. Instead of keeping an explicit list of empty rectangles, they encode the FPGA area using a smart data structure that can quickly determine whether an empty region is large enough to house an incoming module.

An example floorplan is shown in Figure 46.8. A positive number at a logic block location indicates the height of the empty region above the logic block, and a negative number can be used to find the distance to the right edge of a module. These numbers are used in obtaining maximal staircases, which are data structures that help keep track of empty regions without explicitly storing the location and dimensions of every maximal empty rectangle. Such a methodology improves runtime on average (the worst-case delay is still quadratic).
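
A rough sketch of the per-cell encoding just described is given below, assuming an occupancy grid indexed as grid[row][column] with None for empty cells; the exact orientation and sign conventions of Ref. [17] may differ from this simplification.

```python
# Sketch of the encoding: empty cells get the height of the empty run above them
# (including themselves), occupied cells get the negated distance to the right
# edge of the module that covers them.

def encode(grid):
    H, W = len(grid), len(grid[0])
    code = [[0] * W for _ in range(H)]
    for x in range(W):                      # columns: count empty cells top-down
        run = 0
        for y in range(H):
            run = run + 1 if grid[y][x] is None else 0
            if grid[y][x] is None:
                code[y][x] = run
    for y in range(H):                      # rows: distance to the module's right edge
        for x in range(W - 1, -1, -1):
            if grid[y][x] is not None:
                same = x + 1 < W and grid[y][x + 1] == grid[y][x]
                code[y][x] = code[y][x + 1] - 1 if same else -1
    return code
```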

Unlike Bazargan et al. [14] and Handa and Vemuri [17], who assume that floorplanning decisions must be taken at runtime, Singhal and Bozorgzadeh [18] assume that the flow of computations


FIGURE 46.8 Encoding the area of the floorplan. (From Handa, M. and Vemuri, R., Proceedings of the ACM/IEEE Design Automation Conference, pp. 960–965, 2004. With permission.)

∗ In real-life applications, interconnections between modules cannot be ignored. Even if modules do not communicate directly, they need to get their input data from and write their results into memory resources and buffers on the FPGA. Such interactions are ignored in both Refs. [14,17].


[Figure 46.9 shows Design 1 and Design 2 floorplanned under Cases 1–3, with shared regions on the chip.]

FIGURE 46.9 Reusing partial configurations. (From Singhal, L. and Bozorgzadeh, E., Proceedings of the 2006 International Conference on Field Programmable Logic and Applications, 2006. With permission.)

is known a priori and floorplanning for multiple configurations can be done at compile time. The goal of their approach is to floorplan multiple designs so that the number of shared modules between consecutive configurations is maximized while area is minimized and timing constraints are met. Figure 46.9 shows three floorplan examples for a two-configuration system. In Figure 46.9, design 1 is first loaded on the FPGA, followed by design 2. Assuming that modules m1, m2, ..., m7 are the same, case 1 in Figure 46.9 shares only the configurations of m1 and m4, while case 3 shares three modules when transitioning from design 1 to design 2. As a result, case 3 requires the least amount of configuration time. The challenge is to share as many modules as possible between the two floorplans while not increasing the critical path delay of either configuration. They propose a new floorplanning data structure called multilayer sequence pairs, which, as the name suggests, is an extension of the sequence pair data structure described in Section 11.5. Consider two designs D1 and D2. Assuming that modules s1, s2, ..., sk are shared between the two designs, D1 has exclusive modules m1, m2, ..., mM, and D2 has exclusive modules n1, n2, ..., nN, they build one sequence pair consisting of the modules {s1, ..., sk} ∪ {m1, ..., mM} ∪ {n1, ..., nN}. Floorplanning moves are similar to regular sequence pair moves. However, when building the horizontal and vertical constraint graphs, edges are not added between exclusive modules of one design and exclusive modules of the other. As a result, by construction, the shared modules are placed at the same location in the two designs. The cost function includes terms for overall area, aspect ratio, configuration length, the wirelengths of the two designs, and their congestions. They compare their method to a method that first floorplans one design independently of the other and then fixes the locations of the shared modules in the floorplanning of the second design. Their approach outperforms this simplistic method because it optimizes the two floorplans simultaneously. Although they showed results on only two configurations, their approach could be extended to multiple configurations.
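
The key construction, skipping constraint edges between modules that never coexist, can be sketched as below; the naive pairwise scan and the design_of labels are simplifications of the multilayer sequence pair machinery, not the implementation of Ref. [18].

```python
# Sketch: derive horizontal "left-of" constraint edges from one sequence pair
# over shared + exclusive modules, omitting edges between modules that are
# exclusive to different designs (they never coexist on the fabric).

def horizontal_edges(gamma_pos, gamma_neg, design_of):
    """design_of[m] is 'shared', 'D1', or 'D2'."""
    pos = {m: i for i, m in enumerate(gamma_pos)}
    neg = {m: i for i, m in enumerate(gamma_neg)}
    edges = []
    for a in gamma_pos:
        for b in gamma_pos:
            if a == b:
                continue
            # Sequence pair rule: a is left of b iff a precedes b in both sequences.
            if pos[a] < pos[b] and neg[a] < neg[b]:
                if {design_of[a], design_of[b]} == {"D1", "D2"}:
                    continue             # cross-design exclusive pair: no edge
                edges.append((a, b))
    return edges
```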

46.4 PLACEMENT

Early FPGA placement algorithms emerged as extensions of their ASIC counterparts. Simulated annealing was the optimization engine of choice, and it is still the most common method for academic placement engines. Even though major strides have been taken in improving the quality of FPGA placement tools, there is still much room for improvement, as shown in Ref. [19]. The authors first synthetically generated a number of circuits with known optimal solutions and then ran a number of FPGA placement algorithms on them, showing that the length of the longest path could be 10–18 percent worse than the optimal solution on average, and 34–53 percent longer in the worst case. These results are for the case in which only one path is timing-critical. If multiple critical paths are present in the circuit, then the results of existing FPGA placement algorithms are 23–35 percent worse than the optimal on average, and 41–48 percent worse in the worst case.

46.4.1 ISLAND-STYLE FPGA PLACEMENT

There are a number of methods used in the placement of FPGAs. The dominant method is based on a simulated annealing engine, as in Chapter 16. There are also partitioning-based and hierarchical methods, which we discuss later in this subsection.

Versatile placement and routing (VPR) [6] is arguably the most popular FPGA placement and routing tool, and it is widely used in academic and industrial research projects. VPR was originally developed to help FPGA architecture designers place and route circuits with various architectural parameters (e.g., switch-block architecture, the number of tracks to which the input pins of logic blocks connect [Fc], logic output Fc, etc.). Its flow reads an architectural description file along with the technology-mapped netlist.

VPR uses a simulated annealing engine to minimize wirelength and congestion. The cost function that the annealing algorithm uses is

\text{WiringCost} = \sum_{n=1}^{N_{\text{nets}}} q(n) \left[ \frac{bb_x(n)}{C_{av,x}(n)} + \frac{bb_y(n)}{C_{av,y}(n)} \right] \qquad (46.1)

where

q(n) is a weighting factor that adjusts the wirelength estimation as a function of the net's number of terminals,

bb_x(n) and bb_y(n) are the horizontal and vertical spans of the net's bounding box, and

C_av,x(n) and C_av,y(n) are the average channel capacities in the x and y directions over the bounding box of net n.

Function q(n) is defined in Equation 46.2, where T(n) is the number of terminals of net n. The function is equal to 1 for nets with three or fewer terminals, gradually increases to 2.79 for nets with at most 50 terminals, and increases linearly for nets with more than 50 terminals:

q(n) =
\begin{cases}
1 & T(n) \le 3 \\
\mathrm{RISA}[T(n)] & 3 < T(n) \le 50 \\
2.79 + 0.02616\,[T(n) - 50] & T(n) > 50
\end{cases}
\qquad (46.2)

Internally, VPR uses a lookup table RISA[] to obtain the value of q for nets that have at most 50 terminals. The values in the table come from the RISA routability model [20]. Essentially, RISA models the amount of routing resource sharing as the number of terminals of a net increases. The annealing parameters used in VPR automatically adjust to different circuit sizes and costs to achieve high-quality placements. Furthermore, the parameters change dynamically in response to improvements in cost.
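
A sketch of Equations 46.1 and 46.2 in code follows; RISA_TABLE is a placeholder for VPR's internal lookup table (its values, from Ref. [20], are not reproduced here), and the net attribute names are assumptions.

```python
# Sketch of VPR's wiring cost (Equation 46.1) with the q(n) adjustment of
# Equation 46.2. RISA_TABLE stands in for the internal RISA[] lookup table.

RISA_TABLE = {}   # placeholder: q values for 3 < T(n) <= 50, taken from Ref. [20]

def q(num_terminals):
    if num_terminals <= 3:
        return 1.0
    if num_terminals <= 50:
        return RISA_TABLE.get(num_terminals, 1.0)
    return 2.79 + 0.02616 * (num_terminals - 50)

def wiring_cost(nets):
    """Each net carries its terminal count, bounding-box spans bb_x/bb_y, and the
    average channel capacities c_av_x/c_av_y over that bounding box."""
    return sum(q(n.num_terminals) * (n.bb_x / n.c_av_x + n.bb_y / n.c_av_y)
               for n in nets)
```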

The timing-driven version of VPR is called TVPR [21] (its placement algorithm is called T-VPlace, after VPR's VPlace). TVPR optimizes for wirelength and timing simultaneously. The delay of a net is estimated using an optimistic delay model. For any bounding box that spans from coordinates (0, 0) to (x, y), the router is invoked on the FPGA architecture with all routing resources available, so the best combination of wire segments and switches is used to route the net with the smallest possible delay. The delay achieved by the router is recorded in a table at indices [x, y]. The process is repeated for 1 ≤ x ≤ W and 1 ≤ y ≤ H, where W and H are the width and the height of the FPGA chip. The values in the table are optimistic because a net might not be routable using the best routing resources owing to congestion. Furthermore, because FPGAs are built as arrays of tiles, the values in the delay table are valid for any starting point, not just (0, 0); the table effectively stores delays as a function of the distances (Δx, Δy). The delay between a source node i and sink node j of a two-terminal net is therefore d(i, j) = TableLookup(|x_i − x_j|, |y_i − y_j|), where (x_i, y_i) and (x_j, y_j) are the coordinates of nodes i and j and TableLookup is the array storing the precomputed delays. The delay values can be used as lower-bound estimates for the individual sinks of multiterminal nets. A multifanout net can be broken into two-terminal (source, sink) nets. Using the table for individual sinks is valid because buffers are heavily used in FPGA routing trees, effectively cutting off the branches of a route and converting it into two-terminal routes.
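
The delay-table construction and lookup can be sketched as follows; route_best_delay is an assumed stand-in for invoking the router on a congestion-free fabric, and zero-distance entries are simply left at zero in this sketch.

```python
# Sketch of TVPR's precomputed delay table: index by horizontal and vertical
# distance, then look up optimistic source-sink delays during placement.

def build_delay_table(W, H, route_best_delay):
    table = [[0.0] * (H + 1) for _ in range(W + 1)]
    for dx in range(1, W + 1):
        for dy in range(1, H + 1):
            # Optimistic: the router sees every routing resource as available.
            table[dx][dy] = route_best_delay(dx, dy)
    return table

def net_delay(table, src, sink):
    (xi, yi), (xj, yj) = src, sink
    return table[abs(xi - xj)][abs(yi - yj)]
```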

The timing, wirelength, and congestion costs are combined in TVPR. The timing cost is calculated as a weighted sum of net delays. The timing cost of a net between source i and sink j is calculated as

\text{NetTimingCost}(i, j) = d(i, j) \cdot \text{criticality}(i, j)^{\beta} \qquad (46.3)

\text{criticality}(i, j) = 1 - \text{slack}(i, j)/D_{\max} \qquad (46.4)

where

slack(i, j) is calculated using static timing analysis, described in Section 3.1.1.3, and

D_max is the critical path delay.

Parameter β can be tuned by the user. The timing cost component is defined as TimingCost = Σ_{i,j} NetTimingCost(i, j). The overall cost function in TVPR is defined as

\text{Cost} = \lambda\,\frac{\text{TimingCost}}{\text{PrevTimingCost}} + (1 - \lambda)\,\frac{\text{WiringCost}}{\text{PrevWiringCost}} \qquad (46.5)

where λ can be tuned to trade off between timing and congestion. Cost is used during the annealing decision process to accept or reject a move based on its improvement of the wiring and timing costs over the previous solution.
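
A sketch of the normalized cost of Equation 46.5 together with a standard annealing acceptance test is shown below; the Metropolis rule used here is the usual simulated annealing criterion, not necessarily the exact acceptance schedule used by TVPR.

```python
import math
import random

# Sketch: combine timing and wiring costs as in Equation 46.5 and decide whether
# to accept a move; prev_* are the costs of the previously accepted solution.

def combined_cost(timing_cost, wiring_cost, prev_timing, prev_wiring, lam=0.5):
    return lam * timing_cost / prev_timing + (1.0 - lam) * wiring_cost / prev_wiring

def accept_move(delta_cost, temperature):
    if delta_cost <= 0:
        return True                              # always accept improvements
    return random.random() < math.exp(-delta_cost / temperature)
```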

The timing delay table enables TVPR to strike a reasonable balance between fast runtime and an acceptable lower-bound estimate of the delays of all nets. However, using a lower bound during placement is bound to introduce a disconnect between what the placement engine thinks the router is going to do and what the router actually does. To overcome this discrepancy between the placement's notion of net delays and the actual delays after routing, Nag and Rutenbar [22] perform detailed routing at every step of the placement. The method is computationally expensive, but it shows that 8–15 percent improvements in delay can be achieved when routing is used inside the placement loop.

Maidee et al. [23] took a different approach in their partitioning-based placement for FPGAs (PPFF) tool: they first placed and routed sample benchmarks and found empirical relationships between a net's bounding-box wirelength, its timing criticality at the end of the routing step, and the type of routing resources used to route it. This study provides a better approximation of the routing behavior for use during placement. They showed that a 5 percent delay improvement can be achieved using the empirical routing models during the annealing placement phase. PPFF's main mode of operation, however, is not annealing; it uses a partitioning-based placement engine. We cover more details of PPFF in Section 46.4.2.


46.4.2 HIERARCHICAL FPGA PLACEMENT

A clustered FPGA architecture, such as the one discussed in Section 46.2.2, is an example of a hierarchical architecture with one level of hierarchy. In general, hierarchical architectures might have several tiers of hierarchy, and usually the lower levels of the hierarchy can be connected to each other using faster routing resources.

The authors in Ref. [24] introduced a hierarchical placement algorithm for a hierarchical FPGA. In a hierarchical architecture, logic blocks are grouped into clusters at different levels of hierarchy. The leaf-level nodes are the closest and can communicate through fast routing resources. The next level of routing connects clusters of leaves using second-tier routing resources that are slightly slower than the first-level resources. Their method can accommodate an arbitrary number of hierarchy levels, as long as a higher level of hierarchy has slower routing resources than a lower level. For each output cone in the circuit, they compute lower and upper bounds on the number of hierarchy levels the cone has to pass through. For example, if an output cone has only four logic elements in series and the lowest hierarchy level contains four logic blocks, the lower-bound delay on this cone is four times the delay of the fastest routing resource. The upper bound would occur when each of the blocks on the path inside the cone is placed in a different partition at the highest level of the hierarchy. In this scenario, the delay would be four times the sum of the delays of all levels of the hierarchy.

After obtaining the lower and upper bounds on the delays of all cones, they divide the cones into three categories. The first category contains paths whose lower-bound delay is close to the delay constraint of the circuit; these paths are labeled critical. Paths in the second category are those whose upper-bound delays violate the timing constraint but whose lower-bound delays do not. The third category contains cones whose upper-bound delay is smaller than the circuit's target delay. They prune these paths out of the placement process and focus only on the first two categories. As a result, they reduce the circuit size by about 50 percent.
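
The three-way classification can be sketched as follows; the cone objects and the critical_margin threshold that decides "close to the delay constraint" are illustrative assumptions, not values from Ref. [24].

```python
# Sketch of categorizing output cones by their lower/upper bound delays.

def categorize_cones(cones, target_delay, critical_margin=0.9):
    critical, uncertain, pruned = [], [], []
    for c in cones:
        if c.upper_bound <= target_delay:
            pruned.append(c)        # can never violate timing; dropped from placement
        elif c.lower_bound >= critical_margin * target_delay:
            critical.append(c)      # lower bound already close to the constraint
        else:
            uncertain.append(c)     # upper bound violates, lower bound does not
    return critical, uncertain, pruned
```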

Another partitioning-based, timing-driven placement method for hierarchical FPGAs (specifically Altera's) was presented by Hutton et al. [25]. The authors perform timing analysis at each level of the partitioning and place the netlist, trying to prevent potentially critical nets from becoming critical. The difference between this method and the one presented in Ref. [24] is that Hutton's method updates the current estimate of the criticality of the paths as the placement process progresses, whereas Senouci's method computes crude estimates as upper- and lower-bound delays at the beginning and never updates them.

46.4.3 PHYSICAL SYNTHESIS AND INCREMENTAL PLACEMENT METHODS

Physical synthesis refers to the process of simultaneously performing placement and logic optimization (e.g., resynthesizing a group of gates, gate duplication, retiming, etc.). Doing so has the advantage that the timing estimates available to the synthesis engine are more accurate, and only those synthesis optimization moves will be attempted whose benefits are sustained after placement. Timing improvements of 20–25 percent have been reported using physical synthesis [26,27] compared to a placement method that does not consider physical synthesis. In this section, we primarily focus on the placement methods used in physical synthesis approaches. Because placement and resynthesis are attempted iteratively, most of the placement methods used in physical synthesis are of an incremental nature.

Chen et al. [26] consider LUT duplication to improve timing. Timing could be improved by duplicating a LUT x that is driving two sink LUTs y and z. If we call the duplicated LUT x′, then x′ has to have the same inputs and the same functionality as x. Moving x′ closer to y, which we assume is the more timing-critical sink, results in a timing improvement because the connection (x′, y) is shorter than (x, y), assuming that we do not increase the wirelength of the input wires to x′ compared to the input wirelengths of x.


[Figure 46.10 panels: (a) feasible region of a node; (b) super-feasible region.]

FIGURE 46.10 Feasible and super-feasible regions. (From Chen, G. and Cong, J., Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 51–59, 2005. With permission.)

Placement is modified by either moving or duplicating critical gates such that critical paths become monotone. A path is defined to be monotone if the X (Y) coordinates of successive LUTs along the path increase or decrease monotonically. If a path is not monotone, it is taking detours, and its delay could be improved by moving LUTs (or duplicating a LUT and moving the duplicate) so that the path becomes closer to a monotone path and hence its wirelength becomes smaller. The placement methods that Chen et al. use are (1) move a LUT or its duplicate to a location within the feasible or super-feasible region (Figure 46.10), and (2) legalize the placement immediately. Assume that node S gets its inputs from critical nodes i1, i2, ..., ik and provides an input to node T. The feasible region for node S, shown in Figure 46.10a, is defined to be the rectangular area in which node S can be moved without increasing the length of the path from any of its critical fanin nodes i1, i2, ..., ik to its fanout node T.

A super-feasible region, shown in Figure 46.10b, is a rectangular area that not only does not increase the length of the path from an immediate fanin to the fanout, but also converts all global paths from the primary inputs in the fanin cone of a node to its fanout node into monotone paths. Moving a node to its desired destination location might result in overcrowding, as the destination configurable logic block (CLB), which in modern FPGAs consists of LUTs and local interconnects, might already be fully utilized. Hence, there is a need for a legalization method after a placement move.
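
The condition underlying the feasible region can be sketched as a membership test with Manhattan distances; Ref. [26] derives the region as a rectangle, whereas this check simply verifies the defining property for a candidate location.

```python
# Sketch: a candidate location for node S is feasible if it does not lengthen
# the path from any critical fanin of S through S to its fanout T.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def in_feasible_region(candidate, current, fanin_locs, fanout_loc):
    for fi in fanin_locs:
        old_len = manhattan(fi, current) + manhattan(current, fanout_loc)
        new_len = manhattan(fi, candidate) + manhattan(candidate, fanout_loc)
        if new_len > old_len:
            return False
    return True
```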

To choose a particular location in a feasible or super-feasible region to move a node to, they consider the replacement cost after legalization. The replacement cost is a linear combination of the slack improvement of the node being moved, the congestion cost of the destination, and the accumulated cost of moving other nodes to legalize the placement. The legalization procedure is an improvement over Mongrel [28]. The goal of the legalization procedure is to move nodes from overcrowded regions toward empty regions while minimally disturbing the placement. Assuming that the overcrowded CLB is at location (x, y) and the vacant CLB is at (x + w, y + h), a grid graph is constructed with w × h nodes in which each node has outgoing edges to its east and north neighbors. Finding a path from the lower-left to the upper-right node in the grid graph determines the consecutive CLBs that LUT nodes should move through, resulting in a ripple move that transfers LUTs from overcrowded to vacant regions. The weight on an edge is determined by the amount of disturbance to a cost function (e.g., wirelength or delay) as a result of that particular move. To minimally change the current placement, a node will only move one unit to the right or up. One of the LUTs at the newly overcrowded CLB must in turn move either to the right or up. The goal is to find a sequence of replacements from (x, y) to (x + w, y + h) such that the overall cost of replacing the nodes is minimized. The authors in Ref. [26] solve this problem optimally using a longest-path approach for cost functions that are linear in the change in the physical location of the nodes. For example, a wirelength-based cost function can be handled optimally, but solving for the minimum


delay change cannot, because the change in the delay of a path containing a LUT that is moved is not a linear function of the amount of dislocation of that LUT.
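
The ripple-move path search can be sketched as a dynamic program over the w × h grid graph; edge_cost is an assumed callback returning the disturbance of moving one LUT one unit east or north, and the min-cost formulation below is a simplification of the longest-path formulation mentioned above.

```python
# Sketch: find the cheapest monotone (east/north) path from the overcrowded CLB
# at (0, 0) to the vacant CLB at (w, h) in the grid graph described above.

def best_ripple_path_cost(w, h, edge_cost):
    INF = float("inf")
    cost = [[INF] * (h + 1) for _ in range(w + 1)]
    cost[0][0] = 0.0
    for x in range(w + 1):
        for y in range(h + 1):
            if cost[x][y] == INF:
                continue
            if x < w:                    # ripple one LUT one unit to the east
                cost[x + 1][y] = min(cost[x + 1][y], cost[x][y] + edge_cost(x, y, "E"))
            if y < h:                    # ripple one LUT one unit to the north
                cost[x][y + 1] = min(cost[x][y + 1], cost[x][y] + edge_cost(x, y, "N"))
    return cost[w][h]
```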

Another incremental placement method is presented by Singh and Brown [27]. In their approach, they define a minimally disruptive placement to be an incremental placement that (1) is a legal placement, (2) does not increase the delay on the critical paths, and (3) does not increase routing area. Condition (1) means that the incremental placement algorithm must be flexible enough to handle many architectural constraints, such as the number of inputs to a CLB, the flexibility in the connections of the registers within the CLB, etc. Condition (2) means that a node can move anywhere as long as it does not violate the current timing constraints. Finally, condition (3) means that the new placement should be routable.

The incremental placement algorithm starts by moving a few nodes to their preferred locations, determined by the synthesis engine. Then, architectural violations are gradually removed by iteratively modifying the placement. At every placement move, a combination of three cost functions, namely the cluster legality cost, the timing cost, and the congestion cost,∗ is evaluated, and the move is accepted greedily, that is, a move is only accepted if it reduces the overall cost. The legality cost component includes the legality of clusters based on the number of inputs, outputs, LUTs, etc. In their timing cost, they introduce a damping function that limits the range of movement of a node based on its slack: the larger the slack, the farther the node can move. This is designed to reduce fluctuations in the timing cost caused by near-critical nodes becoming critical as a result of moving long distances. The move set includes moving a candidate node to the cluster containing one of its fanin nodes, to the cluster containing one of its fanout nodes, to a neighboring CLB, to a random vacant CLB, in the direction of the critical vector, or to its sibling cluster. The notion of a sibling cluster is shown in Figure 46.11. Moving in the direction of the critical vector is similar to moving a node within the feasible region in the Chen and Cong work [26].

To avoid getting stuck in local minima, the authors propose a hill-climbing method that is applied after a number of greedy moves fail to resolve architectural violations. The violation costs of CLBs whose violations have not been resolved for a long time are increased compared to other CLBs, allowing LUTs to


FIGURE 46.11 Fanin, fanout, and sibling relationships. (Based on Singh, D. P. and Brown, S. D., Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 752–759, 2002.)

∗ The authors call the congestion cost the wirelength cost, but they essentially evaluate congestion, not wirelength.
