


2 do
    …
9 until cost cannot be reduced

FIGURE 16.1 Basic simulated annealing algorithm.

Unfortunately, the placement problem described here does not exhibit optimal substructure. If we apply the greedy algorithm search strategy, we will usually get stuck in a local minimum. This means that

cost(j_min) ≤ cost(j) for all j ∈ S(j_min)

where j_min is the local minimum state, and S(j_min) is the set of states reachable from the state j_min. In many cases, there is a large disparity between the local minimum and the global minimum cost. We need a search strategy that avoids local minima and finds the global minimum. Simulated annealing is such a search strategy.

At the heart of the simulated annealing algorithm is the Metropolis Monte Carlo procedure, which was introduced to provide an efficient simulation of a collection of atoms in equilibrium at a given temperature [2]. The Metropolis procedure is the inner loop of the simulated annealing algorithm, as shown in Figure 16.1. Although the greedy algorithm forbids changes of state that increase the cost function, the Metropolis procedure allows moves to states that increase the cost function. Kirkpatrick et al. suggested that the Metropolis Monte Carlo method can be used to simulate the physical annealing process and to solve combinatorial optimization problems [1]. They suggested adding an outer loop that lowers the temperature from a high melting temperature in slow stages until the system freezes and no further changes occur. At each temperature, the simulation must proceed long enough for the system to reach a steady state. The sequence of temperatures and the method used to reach equilibrium at each temperature is known as the annealing schedule. They showed that this same technique can be applied to combinatorial optimization problems if a cost function is used in place of energy and the temperature is used as a control parameter.

16.2 ANNEALING SCHEDULES

It has been shown that the simulated annealing algorithm, when started in an arbitrary state and given an appropriate annealing schedule, will eventually converge to a global optimum [3]. Although these results require an infinite amount of computation time for the convergence guarantee, in practice simulated annealing has been extremely successful when applied to circuit partitioning and placement problems; it has outperformed all other known algorithms when given sufficient time resources. The essential elements of the simulated annealing algorithm are summarized in Figure 16.1. The algorithm consists of two loops. Each execution of the inner loop generates new configurations to be evaluated at constant temperature. The acceptance of a new configuration j depends on the current temperature T and the change in cost between the current configuration i and the proposed configuration j, as presented in Figure 16.2. All configuration changes that do not increase the cost are accepted, as in any iterative improvement algorithm, but moves with ΔC > 0 are accepted depending on the value of ΔC and the value of T. The Boltzmann distribution e^(−ΔC/T) that governs physical annealing is used as the criterion for determining acceptance of states with increased cost.

1 if ΔC ≤ 0 then /*new cost is less than or equal to the old cost*/
2     return(ACCEPT)
3 else if random(0, 1) < e^(−ΔC/T) then return(ACCEPT)
4 else return(REJECT)

FIGURE 16.2 Acceptance function for the simulated annealing algorithm.
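To make the acceptance test concrete, the function of Figure 16.2 can be sketched in Python as follows; the function name and the explicit uniform random draw are our conventions rather than part of the original figure.

import math
import random

def accept(delta_c: float, temperature: float) -> bool:
    """Metropolis acceptance test: always accept downhill moves; accept
    uphill moves with Boltzmann probability e^(-delta_c/T)."""
    if delta_c <= 0:        # new cost is less than or equal to the old cost
        return True
    if temperature <= 0:    # guard: at T = 0 only downhill moves are taken
        return False
    return random.random() < math.exp(-delta_c / temperature)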

In this simple formulation of simulated annealing, we designate that the inner loop is repeated until the average value of the cost appears to have converged. As T is lowered from a high value, large uphill moves are mostly rejected. As T is lowered further, moves with yet lower values of ΔC > 0 become largely rejected. In some sense, critical decisions are made for those values of ΔC > 0 that are on the order of the value of T. Hence, simulated annealing operates in a pseudohierarchical fashion with respect to ΔC values as T is decreased.

16.3 SIMULATED ANNEALING AND PLACEMENT

The critical ingredients for implementing a successful placer based on simulated annealing are the simulated annealing cooling schedule, the cost function to be evaluated, and the generation of new state configurations, or move strategies. Although simulated annealing placers are quite straightforward to implement, the best results in terms of quality and execution time have been obtained with careful attention to these details. We discuss each of these aspects of simulated annealing in turn.

16.4 SIMULATED ANNEALING COOLING SCHEDULES

A simulated annealing cooling schedule is differentiated by the implementation of four lines of the basic annealing schedule presented in Figure 16.1: initial temperature selection, temperature equilibrium criteria, temperature update, and stopping criteria. A common implementation is easily coded as shown in Figure 16.3. Here, iterations is a variable that counts the number of Metropolis cycles or inner loop executions, numberOfMoves is a variable that counts the number of new configurations generated in an iteration, Imax is the predetermined maximum number of iterations, Nmax is the predetermined maximum number of moves generated per iteration, and α is the temperature multiplier.

Although the previous implementation is simple, effective, and easily programmed, it has a major drawback: at low temperatures, the running time is very long because many candidate moves are rejected before each move to a different configuration. To remedy this inefficiency, various approaches have been proposed to speed up the algorithm, including parallel implementations [4–6] as well as rejectionless hill climbing [7]. Lam studied the problem and proposed a statistical annealing schedule [8]. Lam's schedule is based on the observation that annealing is successful if the system is kept close to thermal equilibrium as the temperature is lowered. However, keeping the system in equilibrium at all times requires that the temperature decrements be infinitesimal; a long time would pass before the system is frozen and annealing is stopped. From a practical standpoint, a good annealing schedule must, therefore, achieve a compromise between the quality of the final solution and the computation time.


1   T ← largeNumber /*initial temperature*/
1b  iterations ← 0
2   do
2b    numberOfMoves ← 0
3     do
4       generate a new configuration and compute ΔC
5       if accept(ΔC, T) then /*Metropolis function*/
6         adopt the new configuration
7       numberOfMoves ← numberOfMoves + 1
7b    until numberOfMoves ≥ Nmax
8     T ← αT, 0.8 ≤ α ≤ 0.99 /*temperature decrement*/
8b    iterations ← iterations + 1
9   until iterations ≥ Imax

FIGURE 16.3 Simple simulated annealing algorithm.
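A compact Python rendering of this simple schedule is sketched below; propose_move and cost are placeholders for what a real placer would supply, and the default parameter values are illustrative.

def simulated_annealing_simple(state, cost, propose_move, accept,
                               T0=1.0e6, i_max=150, n_max=10000, alpha=0.95):
    """Simple geometric cooling: i_max temperature stages of n_max proposed
    moves each, with T multiplied by alpha (0.8 <= alpha <= 0.99) per stage."""
    T = T0
    best = state
    for _ in range(i_max):                  # outer loop: temperature stages
        for _ in range(n_max):              # inner loop: one Metropolis cycle
            candidate = propose_move(state)
            delta_c = cost(candidate) - cost(state)
            if accept(delta_c, T):          # Metropolis function (Figure 16.2)
                state = candidate
                if cost(state) < cost(best):
                    best = state
        T *= alpha                          # temperature decrement
    return best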

To determine when the system is in equilibrium so that the temperature can be lowered, we need an equilibrium criterion [9]. A system is close to equilibrium at temperature T if the following condition is satisfied:

|c̄ − μ(s)| ≤ λ·σ(s)

where
c̄ is the average cost of the system
s = 1/T is the inverse temperature
μ(s) and σ(s) are the mean and standard deviation of the cost if the system were in thermal equilibrium at temperature T

The parameter λ, which can be made as small as desired to ensure a good approximation of equilibrium, realizes the compromise between the quality of the final solution and the computation time: the smaller the λ, the better the quality of the final solution and the longer the computation time.
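A sketch of this test in Python, assuming μ(s) and σ(s) are estimated from costs sampled at the current temperature (the estimation scheme is an assumption on our part, not part of Lam's derivation):

import statistics

def near_equilibrium(c_bar, cost_samples, lam=0.05):
    """Return True if |c_bar - mu(s)| <= lam * sigma(s), with mu and sigma
    estimated from costs sampled at the current temperature; cost_samples
    must contain at least two values."""
    mu = statistics.mean(cost_samples)       # estimate of mu(s)
    sigma = statistics.stdev(cost_samples)   # estimate of sigma(s)
    return abs(c_bar - mu) <= lam * sigma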

Simulated annealing has been applied to the placement problem in the TimberWolf system. Complete accounts of the implementations of simulated annealing for earlier versions of the TimberWolf placement programs have been published [10–18]. The inclusion of the results of a theoretically derived statistical annealing schedule has been responsible for the very significant reduction in the CPU time required by TimberWolf.

We now present the adaptation of Lam's statistical annealing schedule [8] found in TimberWolf. In his work, Lam showed theoretically that the optimum acceptance rate of proposed new configurations is approximately 44 percent. In Lam's algorithm, a range limiter window (first described in Ref. [10]) is used to keep the acceptance rate (denoted as ρ) as close as possible to 44 percent. (The range limiter window bounds the magnitude of the perturbation, or move distance, from the current state. The range limiter window size is designed to increase the acceptance rate at a given temperature. Changes in cost are on the order of the move distance; therefore, reducing the move distance yields smaller changes in cost and hence an elevated acceptance rate.) In the beginning of the execution of this algorithm, the temperature T is set to a very high value (effectively infinity). Even with the range limiter dimensions encompassing the entire chip, the acceptance rate ρ approaches 100 percent. Because a further increase in range limiter dimensions cannot decrease ρ, there clearly must be a region of operation for the algorithm in which ρ is above the ideal value of 44 percent. Also, as T gets sufficiently low, the range limiter dimensions reduce to their minimum values. Then, as ρ drops below 44 percent, there is no way for it to return to a higher level.

FIGURE 16.4 Anticipated plot of the acceptance rate ρ versus generated new configurations (vertical axis marked at 1.0 and 0.44; three regions of operation).

It is therefore apparent that there is a region of operation in which ρ falls from 44 percent toward zero as T approaches zero. The anticipated three regions of operation (ρ above 0.44, ρ equal to 0.44, and ρ below 0.44) are illustrated in Figure 16.4. One disadvantage of the schedule developed by Lam is its inability to predict accurately, at the beginning of a run, when the execution of the algorithm will end. That is, it is not known how many new configurations will be generated during the course of the execution of the algorithm. In an effort to gain a different perspective on Lam's theory, the authors of TimberWolf measured ρ versus generated new configurations for executions on several industrial circuits. One objective was to determine the percentage of the run (i.e., the percentage of the total new configurations generated) devoted to each of the three regions of operation. These percentages were remarkably similar for the very wide range of circuit sizes that were tested. A typical plot is shown in Figure 16.5.

FIGURE 16.5 Typical measured acceptance rate ρ versus generated new configurations, as obtained from experiments conducted on several industrial circuits, showing the percentage of the run spent in each region of operation.

They discovered that for region 1 (which encompasses approximately 15 percent of the run), ρ versus generated new configurations could be modeled by an exponential function. This function has a peak value of 1.0 and passes through the point where ρ first reduces to 0.44. Furthermore, they found that region 3 could also be modeled by an exponential function, with peak value 0.44 and minimum value 0.0. In region 2, the acceptance rate is flat, but they discovered that the decrease in the range limiter window dimensions as a function of generated new configurations can also be modeled by an exponential functional form. That all three regions can be modeled by exponential functions is not surprising in light of the (exponential) Boltzmann-like factor used to govern acceptance or rejection of new configurations. Here they define an iteration (represented by I, where 1 ≤ I ≤ Imax) to correspond to an interval along the horizontal axis in Figure 16.6. That is, Nmax new configurations are generated during iteration I. An iteration defines a set of Nmax moves during which the range limiter window dimensions remain constant.

In simulated annealing, the more new configurations generated during the course of a run, the higher the probability of achieving a better solution. However, extensive experimentation suggested the existence of a diminishing return on the number of new configurations generated. Therefore, a default number of moves can be determined for which the best results can be obtained with high probability. The default total number of moves during a run is set to

totalMoves = 1500·N_c^(4/3)

where N_c is the number of cells. In TimberWolf implementations, they set Imax equal to 150 iterations. Therefore:

Nmax = 10·N_c^(4/3)

Note that the range limiter dimensions are actually changed 50 percent of 150 times, or 75 times, during the course of a run (i.e., its dimensions change only during region 2 of the operation of the annealing algorithm).
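As a quick arithmetic check of these defaults (the helper below is purely illustrative):

def default_move_budget(num_cells, i_max=150):
    """totalMoves = 1500 * Nc^(4/3), split evenly over i_max iterations."""
    total_moves = 1500 * num_cells ** (4 / 3)
    n_max = total_moves / i_max              # equals 10 * Nc^(4/3) when i_max = 150
    return round(total_moves), round(n_max)

print(default_move_budget(1000))             # (15000000, 100000)

For a 1000-cell design, this yields 15 million total moves, or 100,000 moves per iteration.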

Because we know that the acceptance rate behavior described in Figure 16.5, along with the default values of Imax and Nmax, yields close to the best possible results for simulated annealing, the algorithm


is forced to strictly obey that acceptance rate behavior through the use of a feedback mechanism. That is, for each iteration I (I varies from 1 to Imax), one can compute the target acceptance rate ρ_I^T as shown in Figure 16.6. To ensure that significant further reductions in the cost are not possible, the target acceptance rate is set to be below 1 percent at the last iteration (Imax).

FIGURE 16.6 Target acceptance rate versus iteration.
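One plausible way to encode the three-region target curve in code is sketched below. Consistent with the measurements described above, region 1 is taken to span the first 15 percent of the run and region 2 the next 50 percent; the exponential forms, the exact breakpoints, and the final rate of 0.5 percent are assumptions.

import math

def target_acceptance_rate(i, i_max=150, r1_end=0.15, r2_end=0.65, final_rate=0.005):
    """Piecewise-exponential target curve (Figure 16.6): decays from 1.0 to
    0.44 in region 1, stays flat at 0.44 in region 2, and decays from 0.44
    toward final_rate (below 1 percent) in region 3."""
    x = i / i_max
    if x <= r1_end:                          # region 1
        return math.exp(math.log(0.44) * x / r1_end)
    if x <= r2_end:                          # region 2
        return 0.44
    frac = (x - r2_end) / (1.0 - r2_end)     # region 3
    return 0.44 * math.exp(math.log(final_rate / 0.44) * frac)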

One can force the actual acceptance rate to track the target acceptance rate by using negative feedback control on the temperature T:

T ← T·(1 − (ρ_I − ρ_I^T)/K)    (16.5)

where K is a damping constant used to stabilize the control of the value of T (in TimberWolf implementations, a very suitable value of K is 40). T is updated every update_limit moves (as defined in the description of our simulated annealing algorithm in Figure 16.7). Note that T can increase as well as decrease as the execution of the algorithm proceeds, and the range limiter window dimensions decrease exponentially as a function of the number of iterations. In Lam's schedule, by contrast, T decreases monotonically but the range limiter window dimensions fluctuate up or down. Clearly, these two parameters are closely related; it is sufficient to dictate the functional form for either one and let the other parameter adapt to monitored conditions.
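In code, the feedback update is a one-liner; this sketch follows the reconstructed form of Equation 16.5 with K = 40, the value used in the TimberWolf implementations.

def update_temperature(T, rho, rho_target, K=40.0):
    """Negative feedback control of T: lower T when the measured acceptance
    rate rho exceeds its target, raise T when it falls below; K damps the
    size of the correction."""
    return T * (1.0 - (rho - rho_target) / K)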

Algorithm simulated_annealing(X0)
2    T ← set_initial_T() /*sufficiently sample configuration space to ascertain the value of T yielding an initial acceptance rate slightly below 100 percent*/
4    while I ≤ Imax do
6        set_range_limiter_size(I) /*sets range limiter window dimensions*/
8        while N ≤ Nmax do
11           if up = update_limit then /*we need to update the temperature T*/
13               if ρ_I < ρ_I^T then …
15               else if ρ_I > ρ_I^T then …
19           if accept(ΔC, T) then …
21       I ← I + 1

FIGURE 16.7 Advanced simulated annealing algorithm.


The heuristic adaptation of Lam's schedule shown in Figure 16.7 showed no difference in placement quality for a given execution time as compared to Lam's original version, and it was adopted as the annealing schedule in TimberWolf. The TimberWolf approach generates a fixed number of moves for a circuit of a given size, and therefore the number of iterations is known a priori.

16.5 COST FUNCTIONS

One of the advantages of the simulated annealing algorithm is its ability to accommodate any cost function; in fact, there are no constraints on the form of the cost function. However, research has shown that the best results are obtained when the terms or variables of the cost function are linearly or logarithmically related. Siarry et al. have "noticed improved convergence toward the correct results when using normalized variables instead of unnormalized real variables range exploration with the same simulated annealing algorithm" [19]. Traditionally, a common cost function for simulated annealing row-based placers is the weighted summation of total half-perimeter wirelength, timing penalty, overlap penalty, row length control penalty, and congestion penalty:

C = β_w·W + β_t·P_t + β_o·P_o + β_r·P_r + β_c·P_c    (16.6)

where

W = Σ_{n=1}^{N_N} [ max_{v_i,v_j∈n} |x_i − x_j| + max_{v_i,v_j∈n} |y_i − y_j| ]

P_t = Σ_{p=1}^{N_P} D_p

P_o = Σ_{k≠l} O_x(k, l)²    (16.10)

P_r = Σ_{r=1}^{N_R} |L(r) − L_d(r)|

P_c = Σ_{m=1}^{N_x} Σ_{n=1}^{N_y} C_g(m, n)

C_g(m, n) = 0,            if d_mn ≤ s_mn
            d_mn − s_mn,  if d_mn > s_mn    (16.13)

The wirelength term W is the summation over all nets, where each net n consists of a set of terminals v_i and (x_i, y_i) is the coordinate of terminal v_i. The constant N_N represents the total number of nets present in the design.
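For example, the term W can be computed directly from terminal coordinates; in this sketch the net representation (a list of (x, y) pairs per net) is our choice.

def half_perimeter_wirelength(nets):
    """Sum over all nets of the half-perimeter of the net's bounding box."""
    total = 0.0
    for net in nets:                        # net: list of terminal coordinates
        xs = [x for x, _ in net]
        ys = [y for _, y in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

print(half_perimeter_wirelength([[(0, 0), (3, 4)], [(1, 1), (2, 5), (4, 2)]]))  # 14.0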

The timing penalty P_t is the summation of all N_P path delays in the circuit. The generalized delay function D_p is a complex function of resistance R, capacitance C, wirelength l of the path, and propagation delay t_g through the circuit. The timing model may utilize lookup tables, Elmore delay calculations, or simple lumped capacitance calculations.


The overlap penalty function O_x(k, l) returns the amount of overlap of cells k and l in the x direction of the row (as we assume horizontal rows). The overlap term is used to ensure a legal placement at the end of annealing, that is, that no two cells overlap in a row or area.

The row length penalty is present to ensure that each row in a standard cell placement is filled to a desired length. The function L(r) returns the length of row r, and the function L_d(r) returns the desired length of row r.

The congestion cost P_c is calculated by overlaying a two-dimensional global bin structure over the design. Global routing is performed on each net by mapping each terminal vertex v_i to its corresponding bin (m, n), collapsing the terminals within a bin, and interconnecting the terminals spanning the bins. Each time a net crosses a bin boundary, the demand for that bin edge is incremented. The total demand for a bin is the sum over all of its bin edges. The geometry of the design determines the routing supply s_mn available for the bin. An overflow occurs if the demand of a bin d_mn exceeds its supply s_mn. The congestion is the sum of overflows over all global routing bins.
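Under these definitions the congestion penalty reduces to summing per-bin overflow; a sketch, assuming demand and supply are given as equally sized 2D arrays indexed [m][n]:

def congestion_penalty(demand, supply):
    """P_c: sum over all global routing bins of max(0, d_mn - s_mn)."""
    total = 0
    for d_row, s_row in zip(demand, supply):
        for d_mn, s_mn in zip(d_row, s_row):
            total += max(0, d_mn - s_mn)    # overflow only when demand exceeds supply
    return total

print(congestion_penalty([[3, 5], [2, 7]], [[4, 4], [4, 4]]))  # 1 + 3 = 4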

Each of the terms of the cost function is multiplied by a scaling factor β_i to balance the relative importance of the terms. To achieve good results over many different circuits and conditions, a feedback mechanism was proposed to control the individual β_i [12]:

β_i^(I+1) = max( 0, β_i^I + (P_i − P_i^T)/P_i^T )    (16.14)

where the scaling factor at the next iteration, β_i^(I+1), is calculated from the current scaling factor β_i^I (at the Ith iteration) and an error term representing the deviation of the penalty P_i from the ideal target penalty P_i^T. While this does help improve the final result and drive the penalty terms to zero, this method does not adequately determine the initial scaling factor β_i^0 and may require a damping factor similar to that of Equation 16.5 to prevent large numerical oscillations of the scaling factor. Furthermore, to achieve satisfactory results, this method requires a significant tuning effort.

Nevertheless, many simulated annealing placers have used cost functions of this general form; in fact, the early versions of TimberWolfSC, the row-based simulated annealing placer, used a cost function of this form [12].

For floorplanning or macrocell placement problems, the overlap penalty becomes two dimensional, and an additional term is sometimes added to minimize the wasted area between cells, known as white space:

P_s = A_C(s) − A_T

where A_C(s) is the total area of the chip including white space and A_T is the sum of all of the cell areas. In this case, the scaling factor has been defined as [4]:

β_s = K_0, if P_s < 1; K_1, otherwise

where K_0 and K_1 are two constants such that K_0 ≪ K_1 in order to ensure feasibility.

However, these straightforward cost functions suffer in that they fail dimensional analysis: the individual terms are not unit compatible. This makes the cost function unfit for general use and susceptible to problems in tuning the weight factors. While one can attempt to optimize the weight factors using a set of benchmark circuits, constant weighting factors preclude optimal solutions over a sufficiently large dynamic range. In addition, the feedback control of these weighting factors becomes more unstable as the dynamic range increases; it becomes increasingly difficult to balance the linear terms against the quadratic terms. Clearly, as problems change in size and topology, the relative attention paid to individual terms of the cost function will vary enormously. Yet most published works on placement use cost functions of the mixed form of Equation 16.6. This is due to an overreliance on benchmarks as a performance measure: benchmarks offered at a given technology node are similar in terms of scale and mask the mixed-unit cost function problem.

One can avoid the problematic mixed-unit cost function by rewriting the cost function in terms of a single dimension, length, making all terms unit compatible. The timing penalty can be rewritten in terms of a path length penalty, where the bounds are given or derived from timing analysis [17]:

P_p = length(p) − upperBound(p),  if length(p) > upperBound(p)
      lowerBound(p) − length(p),  if length(p) < lowerBound(p)
      0,                          otherwise    (16.18)

where the path length is the sum of the wirelengths of the nets along the path:

length(p) = Σ_{∀n∈p} length(n)
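A sketch of this penalty in Python, assuming the per-net lengths along the path are available:

def path_length_penalty(net_lengths, lower_bound, upper_bound):
    """P_p of Equation 16.18: zero inside [lower_bound, upper_bound],
    and the amount of violation outside it."""
    length = sum(net_lengths)               # length(p): sum over nets on path p
    if length > upper_bound:
        return length - upper_bound
    if length < lower_bound:
        return lower_bound - length
    return 0.0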

The cell overlap penalty may be completely eliminated through the use of cell shifting, and the row length control penalty may be eliminated by careful attention to row bounds during new state generation [15]. The floorplanning area term can be rewritten as the square root of the area; this was utilized in TimberWolfMC [18].

The congestion penalty is more challenging, but it can be rewritten in terms of detour length, the additional length needed by a net to avoid a congested area. Kahng and Xu have shown how to calculate the detour length effectively from a congestion map [20]. Sun and Sechen [15] proposed just two terms in their cost function for TimberWolf version 7, whereas the commercial version of TimberWolf (aka InternetCAD itools) uses a strict length-based cost function that combines half-perimeter, timing, and detour costs.

16.6 MOVE STRATEGIES

Most simulated annealing placement algorithms predominantly use two new configuration strategies, or moves: the relocation of a single cell to a new position and the pairwise exchange of cells. Sechen and Lee [12] proposed a bin structure to automatically control the ratio of single cell relocations to pairwise exchanges. Each standard cell row is divided into bins, and the center of each cell is assigned to a bin. A new move is proposed as follows: a cell a is randomly chosen, and a new position is chosen that resides within the range limiter window; the bin corresponding to the new position is calculated. If the bin is empty, a single cell move to this position is performed. Otherwise, a cell b is randomly picked from the cells in the bin, and cells a and b are exchanged, as shown in Figure 16.8.
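A sketch of this move proposal in Python; the bin lookup (bins, bin_of) and the window representation are hypothetical details standing in for a placer's data structures, not specifics from Sechen and Lee's paper.

import random

def propose_move(cells, bins, range_window, bin_of):
    """Pick a random cell and a random position inside the range limiter
    window; relocate if the target bin is empty, else exchange with a
    random occupant of that bin."""
    a = random.choice(cells)
    (x_lo, x_hi), (y_lo, y_hi) = range_window
    new_pos = (random.uniform(x_lo, x_hi), random.uniform(y_lo, y_hi))
    occupants = bins.get(bin_of(new_pos), [])
    if not occupants:
        return ("relocate", a, new_pos)     # single cell move
    b = random.choice(occupants)
    return ("exchange", a, b)               # pairwise exchange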

Although the primary new state strategies are the single cell move and the pairwise exchange of cells, other new state generators have been proposed and adopted. In row-based standard cell placement algorithms, cell orientation changes and exchanges of adjacent cells are common moves at low temperatures. Floorplanning or macrocell placers are augmented with aspect ratio modification and pin optimization moves. Simulated annealing device placement algorithms are further enhanced with transistor folding, diffusion merging, cell grouping, and symmetry operations. Hustin and Sangiovanni-Vincentelli proposed a dynamic and adaptive move strategy that optimizes the amount of work performed at each temperature. They compute a quality factor for each type of move m at a given temperature T [21]:

Q_m^T = ( Σ_{j∈A_m} |ΔC_j| ) / G_m


FIGURE 16.8 Automatic move strategy: within the range limiter window, either a single cell move or a pairwise exchange of cells a and b is selected; the rows are divided into bins.

where
G_m is the number of generated moves of type m
A_m is the subset of accepted moves of type m, that is, A_m ⊆ G_m
|ΔC_j| is the absolute value of the change in cost due to accepted move j

The probability of proposing the move m at a given temperature is then given by

p_m^T = Q_m^T / Σ_m Q_m^T    (16.22)

As you can see, the quality factor, and hence the probability of selecting a move type m, will be high when moves of this type are frequently accepted or when the average change in cost is large at the current temperature. This method discourages small delta-cost moves at high temperatures, where they would have little impact on the progress of exploring the state space, and discourages large delta-cost moves at low temperatures, where such moves would drastically perturb the current state and have little chance of acceptance.
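A sketch of the adaptive selection, using the quality factor above and Equation 16.22; the bookkeeping format for per-move-type statistics is our choice.

def move_probabilities(stats):
    """stats maps move type -> (generated_count, list of |dC| for accepted
    moves). Returns the normalized selection probability per move type."""
    q = {m: sum(abs(dc) for dc in accepted) / max(generated, 1)
         for m, (generated, accepted) in stats.items()}
    total = sum(q.values()) or 1.0          # guard against all-zero qualities
    return {m: q_m / total for m, q_m in q.items()}

print(move_probabilities({
    "relocate": (100, [5.0, 2.0, 3.0]),     # Q = 10/100 = 0.10
    "exchange": (100, [20.0, 10.0]),        # Q = 30/100 = 0.30
}))                                          # {'relocate': 0.25, 'exchange': 0.75}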

Sechen and Lee’s work describes many of the details of implementing a simulated annealing placer It is the basis for many of the advanced works in the field It is available in source code in the SPEC CPU2000 benchmark set [22]

16.7 MULTILEVEL METHODS

To reduce the execution time of simulated annealing placement, multilevel methods were introduced. Mallela and Grover were the first to introduce a two-step annealing process to standard cell placement to reduce runtime [23]. The execution time is reduced by effectively reducing the size of the problem through clustering of the standard cells. First, clusters of cells are formed based on their interconnections; cells that are highly interconnected are placed into the same cluster. The execution time of the clustering algorithm is only a small fraction of the simulated annealing placement time. The clustered netlist is then placed using simulated annealing placement. Because the total number of cells has been reduced, the execution time of the simulated annealing problem is reduced. Next, the clusters are broken up and the original netlist is restored. Then a final low temperature simulated
