Table 10.1 Bit-size of hardware operators

  Operators                            Assignment             Expression
  a < b, a <= b                        max(sa, sb) + 1        max(sa, sb) + 1
  a & b                                min(sr, sa, sb)        min(sa, sb)
  a | b, a ^ b, a == b, a != b         min(sr, max(sa, sb))   max(sa, sb)
  a != 0 || b != 0, a != 0 && b != 0   1                      1
  a << b, a >> b                       max(32, sa)            max(32, sa)
“r=a*b+c;” or not, such as “t[a*b+c];” or “if (a<b)”. These cases are different because the C language enforces integer promotion in expressions, but in the case of an assignment the size of the result is known and can be used to minimize the operator size. In the formulae, sa, sb and sr correspond respectively to the sizes of a, b and the expected result.
Following the C standard is a bit costly, because only the assignments can be optimized. For example, in “short t[10], a, b; t[a+b] = 0;”, the + operator must be 17 bits wide, while four bits would be enough. Using 17 bits is compulsory because otherwise the hardware and the C will not be equivalent. The shift operator case is quite expensive: in the statement “char x=1, y=12; x = x << y;”, the C standard indicates to promote x to 32 bits and to reduce the shift value to y%32 prior to shifting, so x is set to 0, while using an 8 bit shifter would set x to 16. One can work around this problem by asking explicitly for an 8 bit shifter with the statement
10.3.3.2 Path Discovery
The DDP can define the data-path more or less accurately. The nodes (functional and sequential macro-cells, except for the multiplexers and logic gates) are mandatory, but the arcs are optional, so the minimal description of the DDP presented in Fig. 10.7a is:
MODEL GCD(IN instream;OUT outstream) {
DFF x,y;
SUB subst;
}
CGS starts by adding all the missing arcs, such as for instance “instream → subst.a”. In the following, we call the arcs created by CGS added arcs, and those defined in the DDP description user arcs. Note that CGS has an option which disables the addition of arcs, to let one define the data-path very accurately.
As opposed to other high level synthesis compilers that build a path for each expression, CGS must find, for every C expression, the set of paths in the DDP that can realize it. It first searches paths with only user arcs and, if no such path is found, it then searches paths mixing user and added arcs.
Searching paths is quite difficult due to the commutativity, distributivity, associativity, and various equivalences among arithmetic and logic operations. For instance, the expression a + ~b + 1 can be mapped on a subtractor, and the expression (a & 0xFF) + (b & 0xFF) does not require any & operator. Furthermore, in this last example, an adder with 8 bit inputs is enough, independently of sa and sb.
To the best of our knowledge, there is no canonical representation for general arithmetic and logic expressions, identical to what BDDs [2] are for Boolean expressions. In our implementation, logical masks with constant values are replaced by wiring. The path discovery is done by a brute force algorithm which knows the operator properties and some equivalence rules. After a predefined number of trials, the algorithm indicates that it cannot find a path. If a path really exists, the user must help the tool by indicating it explicitly in the C expression. This is done by reordering the operations and adding parentheses.
10.3.3.3 Scheduling Algorithm with Register Bindings
First, as shown in Fig. 10.8, one builds a register transfer flow graph (RFG) [4] from the C statements. It represents a Data Flow Graph in which the binding of the variables on the registers has been performed, thus mixing data dependencies and hardware constraints. In such a graph there are three types of relations [20]: the RaW (read after write) relation is set when the destination node reads a register written by the source node; the WaR (write after read) relation is set when the destination node writes a register read by the source node; the WaW (write after write) relation is set when the destination and source nodes write the same register. All these relations indicate that the destination node must be scheduled at least one cycle later than the source node. Secondly, all the execution paths of each register transfer instruction are computed as explained in Sect. 10.3.3.2.
Then the algorithm schedules the register transfer instructions of the RFG using a kind of list scheduling [18]. At a given cycle, a node of the RFG can be scheduled if all its predecessors have been scheduled in the previous cycles and if there is
(a) R1 ← R10 + R11
(b) R2 ← R1 + 1
(c) R1 ← R10 + R12
(d) R3 ← R1 + 1

Fig. 10.8 Register transfer flow graph: the register transfer instructions (left) and their graph (right), with its RaW, WaR and WaW arcs
one free execution path with all its nodes free. The main objective of the algorithm is to obtain the minimal number of cycles, and then to minimize the data-path area. This last objective matters when there are several free execution paths for an instruction: in this case, the algorithm chooses the path which minimizes the cell bit sizes and/or the multiplexer sizes.
It has been shown in [8] that the WaW relation is superfluous, and that the WaR relations tend to over-constrain the scheduling. So the idea is to start the scheduling using only the true data dependencies (RaW) and to add the WaR constraints during the execution of the algorithm, to ensure the correctness of the computations. This allows more scheduling choices and potentially better solutions.
The algorithm is presented in Algorithm 1. This algorithm may deadlock because adding the WaR arcs during the scheduling may create cycles in the graph, thus leading to a scheduling that is not compatible with the register bindings. These cycles are due to implicit register dependencies. An algorithm that minimizes these dependencies has been devised, but at worst backtracking must be applied, leading to an exponential computation time. A formal complexity analysis of the scheduling problem with register bindings, as we have defined it, has been done in [7]. This work proves that it is NP-complete to decide whether scheduling a given node first will lead to a deadlock or not.
Nevertheless, the algorithm is usable and fast in practice, even on complex inputs, as can be seen in Sect. 10.6.
Algorithm 1 RFG scheduling algorithm
Require: N the set of RFG nodes and R the set of RaW arcs
Ensure: S the set of scheduled nodes
Let Sc be the set of nodes scheduled at cycle c, W the set of arcs of type WaR, c the current scheduling cycle, υ the current node, u a node, s a successor node of υ, w a node that writes into a register, and a(u1,u2) an arc from u1 to u2.
c ← 0
while N ≠ ∅ do
  {Choose the best node that does not create a conflict, using the select function that selects the node with the lowest mobility}
  υ ← select({u ∈ N such that ∀w ∈ (Sc ∪ N), a(w,u) ∉ (R ∪ W)})
  if a node has been chosen for the current cycle then
    Sc ← Sc ∪ {υ}
    N ← N \ {υ}
    for all w ∈ N such that w has the same destination register as υ do
      for all s such that a(υ,s) ∈ R do
        W ← W ∪ {a(s,w)}
      end for
    end for
  else
    c ← c + 1
  end if
end while
10.3.3.4 Binding of the Combinational Operators
If the scheduling aims at minimizing the global execution time of the circuit, the combinational operator binding phase aims at minimizing its area once the scheduling is known.
In the user guided synthesis, the number and kind of operators are known, so the only degrees of freedom concern the number of inputs of the added multiplexers and the bit sizes of the arithmetic and logic operators. Minimizing both has the nice property that it also lowers the operator propagation times.
Under the assumption that there is a multiplexer in front of each combinational operator input, the optimization of the binding phase amounts to minimizing the size of the multiplexers. Each input is connected to a virtual multiplexer with at least one input, and the binding phase chooses, at every cycle, the binding that minimizes the multiplexer cost, computed as the number of inputs times the number of bits. This simple function ranks the solutions correctly, using as cost for the entire data-path the sum of the mux costs.
It has been shown in [4] that a simple exchange of the operands of commutative operators decreases the number of multiplexer inputs by 30% (more elaborate solutions can reach 40% [5] at a higher computational cost). For each control step, the set of possible bindings of the operations and their operands on the physical operators and their inputs is built. Starting from an initial binding, we search in these sets a binding that lowers the cost function, and we apply it. This is repeated until no binding lowers the cost.
10.4 Data-Path Implementation and Analysis
The link of high level synthesis with low level synthesis tools is seldom described in the literature. Synthesis tools most often generate a VHDL standard cell netlist, and the circuit is obtained by placing and routing this netlist. The generated circuit will probably not run at the expected frequency. The main reasons are that the FSM has been constructed with estimated operator and connection delays, and that the FSM is often a Mealy one whose commands may have long delays. Furthermore, it is also possible that the circuit does not run at any frequency if it mixes short and long paths. This happens frequently in circuits having both registers and register files.
Of course, these problems also occur with designs done by hand: in that case the designer solves them by adding states to the FSM and adding buffers to speed up or slow down some paths. This is not easy, and it takes time, but it is possible because he has an intimate knowledge of the design. After high level synthesis, these problems cannot be corrected because the designer has lost this knowledge. From our point of view, this mapping phase is an issue that must be dealt with, and not a minor one, because the generated circuit must run as it comes out of the tool. If it is not the case, the synthesis tool is simply unusable.
In UGH, the mapping is done in three steps:
1. Logic synthesis preparation: The data-path produced by CGS is translated into a synthesizable VHDL description. The data-path is described structurally as an interconnection of UGH macro-cells, and every macro-cell is described as a behavior. Furthermore, a shell script is generated to automatically run the synthesis of each VHDL description, using a standard cell library and giving constraints such as a maximum fan-out for the connectors.
2. Logic synthesis: The execution of this script invokes a logic synthesis tool to generate structural VHDL files respecting the given constraints.
3. Delay extraction: For each macro-cell instantiated in the CGS data-path, we extract the delays from the corresponding VHDL file produced by the logic synthesis. For that, we have the characteristics of the standard cells, and we apply the following rules for computing, in this order, the minimum and maximum propagation times, then the setup and hold times:
tmin_IO    = min_{p ∈ P_IO} Σ_{(ci,co) ∈ p} propmin(ci, co, l_ci, l_co)

tmax_IO    = max_{p ∈ P_IO} Σ_{(ci,co) ∈ p} propmax(ci, co, l_ci, l_co)

tsetup_ICK = max_{(ci,cck) ∈ C} ( tmax_Ici + tsetup_ci,cck − tmin_cck,CK )

thold_ICK  = max_{(ci,cck) ∈ C} ( tmin_Ici + thold_ci,cck − tmax_cck,CK )
In these formulae, I, CK and O represent the macro-cell inputs and outputs; P_xy is the set of paths from port x to port y, a path p being a sequence of couples of ports of the same cell; C is a set of couples of input ports of the same cell having setup and hold times; propmin and propmax are the functions characterizing the standard cells, taking into account the input and output loads (l_ci, l_co).
Of course, this step may be quite long for large data-paths. For this reason, UGH gives the possibility to bypass the mapping during design tuning, using pessimistic estimated delays instead.
Currently, this delay extraction is implemented for the Synopsys tools. Furthermore, even though the backend tools use VHDL, they use different VHDL dialects; this requires adapting the mapping tool to the backend.
10.5 Fine Grain Scheduling
The arrow in the Y chart of Fig. 10.2c represents FGS. It shows that its job is to retime [21] an FSM.
We illustrate the algorithm on a small example. Fig. 10.9 presents the inputs of the Fine Grain Scheduler: (1) a data-path with known electrical characteristics (Fig. 10.9a); (2) the RTL instructions directly extracted from the CG-FSM control-steps
t0: r0 = f(x0, y0)
t1: y  = h(c1, g(y0, r0))
t2: r0 = f(x0, c0)
t3: x  = h(c1, r0)

Fig. 10.9 Inputs of the FGS algorithm: (a) data-path; (b) ordered list of transfers
(Fig. 10.9b); these are called transfers, and their order matters; (3) a running frequency.
FGS deals with the scheduling of basic-blocks. As a reminder, a basic-block is a sequence of RTL instructions without any control statement, except possibly the last one. Furthermore, in the global program, no branch instruction jumps into the basic-block, except at its beginning.
The idea behind FGS is to reorganize the basic-blocks of the CG-FSM, moving instructions from one control-step to either a close control-step or to an added control-step, and then suppressing the useless control-steps.
10.5.1 Definitions
Transfer. A transfer is the motion of data from the outputs of a set of registers to the input of a target register.
A transfer t is represented as a DAG, D_t(V_t, A_t), whose vertices are operations and whose arcs are data dependencies, as realized on the data-path. Fig. 10.10a shows the DAG of the t0 transfer of Fig. 10.9. In this DAG, the rectangles represent the output of the control unit (memorized in the micro-instruction register MIR), and the circles represent functional operations. There are three kinds of vertices:
COP. Concurrent OPerations do not modify the state of the data-path. For instance, changing the selection command of a multiplexer in a control-step only assigns MIR; the next control-step may restore the previous value and so restore the circuit to its previous state. COPs correspond to a value on bit fields of MIR. Two COPs are equivalent if they match the same bit field.
POP. Permanent OPerations always perform the same task and are associated to a single functional resource.
SOP. Sequential OPerations modify the state of the data-path. They perform a memorization operation: once done, the overwritten value is lost. They usually correspond to a data-path register and a bit field of MIR. Two SOPs are equivalent if they match the same bit field.
Fig. 10.10 Transfer DAG and transfer graph: (a) DAG of the t0 transfer; (b) transfer graph of Fig. 10.9 (legend: permanent, concurrent and sequential operations)
A transfer D_t(V_t, A_t) has the following structural properties:
– V_t^source, the set of vertices that have no predecessors; V_t^source only contains COPs and SOPs.
– V_t^sink, the set of vertices that have no successors; |V_t^sink| = 1 and its element is a SOP.
– V_t^operator = V_t − (V_t^source ∪ V_t^sink); all elements of V_t^operator are POPs.
Transfer Graph. A transfer graph is a directed acyclic graph, D(V, A), that represents the set of transfers occurring in the data-path for a given top level FSM transition. The transfer graph is the concatenation of all the transfers of the input list, in the list order (Fig. 10.9b). Each transfer D_t is added to the graph, and the vertices v ∈ V_t^source are merged with the most recently added equivalent vertices. Fig. 10.10b shows the transfer graph resulting from the example of Fig. 10.9.
Characterized Transfer. A characterized vertex is a vertex annotated with delays (see Fig. 10.11a).
A POP vertex has a value associated to each couple of incoming and outgoing arcs of the vertex. These values represent the set of propagation times of the corresponding physical cell.
A COP vertex has only one value, associated to the outgoing arc; it corresponds to the propagation time from the clock to the MIR output bits associated to the COP.
A SOP vertex has two values associated to each incoming arc and one for each outgoing arc. They represent the setup and hold times of the input relative to the clock, and the propagation time from the clock to the output, from the corresponding physical cell.
These values are delays extracted from the physical placed and routed data-path, so wire delays are taken into account.
Fig. 10.11 Characterized vertex: (a) characterized vertex (legend: propagation time, setup time, hold time); (b) characterized DAG of t0
The characterized transfer is obtained by replacing the original transfer vertices by characterized vertices. Figure 10.11b shows the characterized transfer of the transfer presented in Fig. 10.10a. The values of the characterized vertices are graphically represented by the length of the plain arrows.
Characterized Transfer Graph. It is obtained from the transfer graph by replacing the transfers with characterized transfers. Nevertheless, other arcs must be added to correctly model the behavior of the initial transfer sequence. These arcs implement the WaR and WaW precedence relations.
– The RaW relation denotes the usual data dependencies.
– The WaR relation expresses the fact that two equivalent COPs are used with different values. In our example this occurs for S = 0 in the t0 transfer and S = 1 in the t2 transfer: S can be set to 1 only when this will not disturb the t0 transfer. This gives the arc from the r0 hold time to the S = 1 propagation time.
– The WaW relation indicates that two equivalent SOPs are used within two different transfers. In the example, r0 is used simultaneously in t0 and t2. The SOP of t2 must be performed after the SOP of t0, because the same register cannot be loaded twice in the same cycle.
The resulting graph is plotted in Fig. 10.12a, with the previous relations outlined.
10.5.2 Scheduling of a Basic Block
The scheduling rules are:
R1. Load a given register only once in a given cycle.
R2. Loading a register must respect its setup time.
R3. Loading a register must not violate the hold time.
The clock period defines a grid on which the SOPs and the COPs must be snapped. A simple ASAP [19] algorithm, with the constraint that all arcs point downwards (Fig. 10.12b), produces a scheduling that verifies these rules. This
Fig. 10.12 Characterized and scheduled transfer graph: (a) characterized transfer graph; (b) scheduled transfer graph (cycles 0 to 3, with the WaR and WaW arcs outlined)
pointing downwards relation is either combinational, when concurrent operations are involved, or sequential, when a permanent operation is involved.
Rule R1 is enforced by the arcs implementing the WaW relations. Rule R2 is enforced by the RaW relations (data dependencies: the plain arrows). Rule R3 is enforced by the arcs implementing the WaR relations.
This scheduling allows all kinds of chaining, and especially multi-cycle chaining without intermediate memorizations.
The only delays not taken into account are the propagation times from the FSM state register to the data-path. This is not a problem because the control unit is a Moore FSM with a MIR that synchronizes the control signals, and we assume that the delays due to routing capacitances between the MIR and the operator command inputs are similar. This can be ensured by increasing the fan-out of the MIR buffers.
10.5.3 Optimization of the WaR Relations
Fig. 10.13 represents the scheduling of our example using a doubled frequency, compared to the scheduling given in Fig. 10.12b; only the t0 transfer and the beginning of the t2 transfer are represented.
In the FGS implementation, the arcs expressing the WaR relation do not start from the hold time of the SOP but from the hold time minus mt (see Fig. 10.13a), where mt is the minimum propagation time from the COP (S = 0) to the SOP (r0).
Fig. 10.13 Optimized scheduling of the transfer graph: (a) WaR arc shifted by mt; (b) anticipated scheduling of the COP S = 1 (cycles 0 to 2)
When mt is large enough, this allows anticipating the scheduling of the COP (S = 1), as shown in Fig. 10.13b, and so obtaining a better scheduling.
Furthermore, this feature automatically schedules wave pipelining [12], provided that the minimum and maximum propagation times are close.
10.5.4 Scheduling Quality
The list order is used to set the WaR and WaW relations in the characterized transfer graph.
Unfortunately, the list of transfers only gives the data dependence relations (RaW) and thus defines only a partial order on the transfers. Consequently, for a given list of transfers, there are in general several characterized transfer graphs, and as many different valid schedules.
To taxonomize this, we introduce three relations between the transfers:
T_i −→DD T_j (data dependent): SOP_i belongs to V_j^source. It means that T_j uses the result of T_i. It is the classical RaW relation.
T_i ←→SD T_j (sequential dependent): SOP_i and SOP_j are the same, and there is no direct or transitive T_i −→DD T_j relation nor T_j −→DD T_i relation. It means that the same resource is used to store the results of two potentially concurrent transfers.
T_i ←→CD T_j (concurrent dependent): there is a COP element of both V_i^source and V_j^source which selects a different value, and there is no direct or transitive T_i −→DD T_j relation nor T_j −→DD T_i relation. It means that the same functional operator is used in both transfers but performs a different function in each transfer.
These relations allow the definition of three transfer graph classes.
Sequential-ordered transfer graph: this is the initial data, with all the −→DD, ←→SD and ←→CD relations,