Table 10.1 Bit-size of hardware operators

  Operators                            Assignment             Expression
  a < b, a <= b                        max(sa, sb) + 1        max(sa, sb) + 1
  a & b                                min(sr, sa, sb)        min(sa, sb)
  a | b, a ^ b, a == b, a != b         min(sr, max(sa, sb))   max(sa, sb)
  a != 0 || b != 0, a != 0 && b != 0   1                      1
  a << b, a >> b                       max(32, sa)            max(32, sa)
“r=a*b+c;” or not, such as “t[a*b+c];” or “if (a<b)”. These cases are different because the C language enforces integer promotion in expressions, but in the case of an assignment the size of the result is known and can be used to minimize the operator size. In the formulae, sa, sb and sr correspond respectively to the sizes of a, b and the expected result.
Following the C standard is a bit costly, because only the assignments can be optimized. For example, in “short t[10], a, b; t[a+b] = 0;”, the + operator must be 17 bits wide, while four bits would be enough. Using 17 bits is compulsory because otherwise the hardware and the C will not be equivalent. The shift operator case is quite expensive: in the statement “char x=1, y=12; x = x << y;”, the C standard indicates to promote x to 32 bits and to reduce the shift value to y%32 prior to shifting, so x is set to 0, while using an 8 bit shifter would set x to 16. One can work around this problem by asking explicitly for an 8 bit shifter with the statement
10.3.3.2 Path Discovery
The DDP can define the data-path more or less accurately. The nodes (functional and sequential macro-cells, except for the multiplexers and logic gates) are mandatory, but the arcs are optional, so the minimal description of the DDP presented in Fig. 10.7a is:
MODEL GCD(IN instream;OUT outstream) {
DFF x,y;
SUB subst;
}
CGS starts by adding all the missing arcs, such as for instance “instream → subst.a”. In the following, we call the arcs created by CGS added arcs, and those defined in the DDP description user arcs. Note that CGS has an option which disables the addition of arcs, to let one define the data-path very accurately.
As opposed to other high level synthesis compilers that build a path for each expression, CGS must find, for every C expression, the set of paths in the DDP that can realize it. It first searches paths with only user arcs and, if no such path is found, it then searches paths mixing user and added arcs.
Searching paths is quite difficult due to the commutativity, distributivity, associativity, and various equivalences among arithmetic and logic operations. For instance, the expression a + ~b + 1 can be mapped on a subtractor, and the expression (a & 0xFF) + (b & 0xFF) does not require any & operator. Furthermore, in this last example, an adder with 8 bit inputs is enough, independently of sa and sb.
To the best of our knowledge, there is no canonical representation for general arithmetic and logic expressions, identical to what BDDs [2] are for Boolean expressions. In our implementation, logical masks with constant values are replaced by wiring. The path discovery is done by a brute force algorithm which knows the operator properties and some equivalence rules. After a predefined number of trials, the algorithm indicates that it cannot find a path. If a path really exists, the user must help the tool by indicating it explicitly in the C expression. This is done by reordering the operations and adding parentheses.
10.3.3.3 Scheduling Algorithm with Register Bindings
First, as shown in Fig. 10.8, one builds a register transfer flow graph (RFG) [4] from the C statements. It represents a Data Flow Graph in which the binding of the variables on the registers has been performed, thus mixing data dependencies and hardware constraints. In such a graph there are three types of relations [20]: the RaW (read after write) relation is set when the destination node reads a register written by the source node; the WaR (write after read) relation is set when the destination node writes a register read by the source node; the WaW (write after write) relation is set when the destination and source nodes write the same register. All these relations indicate that the destination node must be scheduled at least one cycle later than the source node. Secondly, all the execution paths of each register transfer instruction are computed as explained in Sect. 10.3.3.2.
Then the algorithm schedules the register transfer instructions of the RFG using a kind of list scheduling [18]. At a given cycle, a node of the RFG can be scheduled if all its predecessors have been scheduled in the previous cycles and if there is
(a) R1 ← R10 + R11
(b) R2 ← R1 + 1
(c) R1 ← R10 + R12
(d) R3 ← R1 + 1

Fig. 10.8 Register transfer flow graph: the register transfer instructions (left) and their graph (right), with its RaW, WaR and WaW arcs
one free execution path with all its nodes free. The main objective of the algorithm is to obtain the minimal number of cycles, and then to minimize the data-path area. This last objective matters when there are several free execution paths for an instruction: in this case, the algorithm chooses the path which minimizes the cell bit sizes and/or the multiplexer sizes.
It has been shown in [8] that the WaW relation is superfluous, and that the WaR relations tend to over-constrain the scheduling. So the idea is to start the scheduling using only the true data dependencies (RaW) and to add the WaR constraints during the execution of the algorithm, to ensure the correctness of the computations. This allows more scheduling choices and potentially better solutions.
The algorithm is presented in Algorithm 1. This algorithm may deadlock because adding the WaR arcs during the scheduling may create cycles in the graph, thus leading to a scheduling that is not compatible with the register bindings. These cycles are due to implicit register dependencies. An algorithm that minimizes these dependencies has been devised, but at worst backtracking must be applied, leading to an exponential computation time. A formal complexity analysis of the scheduling problem with register bindings, as we have defined it, has been done in [7]. This work proves that it is NP-complete to decide whether scheduling a given node first will lead to a deadlock or not.
Nevertheless, the algorithm is usable and fast in practice, even on complex inputs, as can be seen in Sect. 10.6.
Algorithm 1 RFG scheduling algorithm
Require: N the set of RFG nodes and R the set of RaW arcs
Ensure: S the set of scheduled nodes
Let Sc be the set of nodes scheduled at cycle c, W the set of arcs of type WaR, c the current scheduling cycle, υ the current node, u a node, s a successor node of υ, w a node that writes into a register, and a(u1,u2) an arc from u1 to u2.
c ← 0
while N ≠ ∅ do
  {Choose the best node that does not create a conflict, using the select function that selects the node with the lowest mobility}
  υ ← select({u ∈ N such that ∀w ∈ (Sc ∪ N), a(w,u) ∉ (R ∪ W)})
  if a node has been chosen for the current cycle then
    Sc ← Sc ∪ {υ}
    N ← N \ {υ}
    for all w ∈ N such that w has the same destination register as υ do
      for all s such that a(υ,s) ∈ R do
        W ← W ∪ {a(s,w)}
      end for
    end for
  else
    c ← c + 1
  end if
end while
10.3.3.4 Binding of the Combinational Operators
If the scheduling aims at minimizing the global execution time of the circuit, the combinational operator binding phase aims at minimizing its area once the scheduling is known.
In the user guided synthesis, the number and kind of operators are known, so the only degrees of freedom concern the number of inputs of the added multiplexers and the bit sizes of the arithmetic and logic operators. Minimizing both has the nice property that it also lowers the operator propagation times.
Under the assumption that there is a multiplexer in front of each combinational operator input, the optimization of the binding phase amounts to minimizing the size of the multiplexers. Each input is connected to a virtual multiplexer with at least one input, and the binding phase chooses, at every cycle, the binding that minimizes the multiplexer cost, computed as the number of inputs times the number of bits. This simple function ranks the solutions correctly, using as cost for the entire data-path the sum of the mux costs.
It has been shown in [4] that a simple exchange of the operands of commutative operators decreases the number of multiplexer inputs by 30% (more elaborate solutions can reach 40% [5] at a higher computational cost). For each control step, the set of possible bindings of the operations and their operands on the physical operators and their inputs is built. Starting from an initial binding, we search in these sets a binding that lowers the cost function, and we apply it. This is repeated until no binding lowers the cost.
10.4 Data-Path Implementation and Analysis
The link of high level synthesis with low level synthesis tools is seldom described in the literature. Synthesis tools most often generate a VHDL standard cell netlist, and the circuit is obtained by placing and routing this netlist. The generated circuit will probably not run at the expected frequency. The main reasons are that the FSM has been constructed with estimated operator and connection delays, and that the FSM is often a Mealy one whose commands may have long delays. Furthermore, it is also possible that the circuit does not run at any frequency if it mixes short and long paths. This happens frequently in circuits having both registers and register files.
Of course, these problems also occur with designs done by hand: in that case the designer solves them by adding states to the FSM and adding buffers to speed up or slow down some paths. This is not easy, and it takes time, but it is possible because he has an intimate knowledge of the design. After high level synthesis, these problems cannot be corrected because the designer has lost this knowledge. From our point of view, this mapping phase is an issue that must be dealt with, and not a minor one, because the generated circuit must run as it comes out of the tool. If it is not the case, the synthesis tool is simply unusable.
In UGH, the mapping is done in three steps:
1. Logic synthesis preparation: The data-path produced by CGS is translated into a synthesizable VHDL description. The data-path is described structurally as an interconnection of UGH macro-cells, and every macro-cell is described as a behavior. Furthermore, a shell script is generated to automatically run the synthesis of each VHDL description, using a standard cell library and giving constraints such as a maximum fan-out for the connectors.
2. Logic synthesis: The execution of this script invokes a logic synthesis tool to generate structural VHDL files respecting the given constraints.
3. Delay extraction: For each macro-cell instantiated in the CGS data-path, we extract the delays from the corresponding VHDL file produced by the logic synthesis. For that, we have the characteristics of the standard cells, and we apply the following rules for computing, in this order, the minimum and maximum propagation times, then the setup and hold times:
tmin_IO    = min_{p ∈ P_IO} Σ_{(ci,co) ∈ p} propmin(ci, co, l_ci, l_co)

tmax_IO    = max_{p ∈ P_IO} Σ_{(ci,co) ∈ p} propmax(ci, co, l_ci, l_co)

tsetup_ICK = max_{(ci,cck) ∈ C} ( tmax_Ici + tsetup_ci,cck − tmin_cck,CK )

thold_ICK  = max_{(ci,cck) ∈ C} ( tmin_Ici + thold_ci,cck − tmax_cck,CK )
In these formulae, I, CK and O represent the macro-cell inputs and outputs; P_xy is the set of paths from port x to port y, a path p being a sequence of couples of ports of the same cell; C is a set of couples of input ports of the same cell having setup and hold times; propmin and propmax are the functions characterizing the standard cells, taking into account the input and output loads (l_ci, l_co).
Of course, this step may be quite long for large data-paths. For this reason, UGH gives the possibility to bypass the mapping during design tuning, using pessimistic estimated delays instead.
Currently, this delay extraction is implemented for the Synopsys tools. Furthermore, even though the backend tools use VHDL, they use different VHDL dialects; this requires adapting the mapping tool to the backend.
10.5 Fine Grain Scheduling
The arrow in the Y chart of Fig. 10.2c represents FGS. It shows that its job is to retime [21] an FSM.
We illustrate the algorithm on a small example. Fig. 10.9 presents the inputs of the Fine Grain Scheduler: (1) a data-path with known electrical characteristics (Fig. 10.9a); (2) the RTL instructions directly extracted from the CG-FSM control-steps
t0: r0 = f(x0, y0)
t1: y  = h(c1, g(y0, r0))
t2: r0 = f(x0, c0)
t3: x  = h(c1, r0)

Fig. 10.9 Inputs of the FGS algorithm: (a) data-path; (b) ordered list of transfers
(Fig. 10.9b); these are called transfers, and their order matters; (3) a running frequency.
FGS deals with the scheduling of basic-blocks. As a reminder, a basic-block is a sequence of RTL instructions without any control statement, except possibly the last one. Furthermore, in the global program, no branch instruction jumps into the basic-block, except at its beginning.
The idea behind FGS is to reorganize the basic-blocks of the CG-FSM, moving instructions from one control-step to either a close control-step or to an added control-step, and then suppressing the useless control-steps.
10.5.1 Definitions
Transfer. A transfer is the motion of data from the outputs of a set of registers to the input of a target register.
A transfer t is represented as a DAG, D_t(V_t, A_t), whose vertices are operations and whose arcs are data dependencies, as realized on the data-path. Fig. 10.10a shows the DAG of the t0 transfer of Fig. 10.9. In this DAG, the rectangles represent the output of the control unit (memorized in the micro-instruction register MIR), and the circles represent functional operations. There are three kinds of vertices:
COP. Concurrent OPerations do not modify the state of the data-path. For instance, changing the selection command of a multiplexer in a control-step only assigns MIR; the next control-step may restore the previous value and so restore the circuit to its previous state. COPs correspond to a value on bit fields of MIR. Two COPs are equivalent if they match the same bit field.
POP. Permanent OPerations always perform the same task and are associated to a single functional resource.
SOP. Sequential OPerations modify the state of the data-path. They perform a memorization operation: once done, the overwritten value is lost. They usually correspond to a data-path register and a bit field of MIR. Two SOPs are equivalent if they match the same bit field.
Fig. 10.10 Transfer DAG and transfer graph: (a) DAG of the t0 transfer; (b) transfer graph of Fig. 10.9 (legend: permanent, concurrent and sequential operations)
A transfer D_t(V_t, A_t) has the following structural properties:
– V_t^source, the set of vertices that have no predecessors; V_t^source only contains COPs and SOPs.
– V_t^sink, the set of vertices that have no successors; |V_t^sink| = 1 and its element is a SOP.
– V_t^operator = V_t − (V_t^source ∪ V_t^sink); all elements of V_t^operator are POPs.
Transfer Graph. A transfer graph is a directed acyclic graph, D(V, A), that represents the set of transfers occurring in the data-path for a given top level FSM transition. The transfer graph is the concatenation of all the transfers of the input list, in the list order (Fig. 10.9b). Each transfer D_t is added to the graph, and the vertices v ∈ V_t^source are merged with the most recently added equivalent vertices. Fig. 10.10b shows the transfer graph resulting from the example of Fig. 10.9.
Characterized Transfer. A characterized vertex is a vertex annotated with delays (see Fig. 10.11a).
A POP vertex has a value associated to each couple of incoming and outgoing arcs of the vertex. These values represent the set of propagation times of the corresponding physical cell.
A COP vertex has only one value, associated to the outgoing arc; it corresponds to the propagation time from the clock to the MIR output bits associated to the COP.
A SOP vertex has two values associated to each incoming arc and one for each outgoing arc. They represent the setup and hold times of the input relative to the clock, and the propagation time from the clock to the output, from the corresponding physical cell.
These values are delays extracted from the physical placed and routed data-path, so wire delays are taken into account.
Fig. 10.11 Characterized vertex: (a) characterized vertex (legend: propagation time, setup time, hold time); (b) characterized DAG of t0
The characterized transfer is obtained by replacing the original transfer vertices by characterized vertices. Figure 10.11b shows the characterized transfer of the transfer presented in Fig. 10.10a. The values of the characterized vertices are graphically represented by the length of the plain arrows.
Characterized Transfer Graph. It is obtained from the transfer graph by replacing the transfers with characterized transfers. Nevertheless, other arcs must be added to correctly model the behavior of the initial transfer sequence. These arcs implement the WaR and WaW precedence relations.
– The RaW relation denotes the usual data dependencies.
– The WaR relation expresses the fact that two equivalent COPs are used with different values. In our example this occurs for S = 0 in the t0 transfer and S = 1 in the t2 transfer: S can be set to 1 only when this will not disturb the t0 transfer. This gives the arc from the r0 hold time to the S = 1 propagation time.
– The WaW relation indicates that two equivalent SOPs are used within two different transfers. In the example, r0 is used simultaneously in t0 and t2. The SOP of t2 must be performed after the SOP of t0, because the same register cannot be loaded twice in the same cycle.
The resulting graph is plotted in Fig. 10.12a, with the previous relations outlined.
10.5.2 Scheduling of a Basic Block
The scheduling rules are:
R1. Load a given register only once in a given cycle.
R2. Loading a register must respect its setup time.
R3. Loading a register must not violate the hold time.
The clock period defines a grid on which the SOPs and the COPs must be snapped. A simple ASAP [19] algorithm, with the constraint that all arcs point downwards (Fig. 10.12b), produces a scheduling that verifies these rules. This
Fig. 10.12 Characterized and scheduled transfer graph: (a) characterized transfer graph; (b) scheduled transfer graph (cycles 0 to 3, with the WaR and WaW arcs outlined)
pointing downwards relation is either combinational, when concurrent operations are involved, or sequential, when a permanent operation is involved.
Rule R1 is enforced by the arcs implementing the WaW relations. Rule R2 is enforced by the RaW relations (data dependencies: the plain arrows). Rule R3 is enforced by the arcs implementing the WaR relations.
This scheduling allows all kinds of chaining, and especially multi-cycle chaining without intermediate memorizations.
The only delays not taken into account are the propagation times from the FSM state register to the data-path. This is not a problem because the control unit is a Moore FSM with a MIR that synchronizes the control signals, and we assume that the delays due to routing capacitances between the MIR and the operator command inputs are similar. This can be ensured by increasing the fan-out of the MIR buffers.
10.5.3 Optimization of the WaR Relations
Fig. 10.13 represents the scheduling of our example using a doubled frequency, compared to the scheduling given in Fig. 10.12b; only the t0 transfer and the beginning of the t2 transfer are represented.
In the FGS implementation, the arcs expressing the WaR relation do not start from the hold time of the SOP but from the hold time minus mt (see Fig. 10.13a), where mt is the minimum propagation time from the COP (S = 0) to the SOP (r0).
Fig. 10.13 Optimized scheduling of the transfer graph: (a) WaR arc shifted by mt; (b) anticipated scheduling of the COP S = 1 (cycles 0 to 2)
When mt is large enough, this allows anticipating the scheduling of the COP (S = 1), as shown in Fig. 10.13b, and so obtaining a better scheduling.
Furthermore, this feature automatically schedules wave pipelining [12], provided that the minimum and maximum propagation times are close.
10.5.4 Scheduling Quality
The list order is used to set the WaR and WaW relations in the characterized transfer graph.
Unfortunately, the list of transfers only gives the data dependence relations (RaW) and thus defines only a partial order on the transfers. Consequently, for a given list of transfers, there are in general several characterized transfer graphs, and as many different valid schedules.
To taxonomize this, we introduce three relations between the transfers:
T_i −→DD T_j (data dependent): SOP_i belongs to V_j^source. It means that T_j uses the result of T_i. It is the classical RaW relation.
T_i ←→SD T_j (sequential dependent): SOP_i and SOP_j are the same, and there is no direct or transitive T_i −→DD T_j relation nor T_j −→DD T_i relation. It means that the same resource is used to store the results of two potentially concurrent transfers.
T_i ←→CD T_j (concurrent dependent): there is a COP element of both V_i^source and V_j^source which selects a different value, and there is no direct or transitive T_i −→DD T_j relation nor T_j −→DD T_i relation. It means that the same functional operator is used in both transfers but performs a different function in each transfer.
These relations allow the definition of three transfer graph classes.
Sequential-ordered transfer graph: this is the initial data, with all the −→DD, ←→SD and ←→CD relations,