Table 14.1 Schedule and binding based on operation fragmentations

Operation                        Fragmentation                   FU binding (cycles 1-3)
I5 = ('000' & I4[8:4]) + I3      –                               ⊕8
N = L × M                        N1 = L × M[11:8]                ⊗8 × 4
R = P + Q                        R1 = P[11:0] + Q[11:0]          ⊕12
                                 R2 = P[23:12] + Q[23:12]        ⊕12
Fig. 14.2 Execution of operation N = L × M in cycle 3
Table 14.2 Main features of the synthesized implementations

                 First implementation      Second implementation           Reduction (%)
Datapath FUs     ⊗12 × 8, ⊗4 × 4, ⊕24      ⊗8 × 4, 2 ⊗4 × 4, ⊕12, 2 ⊕8     –
Circuit area     3,324 inverters           2,298 inverters                 30.86
In the second implementation of this example every datapath FU is used once per cycle to execute one operation of its same width, and therefore the HW waste present in the conventional implementation has been totally removed. The circuit area (including FUs, registers, multiplexers, and controller) and the cycle length have been reduced by around 31% and 7%, respectively, as summarized in Table 14.2.
14.2 Design Techniques to Reduce the HW Waste
The first step to reduce the HW waste present in conventional implementations is the extraction of the common operative kernel of the operations in the behavioural specification. This implies the fragmentation of compatible operations (those that can be executed over FUs of the same type) into their common operative kernel plus some glue logic. These transformations can be applied not only to trivial cases like the compatibility between additions and subtractions, but also to more complex ones like the compatibility between additions and multiplications. The common operative kernel extraction increases the number of operations that can be executed over the same HW resources and, in consequence, helps increase FU reuse.
In addition to the extraction of the common operative kernels, several other advanced design techniques can be applied during the scheduling or allocation phases to further reduce the HW waste. These methods handle specification operations at the bit level, and produce datapaths where the number, type, and width of the HW resources do not depend on the number, type, and width of the specification operations and variables. This independence from the description style used in the specification brings the applicability of HLS closer to inexperienced designers, and offers expert ones more freedom to specify behaviours.
14.2.1 Kernel Extraction of Compatible Operations
Compatible operations can be executed over the same FUs in order to improve HW reuse, provided that their common operative kernel has been extracted in advance. This kernel extraction can be applied to the behavioural specification as a pre-processing phase prior to the synthesis process. This way, additive operations such as subtractions, comparisons, maximums, minimums, absolute values, data format converters, or data limiters can be replaced by additions and logic operations.
More complex operations like multiplication, division, MAC, factorial, or power operations can be decomposed into several multiplications, additions, and some glue logic.
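As an illustration of this kernel extraction for additive operations, the following sketch (Python used here only as executable pseudocode; the helper names are ours, not from the chapter) models a single w-bit adder as the common operative kernel and shows how subtraction, unsigned comparison, and maximum reduce to that kernel plus a small amount of glue logic (a row of inverters, a carry-in, a multiplexer).

```python
def mask(w):
    return (1 << w) - 1

def add_kernel(x, y, cin, w):
    # w-bit adder: the common operative kernel shared by the operations below.
    s = (x & mask(w)) + (y & mask(w)) + cin
    return s & mask(w), s >> w              # w-bit sum, carry out

def sub(a, b, w):
    # a - b rewritten as a + ~b + 1: the glue logic is just a row of inverters.
    return add_kernel(a, ~b, 1, w)          # carry out == 1 means no borrow

def less_than_unsigned(a, b, w):
    # Unsigned comparison reuses the same adder kernel: a < b iff a - b borrows.
    _, carry = add_kernel(a, ~b, 1, w)
    return carry == 0

def maximum(a, b, w):
    # Maximum is the comparison plus a multiplexer.
    return b if less_than_unsigned(a, b, w) else a

assert sub(9, 5, 8) == (4, 1)
assert less_than_unsigned(5, 9, 8) and not less_than_unsigned(9, 5, 8)
assert maximum(5, 9, 8) == 9
```

All three operations can therefore share one adder FU, which is the point of the transformation.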
In order to increase the number of operations that may share one FU, it is also desirable to unify the different representation formats used in the specification. Signed operations can be transformed into several unsigned ones; e.g., a two's complement signed multiplication of m × n bits (with m ≥ n) is transformed, using a variant of the Baugh and Wooley algorithm [1], into one multiplication of (m − 1) × (n − 1) bits and two additions of m and n + 1 bits. Some of the transformations that can be applied to homogenize operation types and data representation formats have been described by Molina et al. [2].
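The sketch below checks the arithmetic identity behind this kind of transformation: a two's complement m × n multiplication rewritten so that the only remaining multiplication is the unsigned (m − 1) × (n − 1) product of the magnitude fields, with the sign bits contributing only shifted correction terms. The exact grouping of those terms into two additions of m and n + 1 bits is specific to the cited Baugh and Wooley variant [1]; the code (ours, Python as executable pseudocode) only verifies the underlying identity, not that particular hardware arrangement.

```python
def signed_mul_via_unsigned(x, y, m, n):
    # x is an m-bit and y an n-bit two's complement value (m >= n).
    # Split each into its unsigned magnitude field and its sign bit.
    xl, xs = x & ((1 << (m - 1)) - 1), (x >> (m - 1)) & 1
    yl, ys = y & ((1 << (n - 1)) - 1), (y >> (n - 1)) & 1
    # Only one multiplication remains: the (m-1) x (n-1) unsigned product xl*yl.
    return (xl * yl
            - (xs << (m - 1)) * yl
            - (ys << (n - 1)) * xl
            + ((xs & ys) << (m + n - 2)))

def to_signed(v, w):
    return v - (1 << w) if v & (1 << (w - 1)) else v

# Exhaustive check for m = 8, n = 6.
m, n = 8, 6
for x in range(1 << m):
    for y in range(1 << n):
        assert signed_mul_via_unsigned(x, y, m, n) == to_signed(x, m) * to_signed(y, n)
```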
14.2.2 Scheduling Techniques
The HW waste present in conventional implementations can be reduced in the scheduling phase if we balance the computational cost of the operations executed per cycle (the number of bits of every different operation type) instead of the number of operations. In order to obtain homogeneous distributions, some specification operations must be transformed into a set of new ones, whose types and widths may be different from the original.
Operation fragments inherit the mobility of the original operation and are scheduled separately, in order to balance the computational cost of the operations executed per cycle. Therefore, one original operation may be executed across a set of cycles, not necessarily consecutive (saving the partial results and carry information calculated in every cycle). This implies that every result bit is available to be used as an input operand in the cycle it is calculated, even if the operation finishes at a later cycle. Also, each fragment can begin its execution once its input operands are available, even if its predecessors finish at a later cycle. It can also occur that the first operation fragment executed is not the one that uses the LSB of the input operands. This feature does not imply that the MSB of the result can be calculated before its LSB in the case of operations with a rippling effect.
The set of different fragmentations to be applied in every case mainly depends on the type and width of the operation. In the following sections a detailed analysis of the different ways to fragment additions and multiplications is presented. Other types of operations could also be fragmented in a similar way. However, their description has been omitted, as most of them can be reduced to multiplications, additions, and logic operations, and the remaining ones are less common in behavioural specifications.
14.2.2.1 Fragmentation of Additions
In order to improve the quality of circuits, some specification additions can be fragmented into several smaller additions and executed across several, not necessarily consecutive, cycles. In this case the new data dependences among operation fragments (propagation of carry signals) oblige the LSB of the operation to be calculated in the earliest cycle and the MSB in the latest one. The application of this design technique provides many advantages in comparison to previous ones, such as chaining, bit-level chaining, multicycle, or non-integer multicycle. Some of them are summarized below:
(a) Area reduction of the required FUs. The execution of the new operation fragments can be performed using a single adder. The adder width must equal the width of the biggest addition fragment, which is always narrower than the original operation length. In general, the routing resources needed to share the adder among all the addition fragments do not cancel out the benefits achieved by the FU reuse. Aside from the remaining additions present in the specification, the best area reduction is obtained when all the fragments have similar widths, due to the HW waste minimization, which is also applicable to the routing resources.
In the particular case that the addition fragments are scheduled in consecutive cycles, this design technique may seem similar to the multicycle technique. However, the main difference resides in the required HW: the width of the multicycle FU must match the operation width, whereas in our case it must only equal the widest fragment.
(b) Area reduction of the required storage units. Just as not all the input operand bits of a fragmented addition are required in the first of its execution cycles, the already summed operand bits are no longer necessary in the later ones. That is, every input operand bit is needed in just one cycle, and it has to be stored only until that cycle. The storage requirements are considerably smaller than when the entire operation is executed at a later cycle, because that implies saving the complete operands during several cycles. In the case of multicycle operators, the input operands are necessary during all the cycles that the execution lasts.
(c) Cycle length reduction. The reduction of the addition widths, a consequence of the operation fragmentations, directly involves a reduction of their delays and, in many cases, also results in considerable cycle length reductions. However, this reduction is not achieved in a forthright manner, because it also depends on both the schedule of the remaining specification operations and the set of operations chained in every cycle.
(d) Chaining of uncompleted operations. Every result fragment calculated can be used as an input operand in the cycle it has been computed in. This allows a successor operation to begin in the same cycle. This way, the calculus of addition fragments is less restrictive than the calculus of complete operations, and therefore extends the design space explored by synthesis algorithms.
Figure 14.3 shows graphically the schedule of one n-bit addition in two inconsecutive cycles. The m LSB are calculated in cycle i, and the remaining ones in cycle i + j. The input operand bits computed in cycle i are not required in the following ones, and thus the only saved operands are A[n−1:m] and B[n−1:m]. A successor operation requiring (A + B)[m−1:0] can begin as soon as cycle i. Although this example corresponds to the most basic fragmentation of one addition into two smaller ones, many others are also possible, including ones that produce a bigger number of new operations.
Fig. 14.3 Execution of one addition in two inconsecutive cycles
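A minimal sketch of this fragmentation, assuming an n-bit unsigned addition split into an m-bit and an (n − m)-bit fragment (Python used as executable pseudocode; function names are ours): the first fragment produces the m LSB of the result plus a one-bit carry saved between cycles, and the second fragment consumes that carry together with the still-stored high operand bits.

```python
def mask(w):
    return (1 << w) - 1

def add_low_fragment(a, b, m):
    # Cycle i: add the m least significant bits of A and B.
    s = (a & mask(m)) + (b & mask(m))
    return s & mask(m), s >> m      # m result bits (usable by successors), saved carry

def add_high_fragment(a, b, carry, m):
    # Cycle i + j: add the remaining bits plus the saved carry.
    # Only A[n-1:m], B[n-1:m] and the carry bit must be stored between the cycles.
    return (a >> m) + (b >> m) + carry

a, b, n, m = 0xBEEF, 0x1234, 16, 8
low, carry = add_low_fragment(a, b, m)
high = add_high_fragment(a, b, carry, m)
assert (high << m) | low == a + b
```

Note that the widest FU required is a max(m, n − m)-bit adder with carry in, rather than an n-bit one, which is the area saving described in item (a) above.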
14.2.2.2 Fragmentation of Multiplications
The multiplications can also be fragmented into smaller operations to be executed in several, not necessarily consecutive, cycles, with the same aim of reducing the circuit area. In this case, two important differences with the addition fragmentation arise:
– The fragmentation of one multiplication produces not only smaller multiplications but also additions, needed to compose the final result from the multiplication fragments.
– The new data dependences among the operation fragments also oblige the LSB to be calculated before the MSB. However, this does not imply that the LSB must be calculated in the first cycle in this case. In fact, the multiplication fragments can be computed in any order, as there are no data dependences among them. Only the addition fragments that produce the result bits of the original multiplication must keep their order, to propagate the carry signals adequately.
The advantages of this design technique are quite similar to the ones mentioned in the section above dealing with additions. For this reason, just the differences between them are cited in the following paragraphs.
(a) Area reduction in the required FUs. The execution of the new operation fragments can be performed using a multiplier and an adder, whose widths must equal the widths of the biggest multiplication and addition fragments, respectively. The total area of these two FUs and of the routing resources required to share them across the execution cycles is usually considerably smaller than the area of one multiplier of the original multiplication width.
(b) Area reduction in the required storage units. Although several multiplication fragments may require the same input operands, these are not needed during every cycle of the execution. Every input operand bit may be available in several cycles, but it only needs to be stored until the last cycle it is used as an input operand. The storage requirements are reduced compared to the execution of the entire operation at a later cycle, or the use of a multicycle multiplier, where the operand bits must be kept stable during all the cycles needed to complete the operation.
Fig. 14.4 Executions of one multiplication in three inconsecutive cycles
Figure 14.4 illustrates two different schedules of one m × n bits multiplication in three inconsecutive cycles. The LSB are calculated in cycle i + j, and the remaining ones in cycle i + j + k. Note that in both cases the multiplication fragment scheduled in cycle i does not produce any result bit. However, if the execution cycles of the fragments assigned to cycles i and i + j were swapped, the schedules would produce result bits in cycles i and i + j + k. In both schedules some of the input operand bits computed in cycle i are not required in cycle i + j, and none of them are required in cycle i + j + k. Although these examples correspond to basic fragmentations of one multiplication into two smaller ones and one addition, other more complex ones (that produce more fragments) are also possible.
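A sketch of this basic case, splitting the n-bit operand of an m × n multiplication into its k LSB and its n − k MSB (Python as executable pseudocode; names are ours): the two multiplication fragments can be computed in separate, possibly non-consecutive cycles, and a final addition composes the result.

```python
def multiply_fragmented(a, b, m, n, k):
    # Fragment an m x n multiplication A * B by splitting B.
    b_lo = b & ((1 << k) - 1)           # k LSB of B
    b_hi = b >> k                       # n - k MSB of B
    p_lo = a * b_lo                     # one cycle, on an m x k multiplier
    p_hi = a * b_hi                     # another cycle, on an m x (n - k) multiplier
    return (p_hi << k) + p_lo           # composition addition (carries ripple LSB first)

assert multiply_fragmented(0xABC, 0xD5, 12, 8, 4) == 0xABC * 0xD5
```

The fragments may run in either order; only the composition additions must respect the carry order.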
14.2.3 Allocation and Binding Techniques
The allocation techniques to reduce the HW waste are focused on increasing the bit-level reuse of FUs. The maximum reuse of functional resources occurs when all the datapath FUs execute an operation of their same type and width in every cycle (no datapath resource is wasted). If the allocation phase takes place after the scheduling, then the maximum degree of FU reuse achievable depends on the given schedule. Note that the maximum bit-level reuse of FUs can only be reached once a totally homogeneous distribution of operation bits among cycles is provided.
In order to get the maximum FU reuse for a given schedule, the specification operations must be fragmented to obtain the same number of operations of equal type and width in every cycle. These fragmentations include the transformation of operations into several simpler ones whose types, representations, and widths may be different from those of the original. These transformations imply that one original operation will be executed distributed over several linked FUs. Its operands will also be transmitted and stored distributed over several routing and storage resources, respectively. The type of fragmentation performed is different for every operation type.
Fig. 14.5 Execution of one addition in two adders, and two additions in one adder
14.2.3.1 Fragmentation of Additions
The minimum HW waste in datapath adders occurs when all of them execute additions of their same width in at least one cycle. For every different schedule there may exist several implementations with the minimum HW waste, and thus similar adder area. The main difference resides in the fragmentation degree of the implementations. These circuits may present some of the following features:
(a) Execution of one addition over several adders. The set of adders used to execute every addition must be linked to propagate the carry signals.
(b) Execution of one addition over a wider adder. This requires the extension of the input operands and the discarding of some result bits. This technique is used to avoid excessive fragmentation in cycles with smaller computational cost.
(c) Execution of several additions in the same adder. The input operands of different operations must be separated with zeros to avoid carry propagation, and the corresponding bits of the adder result discarded. At least one zero must be used to separate every two operations, such that the minimum adder width equals the sum of the addition widths plus the number of operations minus one.
Figure 14.5 shows the execution of one addition over two adders, as well as the execution of two additions over the same adder. These trivial examples provide insight into how these two design techniques can be applied to datapaths with more adders and operations.
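The sketch below illustrates feature (c) for the simplest case of two unsigned additions packed into one adder (Python as executable pseudocode; names are ours). One zero bit between the operand fields absorbs the carry out of the lower addition, so the upper addition is not disturbed, and the adder width is the sum of the addition widths plus one.

```python
def pack_two_additions(a, b, c, d, w2):
    # Execute a + b and c + d (w2-bit operands) in a single wide adder.
    # The zero bit inserted above the lower field keeps the carry of c + d
    # from propagating into the a + b field.
    gap = w2 + 1
    s = ((a << gap) | c) + ((b << gap) | d)   # one wide addition
    return s >> gap, s & ((1 << gap) - 1)     # a + b, and c + d with its carry bit

assert pack_two_additions(200, 100, 150, 200, 8) == (300, 350)
```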
14.2.3.2 Fragmentation of Multiplications
As with additions, the minimum HW waste in datapath multipliers occurs when all of them calculate multiplications of their same width in at least one clock cycle. Some of the desirable features of an implementation with minimum HW waste are:
(a) Execution of one multiplication over several multipliers and adders. The set of multipliers used to execute every multiplication must be connected to adders in some way (either chained or via intermediate registers) in order to sum the partial results and obtain the final one.

Fig. 14.6 Execution of one multiplication over two multipliers and one adder, and two multiplications in one multiplier
(b) Execution of one multiplication over a wider multiplier. This technique is used to avoid excessive fragmentation of multiplications in cycles with smaller computational cost.
(c) Execution of several multiplications in the same multiplier. The input operands of different operations must be separated with several zeros to avoid interferences in the calculus matrix of the different multiplications. As the amount of zeros needed equals the operand widths, this technique seems appropriate only when the computational costs of the multiplications scheduled in every cycle are quite different.
Figure 14.6 shows the execution of one multiplication over two multipliers and one adder, as well as the execution of two multiplications over the same multiplier. Note here that the unification of operations to be executed in the same FU produces an excessive internal waste of the functional resource, which makes this technique unviable in most designs. By contrast, the fragmentation does not produce any additional HW waste, and can be used to reduce the area of circuits with some HW waste in all the cycles.
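For completeness, the sketch below packs two independent multiplications into one multiplier by padding the operands with zeros until the four partial products fall into disjoint bit fields (Python as executable pseudocode; the padding amounts s and t are a conservative choice of ours, not necessarily the exact scheme of the chapter). The required multiplier is far wider than the two original ones, which is precisely the internal waste that makes the technique rarely viable.

```python
def pack_two_multiplications(a, b, c, d, wa, wb, wc, wd):
    # Execute a * b and c * d in a single wide multiplier.  The zero padding
    # keeps the four partial products a*b, a*d, c*b and c*d in disjoint bit
    # fields of the result, so a*b and c*d can be sliced out exactly.
    s = wc + max(wb, wd)                 # zeros below a in the first operand
    t = s + wa + wd                      # zeros below b in the second operand
    p = ((a << s) | c) * ((b << t) | d)  # one wide multiplication
    return p >> (s + t), p & ((1 << (wc + wd)) - 1)   # a * b, c * d

assert pack_two_multiplications(13, 11, 7, 9, 4, 4, 4, 4) == (143, 63)
```

Here two 4 × 4 multiplications end up needing a 12 × 20 multiplier, against two separate 4 × 4 ones.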
14.3 Applications to Scheduling Algorithms
The scheduling techniques proposed to reduce the HW waste can be easily implemented in most HLS algorithms. This requires the common kernel extraction of the specification operations and their successive transformation into several simpler ones to obtain balanced distributions of the number of bits computed per cycle.
This section presents a variant of the classical force-directed scheduling algorithm proposed by Paulin and Knight [4] that includes some bit-level design techniques to reduce the HW waste [3]. The intent of the original method is to minimize the HW cost subject to a given time constraint by balancing the number of operations executed per cycle. For every different type of operation the algorithm successively selects, among all operations and all execution cycles, an (operation, cycle) pair according to an estimate of the circuit cost called force.
By contrast, the intent of the proposed variant is to minimize the HW cost by balancing the number of bits of every different operation type calculated per cycle. This method successively selects, among all operations (multiplications first and additions afterwards) and all execution cycles, a pair formed by an operation (or an operation fragment) and an execution cycle, according to a new force definition that takes into account the widths of operations. It also uses a new parameter, called bound, to decide whether an operation should be either scheduled complete, or fragmented with just one fragment scheduled in the selected cycle.
14.3.1 Force Calculation
Our redefinition of the force measure used in the classical force-directed algorithm does not consider all operations equally. Instead, it gives a different weight to every operation as a function of its computational cost, leading to a more uniform distribution of the computational costs of operations among cycles.
The proposed algorithm calculates the forces using a set of new distribution graphs (DGs) that represent the distribution of the computational cost of operations. For each DG, the distribution in clock cycle c is given by:

DG(τ, c) = ∑_{op ∈ EOP_τ^c} COST(op) · P(op, c)

where

• COST(op): computational cost of operation op. For additions it is defined as the width of the widest operand, and for multiplications as the product of the widths of its operands:

  COST(op) = Max(width(x), width(y))   if op ≡ x + y
  COST(op) = width(x) · width(y)       if op ≡ x × y

• P(op, c): probability of scheduling operation op in cycle c. It is defined as in the classical method.
• EOP_τ^c (estimated operations of type τ in cycle c): set of operations of type τ whose mobility makes their scheduling possible in cycle c.
Except for the DG redefinition, the force associated with an operation-to-cycle binding is calculated as in the classical method. The smaller or more negative the force expression is, the more desirable an operation-to-cycle binding becomes. In each iteration, the algorithm selects the operation and cycle with the lowest force, thus scheduling first the operations with less mobility in the cycles with the smaller sum of computational costs, i.e., with smaller DG(τ, c).
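A compact sketch of how these cost-weighted DGs and forces might be computed (Python as executable pseudocode). It assumes each operation is described by its type, operand widths, and ASAP/ALAP cycles, and uses uniform probabilities inside each mobility frame, as in the classical method; the dictionary keys and the tie-breaking that prefers multiplications are our own illustrative choices, not a definitive implementation.

```python
from collections import defaultdict

def cost(op):
    # COST(op): widest operand for additions, product of widths for multiplications.
    if op["type"] == "+":
        return max(op["widths"])
    return op["widths"][0] * op["widths"][1]

def distribution_graphs(ops):
    # DG(tau, c): sum of COST(op) * P(op, c) over the operations of type tau
    # whose mobility allows them to be scheduled in cycle c.
    dg = defaultdict(float)
    for op in ops:
        frame = range(op["asap"], op["alap"] + 1)
        for c in frame:
            dg[(op["type"], c)] += cost(op) / len(frame)
    return dg

def self_force(op, cycle, dg):
    # Cost-weighted self-force: scheduling op in 'cycle' concentrates its
    # cost distribution (COST(op) * P(op, c)) into that single cycle.
    frame = range(op["asap"], op["alap"] + 1)
    p = 1.0 / len(frame)
    return sum(dg[(op["type"], c)] * cost(op) * ((c == cycle) - p) for c in frame)

def select_binding(ops, dg):
    # Lowest-force (operation, cycle) pair, multiplications before additions.
    candidates = [(op["type"] == "+", self_force(op, c, dg), op, c)
                  for op in ops for c in range(op["asap"], op["alap"] + 1)]
    is_add, force, op, c = min(candidates, key=lambda t: (t[0], t[1]))
    return op, c, force
```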
14.3.2 Bound Definition
Our algorithm binds operations (or fragments) to cycles subject to an upper bound that represents the most uniform distribution of the operation bits of a certain type among cycles reachable at every moment. It is used to decide whether an operation should be fragmented, and how. Its initial value corresponds to the most uniform distribution of the operation bits of type τ among cycles, ignoring data dependencies:

bound = (∑_{op ∈ OP_τ} COST(op)) / λ

where

• OP_τ: set of specification operations of type τ
• λ: circuit latency
The perfect distribution defined by the bound is not always reachable, due to data dependencies among operations and the given time constraints. Therefore, it is updated during the scheduling in the way explained below.
If the number of bits already scheduled in the selected cycle c plus the computational cost of the selected operation op is equal to or lower than the bound, then the operation is scheduled there:

CCS(τ, c) + COST(op) ≤ bound

where

• CCS(τ, c) = ∑_{op ∈ SOP_τ^c} COST(op): computational cost of the operations of type τ already scheduled in cycle c
• SOP_τ^c: set of operations of type τ scheduled in cycle c
Otherwise the operation is fragmented and one fragment is scheduled in the selected cycle (the remaining fragments continue unscheduled). The computational cost of the fragment to be scheduled, CCostFrag, equals the value needed to reach the bound in that cycle:

CCostFrag = bound − CCS(τ, c).
In order to avoid a reduction in the mobility of the predecessors and successors of the fragmented operation, they are also fragmented.
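A small sketch of the bound computation and of the schedule-or-fragment decision (Python as executable pseudocode; the function names and the example figures are ours). Whether any rounding is applied to the bound is not detailed here, so plain division is used.

```python
def compute_bound(type_costs, latency):
    # Most uniform distribution of the bits of one operation type among
    # the cycles, ignoring data dependences: total cost / latency.
    return sum(type_costs) / latency

def bind_to_cycle(op_cost, ccs, bound):
    # ccs = CCS(tau, c): cost of this type already scheduled in the selected cycle.
    if ccs + op_cost <= bound:
        return op_cost, 0                   # schedule the operation complete
    fragment = bound - ccs                  # CCostFrag = bound - CCS(tau, c)
    return fragment, op_cost - fragment     # the rest remains unscheduled

# 96 addition bits over 4 cycles -> bound of 24 bits per cycle; a 16-bit
# addition selected for a cycle that already holds 12 bits is split 12 / 4.
assert compute_bound([32, 24, 24, 16], 4) == 24
assert bind_to_cycle(16, 12, 24) == (12, 4)
```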