High Level Synthesis: From Algorithm to Digital Circuit, Part 18



158 P Coussy et al.

Fig 9.9 Operator area vs. sizing approaches: Max(8, 4, 3, 9) yields 40 slices, Max(in1, in2) 34 slices, and Best(in1, in2) 24 slices.

9.3.2.5 Storage Element Optimization

Because there is currently no feedback loop in the design flow, register optimization has to be done during the design of the processing unit. The choice of the location of an unconstrained variable (the user can define the location of variables), in a register or in a memory, has to be made according to the minimization of two contradictory cost criteria:

• The cost of a register is higher than the cost of a memory point.
• The cost to access data in a register is lower than the cost to access data in memory (because of the need to compute the address).

Two criteria are used to choose the storage location of the data:

• A variable whose lifetime is shorter than a locality threshold is stored in a register.
• The storage location depends on the class of the variable.

Data are classified into three categories:

• Temporary processing data (declared or undeclared).
• Constant data (read-only).
• Ageing data (which express the recursivity of the algorithm to be synthesized, via their assignment after having been used).

The optimal storage of a given data element depends upon its declaration and its lifetime. It can be stored either in a memory bank of the MEMU or in a storage element of the processing unit PU. The remaining difficulty lies in selecting an optimal locality threshold that minimizes the cost of the storage unit. The synthesis tool leaves the choice of the locality threshold value up to the user. To help the designer, GAUT provides a histogram of the lifetimes of the variables, normalized by the utilization frequency, which is computed from the scheduled DFG.
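The locality-threshold decision above can be sketched as follows; the variable names, lifetimes, and threshold value are purely illustrative and do not reflect GAUT's actual interface:

```python
# Hypothetical sketch: choosing register vs. memory storage by lifetime.
# A variable whose lifetime is below the locality threshold goes to a register.

def choose_location(lifetime_cycles, locality_threshold):
    """Store short-lived variables in registers, long-lived ones in memory."""
    return "register" if lifetime_cycles < locality_threshold else "memory"

variables = {"tmp0": 3, "coef": 120, "acc": 7}  # name -> lifetime in cycles
placement = {v: choose_location(t, locality_threshold=10)
             for v, t in variables.items()}
print(placement)  # {'tmp0': 'register', 'coef': 'memory', 'acc': 'register'}
```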

The architecture of the processing unit is composed of a processing part, a memory part (i.e., a memory plan), and the associated control state machine FSM (Fig 9.1). The memory part of the datapath is based on a set of strong-semantic memories (FIFO, LIFO) and/or registers. Spatial adaptation is performed by an interconnection logic dealing with data dispatching from operators to storage elements, and from storage elements to operators. Timing adaptation (data rates, different input/output data scheduling) is realized by the storage elements. Once the location of data has been decided, the synthesis of the storage elements located in


9 GAUT: A High-Level Synthesis Tool for DSP Applications 159

Fig 9.10 Four-step flow: RCG construction, binding, optimization, generation.

Fig 9.11 Resource compatibility graph (data vertices with edges tagged R, F, or L).

the PU is done. This design step inputs the data lifetimes resulting from the scheduling step and the spatial information resulting from the binding step of the DFG. The spatial information is the source and destination of each data. First, we formalize both the timing relationships between data (thanks to the data lifetimes) and the spatial information through a Resource Compatibility Graph (RCG). This formal model is then used to explore the design space. We refer to the timing relationships and spatial information as communication constraints.

This synthesis task is based on a four-step flow: (1) Resource Compatibility Graph (RCG) construction, (2) storage resource binding, (3) architecture optimization, and (4) VHDL RTL generation (see Fig 9.10). During the first step of the component generation, a Resource Compatibility Graph is generated from the communication constraints. The analysis of this formal model allows both the binding of data to storage elements (queue, stack, or register) and the sizing of each storage element. This first architecture is then optimized by merging storage elements that have non-overlapping usage time frames.

Formal model: In order to explore the design space of such a component, the first step consists in generating a Resource Compatibility Graph from the communication constraints. This RCG specifies, through formal modeling, the timing relationships between the data that have to be handled by the datapath architecture. The vertex set V = {v0, ..., vn} represents the data; the edge set E = {(vi, vj)} represents the compatibility between the data. A tag tij ∈ T is associated with each edge (vi, vj). This tag represents the compatibility type between the two data i and j, with T = {Register R, FIFO F, LIFO L} (e.g., Fig 9.11).


In order to assign compatibility tags to edges, we need to identify the timing relationship that exists between two data. For this purpose we defined a set of rules based on the functional properties of each storage element (FIFO, LIFO, register). The lifetime of a data a is defined by Γ(a) = [τmin(a), τmax(a)], where τmin(a) and τmax(a) are respectively the date of the write access of a into the storage element and the date of the last read access to a. τfirst(a) is the first read access to a, and τRi(a) is the i-th read access to a, with first ≤ i ≤ max.

Rule 1: Register compatibility. If (τmin(b) ≥ τmax(a)), then we create a "Register"-tagged edge.

Rule 2: FIFO compatibility. If (τmin(b) > τmin(a)) and (τfirst(b) > τmax(a)) and (τmin(b) < τmax(a)), then we create a "FIFO"-tagged edge.

Rule 3: LIFO compatibility. If [(τmin(b) > τmin(a)) and (τfirst(a) > τmax(b))] or [τRi(a) < τmin(b) < τmax(b) < τRi+1(a)], then we create a "LIFO"-tagged edge.

Rule 4: Otherwise, no edge (no compatibility).

An analysis of the communication constraints enables the RCG generation. The graph construction supposes edge creation between data respecting a chronological order (τmin). If n is the number of data to be handled, the graph may contain up to n(n − 1)/2 edges, i.e., O(n^2).
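Rules 1-3 can be sketched as a small tagging function. The lifetime representation (a dict holding the write date τmin, last read τmax, and first read τfirst) and the example values are assumptions for illustration, and the second clause of Rule 3 (interleaved reads) is omitted for brevity:

```python
# Illustrative edge-tagging for the RCG, following Rules 1-3 above.
# Each datum is described by its write date, last read date, and first read date.

def compatibility(a, b):
    """Return the tag for edge (a, b), with a written no later than b, or None."""
    if b["tmin"] >= a["tmax"]:
        return "R"     # Rule 1: disjoint lifetimes -> Register compatible
    if a["tmin"] < b["tmin"] < a["tmax"] and b["tfirst"] > a["tmax"]:
        return "F"     # Rule 2: overlapping, reads follow write order -> FIFO
    if b["tmin"] > a["tmin"] and a["tfirst"] > b["tmax"]:
        return "L"     # Rule 3 (first clause): nested lifetimes -> LIFO
    return None        # Rule 4: no compatibility, no edge

a = {"tmin": 0, "tmax": 10, "tfirst": 8}
b = {"tmin": 2, "tmax": 6,  "tfirst": 4}   # b nested inside a's lifetime
print(compatibility(a, b))  # 'L'
```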

Storage element binding: The second step consists in binding storage elements to the data, thanks to the timing relations modeled by the RCG.

Resource identification: The aim is to identify and to bind as many FIFO and LIFO structures as possible on the RCG.

Theorem 1. If a is FIFO compatible with b and b is FIFO compatible with c, then a is transitively FIFO (or Register) compatible with c.

As a consequence of Theorem 1, a FIFO-compatible data path PF is by construction equivalent to a FIFO compatibility clique (i.e., the data of the PF path can be stored in the same FIFO).

Theorem 2. If a is LIFO compatible with b and b is LIFO compatible with c, then a is transitively LIFO compatible with c.

As a consequence of Theorem 2, a LIFO-compatible data path PL is by construction equivalent to a LIFO compatibility clique (i.e., the data of the PL path can be stored in the same LIFO).

Resource sizing: The size of a LIFO structure equals the maximum number of data stored by a LIFO-compatible data path. So we have to identify the longest LIFO compatibility path PL in a LIFO compatibility tree; the number of vertices in PL, the longest LIFO path in the tree, equals the maximum number of data that can be stored in it.
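A minimal sketch of this sizing step, assuming the LIFO compatibility tree is given as an adjacency mapping (a hypothetical representation, not GAUT's internal one):

```python
# Sketch: sizing a LIFO as the vertex count of the longest path in the
# LIFO compatibility tree. The tree below is illustrative.

def longest_path_length(tree, node):
    """Number of vertices on the longest path from `node` down to a leaf."""
    children = tree.get(node, [])
    if not children:
        return 1
    return 1 + max(longest_path_length(tree, c) for c in children)

lifo_tree = {"a": ["b"], "b": ["c", "d"], "d": ["e"]}  # longest path: a-b-d-e
print(longest_path_length(lifo_tree, "a"))  # 4 -> a 4-stage LIFO suffices
```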


Fig 9.12 A possible binding for the graph: (a) resulting hierarchical graph (e.g., data a, b and f merged into a three-stage FIFO, e and c into a two-stage FIFO); (b) resulting constraints (structure lifetimes over time).

The size of a FIFO is the maximum number of data (of the considered path) stored at the same time in the structure. In fact, the aim is to count the maximum number of overlapping data (respecting the I/O constraints) in the selected path P. These sizes can be easily extracted from our formal model.
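Counting the maximum number of overlapping data on a path is a classic sweep over interval endpoints; the lifetime pairs below are illustrative, not taken from the chapter's example:

```python
# Sketch: FIFO depth = maximum number of simultaneously stored data on a path.
# Each lifetime is a (write date, last read date) pair.

def fifo_depth(lifetimes):
    """Max number of overlapping [write, last-read] intervals."""
    events = []
    for tmin, tmax in lifetimes:
        events.append((tmin, 1))   # datum enters the FIFO at its write date
        events.append((tmax, -1))  # datum leaves after its last read
    depth = current = 0
    # Sorting puts a departure (-1) before an arrival (+1) at the same date,
    # so a slot freed at time t can be reused at time t.
    for _, delta in sorted(events):
        current += delta
        depth = max(depth, current)
    return depth

print(fifo_depth([(0, 5), (2, 8), (6, 9)]))  # 2
```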

Resource binding: Our greedy algorithm is based on user-provided metrics (minimal amount of data to use a FIFO or a LIFO, average use factor, FIFO/LIFO usage priority factor, etc.) to bind as many FIFO and LIFO structures as possible on the RCG. A two-step flow is used: (1) identification of the best structure, (2) merging of all the concerned data into a hierarchical node.

Each node represents a storage element, as shown in Fig 9.12a (e.g., data a, b and f are merged into a three-stage FIFO). We say hierarchical node because merging a set of data into a given node supposes adding information that will be useful during the optimization step: the lifetime of this structure (i.e., the time interval during which this structure will be used, e.g., Fig 9.12b).

Let P = {v0, ..., vn} be a compatible data path:

• If P is a FIFO-compatible path, the structure lifetime will be [τmin(v0), τmax(vn)].
• If P is a LIFO-compatible path, the structure lifetime will be [τmin(v0), τmax(v0)].
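The two lifetime rules can be expressed directly; the paths and dates below are made up for illustration:

```python
# Sketch of the structure-lifetime rules for a compatible data path.
# Each element of `path` is a (tmin, tmax) lifetime, in write order.

def structure_lifetime(path, kind):
    """kind is 'FIFO' or 'LIFO'; returns the structure's usage interval."""
    if kind == "FIFO":
        # First datum written first; last datum's final read closes the interval.
        return (path[0][0], path[-1][1])
    # LIFO: the first datum written is the last one read out.
    return (path[0][0], path[0][1])

fifo_path = [(0, 4), (1, 5), (2, 6)]   # reads follow write order
lifo_path = [(0, 6), (1, 4), (2, 3)]   # nested lifetimes
print(structure_lifetime(fifo_path, "FIFO"))  # (0, 6)
print(structure_lifetime(lifo_path, "LIFO"))  # (0, 6)
```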

Storage element optimization: The goal of this final task is to maximize storage resource usage, in order to optimize the resulting architecture by minimizing the number of storage elements and the number of structures to be controlled. To tackle this problem, we build a new hierarchical RCG using the merged nodes and their lifetimes. In order to avoid any conflict, the exploration algorithm of the optimization step only searches for Register compatibility paths between vertices of the same type. When two structures of the same type are Register compatible, they can be merged.

Let P = {v0, ..., vn} be a Register-compatible data path:

• The lifetime of the resulting hierarchical merged structure will be [τmin(v0), τmax(v0)] ∪ ... ∪ [τmin(vn), τmax(vn)].

The algorithm is very similar to the one used during the binding step. When there is no more merging solution, the resulting graph is used to generate the RTL VHDL


Fig 9.13 Optimization of the Fig 9.11 graph: a three-stage FIFO storing a, b and f, and a two-stage FIFO storing c, d and e.

architecture. Figure 9.13 shows a possible architectural solution for the Resource Compatibility Graph presented in Fig 9.11. Here, the resulting architecture consists of a three-stage FIFO that handles three data and a two-stage FIFO that handles three data: one memory place has been saved.

9.3.3 Memory Unit Synthesis

In this section, we present two major features of GAUT regarding the memory system. First, the data distribution and placement are formalized as a set of constraints for the synthesis. We introduce a formal model for the memory accesses, and an accessibility criterion to enhance the scheduling step. Next, we propose a new strategy to implement signals described as ageing vectors in the algorithm. We formalize the maturing process and explain how it may generate memory conflicts over several iterations of the algorithm. The final compatibility graph indicates the set of valid mappings for every signal. Our scheduling algorithm exhibits a relatively low complexity that allows us to tackle complex problems in a reasonable time.

9.3.3.1 Memory Constrained Scheduling

In our approach, the data flow graph (DFG) first generated from the algorithmic specification is parsed and a memory table is created. This memory table is completed by the designer, who can select the variable implementation (memory or register) and place the variable in the memory hierarchy (in which bank). The resulting table is the memory mapping that will be used in the synthesis. It presents all the data vertices of the DFG. The data distribution can be static or dynamic.

In the case of a static placement, the data remain at the same place during the whole execution. If the placement is dynamic, data can be transferred between different levels of the memory hierarchy. Thus, several data can share the same location in the circuit memory. The memory mapping file explicitly describes the data transfers to occur during the algorithm execution.

Direct Memory Access (DMA) directives will be added to the code to achieve these transfers. The definition of the memory architecture is performed in the first step of the overall design flow. To achieve this task, advanced compilers such as the Rice HPF compiler, Illinois Polaris, or Stanford SUIF could be used [14]. Indeed, these compilers automatically perform data distribution across banks, determine


Fig 9.14 Memory constraint graphs for signal samples x0-x3 (bank 1) and coefficients h0-h3 (bank 2).

which access goes to which bank, and then schedule accesses to avoid bank conflicts. The Data Transfer and Storage Exploration (DTSE) methodology from IMEC and the associated tools (ATOMIUM, ADOPT) are also a good means to determine a convenient data mapping [15].

We modified the original priority list (see Sect. 9.3.2.2) to take the memory constraint into account: an accessibility criterion is used to determine whether the data involved in an operation are available, that is to say, whether the memory where they are stored is free. Operations are still listed according to the mobility and bit-width criteria, but all operations that do not match the accessibility criterion are removed. An operation that needs to access a busy memory will not be scheduled, no matter its priority level. Fictive memory access operators are added (one access operator per access port to a memory). The memory is accessible only if one of its access operators is idle. Memory access operators are represented by tokens on the Memory Constraint Graph (MCG): there are as many tokens as access ports to the memory or bank. Figure 9.14 shows two MCGs, for signal samples x[0] to x[3] stored in bank 1 and coefficients h[0] to h[3] stored in bank 2 (in the case of a four-point convolution filter, for instance).

If one bank is being accessed, one token is placed on the corresponding data. Only one token is allowed for a one-port bank. Dotted edges indicate which following access will be the faster: in the case of a DRAM, slower random accesses are indicated with plain edges and faster sequential accesses with dotted edges. Our scheduling algorithm always favors the fastest sequence of accesses whenever it has the choice.
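The token-based accessibility criterion can be sketched as a per-bank port-token counter; the class and method names are hypothetical, not GAUT's:

```python
# Sketch of the accessibility criterion: an operation is schedulable only if
# the bank it accesses still has a free access token (one token per port).

class Bank:
    def __init__(self, ports):
        self.tokens = ports       # free access tokens, one per port

    def acquire(self):
        """Try to take a token; False means the bank is busy this cycle."""
        if self.tokens == 0:
            return False          # operation deferred: bank not accessible
        self.tokens -= 1
        return True

    def release(self):
        self.tokens += 1          # access finished, token returned

bank1 = Bank(ports=1)
print(bank1.acquire())  # True  -> first access granted
print(bank1.acquire())  # False -> second access this cycle is deferred
bank1.release()
```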

9.3.3.2 Implementing Ageing Vector

Signals are the input and output flows of the applications. A mono-dimensional signal x is a vector of size n if n values of x are needed to compute the result. Every cycle, a new value of x (x[n + 1]) is sampled on the input and the oldest value of x (x[0]) is discarded. We call x an ageing, or maturing, vector. Ageing vectors are stored in RAM. A straightforward way to implement the maturing of a vector in hardware is to always write its new value at the same address in memory, at the end of the vector in the case of a 1D signal for instance. Obviously, that requires shifting every other value of the signal in the memory to free the place for the new value. This shifting necessitates n reads and n writes, which is very time and power consuming. In GAUT, the new value is stored at the address of the oldest one in the


Fig 9.15 Evolution of the logical addresses of the samples of signal x over four iterations (L = 4, k = 1: each sample's address decreases by 1 modulo 4 at every iteration).

Fig 9.16 The logical address graph (LAG), ageing graph (AG) and unified sequences graph (USG) for signal x.

vector. Only one write is needed. Obviously, the address generation is more difficult in this case, because the addresses of the samples called in the algorithm change from one cycle to the next. Figure 9.15 represents the evolution of the addresses for an L = 4 point signal x from one iteration to the next.

The methodology that we propose to support the synthesis of these complex logical address generators is based on three graphs (see Fig 9.16). The logical address graph (LAG) traces the evolution of the logical addresses for a vector during the execution of one iteration of the algorithm. Each vertex corresponds to the logical address where samples of signal x are to be accessed. Edges are weighted with two numbers. The first number, f(i,j), indicates how the logical address evolves between two successive accesses to vector x: f(i,j) = (j − i) % L (% denotes the modulo operator). The second number, g(i,j), indicates the number of iterations between those two successive accesses.

To actually calculate the evolution of the logical addresses of x from one iteration to the next, we must take into account the ageing of vector x. We introduce the ageing factor k as the difference between the logical address of element x[i] at iteration o and the logical address of element x[i] at iteration o + 1, so that:

@x[j]i+1 = (@x[j]i − k) % L.

In our example, k = 1 The Ageing Graph (Fig 9.16) is another representation of

this equation We finally combine the LAG and the ageing factor to get the Unified

Sequences Graph (USG) (Fig 9.16) A detailed definition of those three graphs may

be find in [16]

By moving a token in the USG, and by adding to the first logical address for x

the value of weight f i , j minus the ageing factor k, we get the address sequence for

x during the complete execution of the algorithm Then, the corresponding address generator is generated
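The ageing equation @x[j]i+1 = (@x[j]i − k) % L can be replayed directly. Assuming samples start at address @x[j] = j at iteration 0 (an assumed initial mapping), this reproduces the four-iteration rotation of the L = 4, k = 1 example:

```python
# Replaying the ageing-address equation @x[j](i+1) = (@x[j](i) - k) % L
# for the L = 4, k = 1 example of Fig 9.15.

L, k = 4, 1
addr = {j: j for j in range(L)}   # assumed iteration-0 mapping: x[j] at address j
history = []
for iteration in range(4):
    history.append([addr[j] for j in range(L)])
    # Every sample's address decreases by the ageing factor, modulo L.
    addr = {j: (addr[j] - k) % L for j in range(L)}

print(history)
# [[0, 1, 2, 3], [3, 0, 1, 2], [2, 3, 0, 1], [1, 2, 3, 0]]
```

Only one address changes role per iteration: the slot freed by the discarded oldest sample receives the newly sampled value, so a single write suffices.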

If a pipelined architecture is synthesized, the ageing factor k is multiplied by

the number of pipeline slices, and as many tokens as pipeline slices are placed and


moved in the USG. Of course, as many memory locations as supplemental tokens in the USG must be added to guarantee data consistency. Concurrent accesses to elements of vector x may appear in a pipelined architecture. While moving tokens in the USG, a Concurrent Accesses Graph is constructed. This graph is finally colored to obtain the number of memory banks needed to support the access concurrency.

9.3.4 Communication and Interface Unit Synthesis

9.3.4.1 Latency Insensitive Systems

Systems on a chip (SoCs) are the composition of several sub-systems exchanging data. SoC size increase is such that an efficient and reliable interconnection strategy is now necessary to combine sub-systems and preserve, at an acceptable design cost, the speed performance that current very deep sub-micron technologies allow [20]. This communication requirement can be satisfied by a latency-insensitive system (LIS) communication network between hardware components. The LIS methodology makes it possible to build functionally correct SoCs by (1) promoting intensive reuse of pre-developed components (IPs), (2) segmenting inter-component interconnects with relay stations to break critical paths, and (3) making components robust to data stream latencies by encapsulating them into synchronization wrappers. These encapsulated blocks are called "patient processes". Patient processes [21] are a key element of the LIS theory. They are suspendable synchronous components (named pearls) encapsulated into a wrapper (named shell) whose function is to make them insensitive to the I/O latency and to drive the clock. The decision whether or not to drive the component's clock is implemented with combinational logic. The LIS approach relies on a simplifying, but restrictive, assumption: a component is activated only if all its inputs are valid and all its outputs are able to store a result produced at the next clock cycle. Yet it is frequent that only a subset of the inputs and outputs is necessary to execute one step of computation in a synchronous block.

To limit the patient process's sensitivity to a subset of the inputs and outputs, the authors of [22] suggest replacing the combinational logic that drives the clock with a Mealy-type FSM. This FSM tests the state of only the relevant inputs and outputs at each cycle and drives the component clock only when they are all ready. The major drawbacks of FSMs are their difficult synthesis and large silicon area when communication scenarios are long and complex, as for compute-intensive digital signal processing applications. To reduce the hardware cost, in [23] the component activation static schedule is implemented with shift registers whose contents drive the component's clock. This approach relies on the hypothesis that there are no irregularities in the data streams: it is never necessary to randomly freeze the components.


9.3.4.2 Proposed Approach

As (1) the LIS methodology lacks the ability to dynamically sense I/O subsets, (2) FSMs can become too large as the communication bandwidth grows, and (3) shift-register-based synchronization targets only extremely fast environments, we propose to encapsulate hardware components into a new synchronization wrapper model whose area is much smaller than that of the FSM-based wrappers, whose speed is enhanced (mostly thanks to the area reduction), and whose synthesizability is guaranteed whatever the communication schedule is.

The solution we propose is functionally equivalent to the FSMs. It is a specific processor that cyclically reads and executes operations stored in a memory. We name it a "synchronization processor" (SP). Figure 9.1 shows the new synchronization wrapper structure with our SP.

The SP communicates with the LIS ports through FIFO-like signals. These signals are formally equivalent to the voidin/out and stopin/out of [19] and the valid, ready and stall of [22]. The number of input and output ports can be arbitrary. The SP drives the component's clock with the enable signal. The SP model is specified by a three-state FSM: a reset state at power-up, an operation-read state, and a free-run state. This FSM is concurrent with the component and contains a datapath: it is a "concurrent FSM with datapath" (CFSMD). An operation's format is the concatenation of an input mask, an output mask and a free-run cycle count. The masks specify respectively the input and output ports the FSM is sensitive to. The free-run cycle count represents the number of clock cycles the component can execute until the next synchronization point. To avoid unnecessary signals and save area, the memory is an asynchronous ROM (or SRAM with FPGAs) and its interface with the SP is reduced to two buses: the operation address and the operation word. The execution of the program is driven by an operation read-counter incremented modulo the memory size.
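A minimal model of one SP operation, assuming the (input-mask, output-mask, free-run-count) format described above; the port encodings below are illustrative bit masks, not a real SP instruction encoding:

```python
# Illustrative model of one SP synchronization step: the component clock is
# enabled only when every masked input is valid and every masked output ready.

def step(op, inputs_valid, outputs_ready):
    """op = (input mask, output mask, free-run cycle count)."""
    in_mask, out_mask, cycles = op
    if (inputs_valid & in_mask) != in_mask or (outputs_ready & out_mask) != out_mask:
        return 0          # stall: the component clock stays disabled
    return cycles         # free-run for `cycles` ticks until the next sync point

print(step((0b01, 0b10, 3), inputs_valid=0b11, outputs_ready=0b10))  # 3
print(step((0b11, 0b00, 5), inputs_valid=0b01, outputs_ready=0b00))  # 0
```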

9.4 Experiments

Design synthesis results for Viterbi decoders are presented in this section. The results are based on the Virtex-E FPGA technology of the hardware prototyping platform that we used and that we present first.

9.4.1 The Hardware Platform

The Sundance platform [24] we used as an experimental support is composed of the latest generation of C6x DSPs and Virtex FPGAs. Communications between the different functional blocks are implemented with high-throughput SDB links [24]. We have automated the generation of communication interfaces for software and hardware


components, which frees the user from designing the communication interfaces. At the hardware level, the communication between computing nodes is handled by four-phase handshaking protocols and decoupling FIFOs. The handshaking protocols synchronize computation with communication, and the FIFOs make it possible to store data in order to overcome potential data flow irregularities. Handshaking protocols are used to communicate seamlessly either between hardware nodes or between hardware and software nodes. They are automatically refined by the GAUT tool to fit the selected (SDB) inter-node platform communication interfaces (bus width, signal names, etc.). To complete the software code generation, platform-specific code has to be written to ensure the communication between processing elements. The communication drivers of the targeted platform are called inside the interface functions introduced in the macro-architecture model through an API mechanism. We provide a specific class for each type of link available on the platform.

9.4.2 Synthesis Results

The Viterbi algorithm is applicable to a variety of decoding and detection problems that can be modeled by a finite-state discrete-time Markov process, such as convolutional and trellis decoding in digital communications [25]. Based on the received symbols, the Viterbi algorithm estimates the most likely state sequence according to an optimization criterion, such as the a posteriori maximum likelihood criterion, through a trellis which generally represents the behavior of the encoder. The generic C description of the Viterbi algorithm allowed us to synthesize architectures using different values of the following functional parameters: state number and throughput. A part of the synthesis results that have been obtained is given in Fig 9.17. For each generated architecture, the table presents the throughput constraint and the complexity of both the algorithm (number of operations) and the generated architecture (amount of logic elements).

In the particular case of the DVB-DSNG Viterbi decoder (64 states), different throughput constraints (from 1 to 50 Mbps) have been tested. Figure 9.18 presents the synthesis results.

State number:              8    16    32    64   128
Throughput (Mbps):        44    39    35    26    22
Number of operations:     50    94   182   358   582
Number of logic elements:

Fig 9.17 Synthesis results for different Viterbi decoders
