High-Level Synthesis: From Algorithm to Digital Circuit - P11


bind(c);

}

};

Note that the addition of these functions allows the binding to be done using the conventional SystemC port binding syntax:

socket.bind(channel);

or

socket(channel);

Also note that the binding functions are defined as templates. This lets the same ports and binding functions be used for port-to-port binding in a hierarchical design.
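The effect of templated binding functions can be pictured with a small stand-alone mock. This is a sketch in plain C++, not SystemC: the `Channel` and `Socket` types and the `resolve` helper are hypothetical stand-ins, and unlike real SystemC ports the mock requires the parent socket to be bound before the child.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a channel; not a SystemC type.
struct Channel {
    std::string name;
};

// Hypothetical socket whose binding functions are templates, so the same
// code accepts either a channel or a socket one level up the hierarchy.
struct Socket {
    const Channel* target = nullptr;

    // Reduce whatever was handed in to the underlying channel.
    static const Channel* resolve(const Channel& c) { return &c; }
    static const Channel* resolve(const Socket& s) { return s.target; }

    // socket.bind(x) form.
    template <typename C>
    void bind(C& c) { target = resolve(c); }

    // socket(x) form, forwarding to bind().
    template <typename C>
    void operator()(C& c) { bind(c); }
};
```

With this mock, `parent.bind(chan)` followed by `child(parent)` leaves both sockets pointing at the same channel, which is the port-to-port case the text describes.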

5.6 Structural Hierarchy

In addition to the process control constructs, SystemC synthesis supports the SystemC constructs for construction of structural hierarchies. An engineering team can attack a large design problem using structural decomposition, breaking the problem down into multiple smaller modules that communicate through user-defined interfaces. Individual sub-modules can be assigned to different team members if desired, supporting a conventional team structure and concurrent design approach. Each module can contain any number of cooperating SC_CTHREADs, SC_METHODs, and sub-modules. Communication between modules is achieved using a port-to-signal binding mechanism of a kind that is familiar to RTL designers, or even designers using schematics.

Here is an example of a hierarchical design using modular interfaces as described previously:

SC_MODULE(parent)
{
    // ports
    sc_in_clk clk;
    sc_in<bool> rst;
    RV_in< sc_uint<8> > din;
    RV_out< sc_uint<8> > dout;

    // submodules
    sub_module m_sub1;
    sub_module m_sub2;

    // signals and channels
    RV< sc_uint<8> > chan;

    SC_CTOR(parent)
        : m_sub1("sub1"),
          m_sub2("sub2"),
          chan("chan")
    {
        // bind first module using bind() function
        m_sub1.clk.bind(clk);
        m_sub1.rst.bind(rst);
        m_sub1.dout.bind(chan); // socket-to-channel

        // bind second module using socket() syntax
        m_sub2.clk(clk);
        m_sub2.rst(rst);
        m_sub2.din(chan);
        m_sub2.dout(dout);
    }
};

This use of SystemC constructs rather than tool constructs for implementation of hierarchy and communication improves the overall verification process dramatically. The complete structural hierarchy can be simulated at a behavioral level, accurately representing the concurrency of all the modules and threads, and accurately verifying the pin-level communication protocols between them. This allows the functional verification to be performed using high-speed behavioral simulation, and eliminates the need for many slow RTL simulations.

5.7 Creating RTL with Predictable Timing Closure

One of the challenges in RTL design is to ensure that the RTL you have written will have successful timing closure through logic synthesis at the specified clock rate when implemented in the chosen process technology. High-level synthesis has to meet the same challenge to be practical for wide deployment.

Cynthesizer achieves this by combining a number of steps. First, the timing information about the cells in the target process technology library is used as an input to the high-level synthesis process. This information is read from a Liberty-format .lib file provided by the chosen foundry.

Second, Cynthesizer has advanced datapath optimization technology that it uses to build a library of gate-level functional units such as adders, multipliers, multiplexors, etc., based on the cells available in the target technology .lib file. These functional units are optimized for a specific clock frequency, and may be implemented in a pipelined manner, where each pipeline stage is designed to fit within the designated clock period.

Functional unit library compilation is performed in advance of high-level synthesis, once per process technology and clock period, to speed the synthesis process. All the tools needed for library compilation to be performed by the user are included with Cynthesizer. No additional tool needs to be purchased.

Cynthesizer also creates custom functional units as needed during high-level synthesis. These include non-square parts (e.g., a 12-bit by 3-bit adder) as well as parts to implement more complex expressions. Cynthesizer automatically identifies useful expressions in the algorithm of the design (such as "a + (b * c) - 3") and builds gate-level parts on the fly that implement them.

Third, Cynthesizer uses this detailed timing information when it schedules the operations of the algorithm to ensure that no combinatorial path in use exceeds the clock period. Additional user controls are available to allow the user to adjust the "aggressiveness" with which Cynthesizer fills each clock period with logic. These controls can be used to make downstream timing closure even easier, thereby reducing processing time in downstream tools such as logic synthesis.
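The idea of filling each clock period only up to a chosen budget can be sketched in a few lines. This is an illustrative model in plain C++, not Cynthesizer's algorithm: the function name `cycles_for_chain`, the picosecond delays, and the `fill_percent` parameter (standing in for the "aggressiveness" control) are all assumptions made for the example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model: counts the clock cycles needed to execute a chain of
// dependent operations when each cycle may hold at most
// clock_period_ps * fill_percent / 100 of combinational delay.
// Returns -1 if a single operation is too slow to fit in one cycle.
int cycles_for_chain(const std::vector<int>& delays_ps,
                     int clock_period_ps, int fill_percent) {
    const int budget_ps = clock_period_ps * fill_percent / 100;
    int cycles = 0;
    int used_ps = 0;
    for (int delay : delays_ps) {
        if (delay > budget_ps) return -1;
        // Start a new cycle when the chain would overrun the budget.
        if (cycles == 0 || used_ps + delay > budget_ps) {
            ++cycles;
            used_ps = 0;
        }
        used_ps += delay;
    }
    return cycles;
}
```

Under these assumptions, a chain with delays of 2000, 2000, and 3000 ps fits in two cycles of a 5000 ps clock when the whole period may be filled, and needs three cycles when only 70% of the period may be used, leaving more slack for downstream tools.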

Cynthesizer produces RTL that has a structural character. Adders, multipliers, multiplexors, etc. are instantiated, with a finite state machine determining what values are presented to each instantiated part in each clock cycle. This ensures that the timing assumptions made during high-level synthesis are maintained during logic synthesis.

5.8 Scheduling

It has been noted that a primary benefit of using behavioral synthesis is the ability to write clocked processes whose functionality takes more than one clock cycle. This gives the user the ability to control the latency and throughput of the resulting circuit without performing detailed resource assignment and scheduling by hand.

At the same time, I/O activity at the ports of the module being synthesized must conform to a specified protocol in order to have the synthesized block interoperate with other blocks. The protocol mandates that certain relationships between I/O events must be held constant. For instance, the data value must be presented on the data bus in the same cycle as the data valid line is driven to true.

5.8.1 Mixed-Mode Scheduling

Cynthesizer addresses these requirements by providing a number of directives that give the user high-level control of its scheduling. The Cynthesizer scheduler allows different code blocks in a single SC_CTHREAD to be scheduled differently according to the user's requirements. A "code block" is defined as any section of C++ code delimited by "{" and "}". Thus it can be a clause of an if-else statement, the body of a loop, or any other set of statements that the user chooses to group together. Note that while the protocol can be written in-line as it is shown here, protocols are typically encapsulated into modular interface classes for ease of use and ease of reuse.

SC_MODULE
  SC_CTHREAD
    while (1) {
        {
            CYN_PROTOCOL("name1");
            // Get inputs
        }
        // algorithm
        {
            CYN_PROTOCOL("name2");
            // Write output
        }
    }

5.8.2 Unconstrained Scheduling

To begin with, it is assumed that all the code in the design, unless otherwise identified, is completely untimed, and that the scheduler of the high-level synthesis process has complete freedom to implement the functionality in as many or as few clock cycles as it chooses. No guarantees of any cycle-by-cycle timing are made in this unconstrained code, although the order of operations determined by the dependency relationships within the code is maintained.

By default, without any scheduling constraints, Cynthesizer will optimize for area, taking as many cycles as necessary to complete the computation with a minimal set of functional units.


5.8.3 Scheduling for Protocol Accuracy

In order to give the user maximum control of cycle-by-cycle timing for implementing protocols, Cynthesizer allows the specification of cycle-accurate blocks of code by the use of the CYN_PROTOCOL directive. This directive, associated with a particular code block, directs Cynthesizer not to insert any clock cycles within that code block except for those specified by the user with wait() statements. Within these protocol blocks, scheduling ensures that the ordering of port and signal I/O and the given wait()s is held constant.
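The rule that only wait() advances the clock inside a protocol block can be modeled with a short sketch. This is plain C++ for illustration only: the `Stmt` enumeration and both function names are hypothetical, and the check simply looks at the last write to the data and valid lines in the block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical statement kinds inside a cycle-accurate block.
enum class Stmt { DriveData, DriveValid, Wait };

// Only Wait advances the clock, so a statement's cycle is the number of
// Wait statements that precede it in the block.
std::vector<int> cycle_of_each(const std::vector<Stmt>& block) {
    std::vector<int> cycles;
    int cycle = 0;
    for (Stmt s : block) {
        cycles.push_back(cycle);
        if (s == Stmt::Wait) ++cycle;
    }
    return cycles;
}

// True when the block drives both data and valid, and the last write to
// each of them lands in the same cycle.
bool data_and_valid_share_cycle(const std::vector<Stmt>& block) {
    const std::vector<int> cycles = cycle_of_each(block);
    int data_cycle = -1;
    int valid_cycle = -1;
    for (std::size_t i = 0; i < block.size(); ++i) {
        if (block[i] == Stmt::DriveData) data_cycle = cycles[i];
        if (block[i] == Stmt::DriveValid) valid_cycle = cycles[i];
    }
    return data_cycle >= 0 && data_cycle == valid_cycle;
}
```

In this model a block that drives data, drives valid, and then waits satisfies the data-valid relationship mentioned earlier, while a block with a wait between the two writes does not.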

For some kinds of designs, such close scheduling control is needed that it is desirable to apply a CYN_PROTOCOL directive to the entire body of the while(1) loop that implements the bulk of the behavior of the SC_CTHREAD. In this case the user precisely specifies the cycle-by-cycle I/O behavior of the design. Even with this tight control, the user benefits from using high-level synthesis because the design is expressed without an explicit FSM designed by the user. In many cases Cynthesizer can schedule computations and memory accesses within the constraints of the I/O schedule as well.

5.8.4 Constraining Scheduling

Scheduling can be constrained to achieve specific latency targets by applying a CYN_LATENCY directive to a specific code block. This directs the scheduler to ensure that the behavior of the given block is scheduled within the number of cycles specified by the directive. The user is allowed to specify a minimum and maximum latency to be achieved.

For example, consider the following design, which reads in seven data values and outputs a computed result. The data is expressed as a structure:

struct data_struct
{
    sc_uint<8> A;
    sc_uint<8> B;
    sc_uint<8> C;
    sc_uint<8> D;
    sc_uint<8> E;
    sc_uint<8> F;
    sc_uint<8> G;
};

The module has a modular interface input port and a modular output port:

RV_IN<data_struct> in_port;
RV_OUT< sc_uint<28> > out_port;
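The 28-bit width of the output port can be checked with a little arithmetic on the expression (A + B + C) * (D + E + F) * G that the thread computes. The sketch below is plain C++ rather than SystemC and assumes all seven fields are unsigned 8-bit values; the helper names `max_unsigned`, `bits_for`, and `max_result` are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Largest value representable in an unsigned field of `bits` bits (bits < 64).
std::uint64_t max_unsigned(unsigned bits) {
    return (std::uint64_t{1} << bits) - 1;
}

// Number of bits needed to represent `value` as an unsigned integer.
unsigned bits_for(std::uint64_t value) {
    unsigned bits = 0;
    while (value != 0) {
        ++bits;
        value >>= 1;
    }
    return bits;
}

// Worst-case value of (A + B + C) * (D + E + F) * G for 8-bit inputs.
std::uint64_t max_result() {
    const std::uint64_t field = max_unsigned(8);  // 255
    const std::uint64_t sum = 3 * field;          // 765, which needs 10 bits
    return sum * sum * field;                     // 149232375
}
```

Each three-way sum needs 10 bits, and the worst-case product needs 28 bits, which matches the sc_uint<28> used for the output.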


The main while loop of the SC_CTHREAD is:

while( true )
{
    sc_uint<28> X;

    // read the data from the input port
    data_struct data = in_port.get();

    {
        // do the computation in 4 cycles
        CYN_LATENCY( 4, 4, "algorithm latency" );
        X = ( data.A + data.B + data.C ) * ( data.D + data.E + data.F ) * data.G;
    }

    // write the result to the output port
    out_port.put(X);
}

Cynthesizer can implement this using two adders and one multiplier, performing the computation in the specified four cycles with the following schedule. This produces an overall throughput of one value per six cycles.

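[Figure: schedule for the 4-cycle computation. The inputs A through G are read (in), the additions (+) and multiplications (*) are spread across the computation cycles, and the result is written (out); the schedule spans six cycles.]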

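If, on the other hand, a slower circuit were acceptable, a 6-cycle latency for the computation (resulting in an overall throughput of one value per eight cycles) could be achieved by changing the CYN_LATENCY directive to specify a latency of 6 cycles instead of 4.

Cynthesizer could achieve this with the following schedule:

[Figure: schedule for the 6-cycle computation. The inputs A through G are read (in), the additions (+) and multiplications (*) are spread across the computation cycles, and the result is written (out).]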

Note that Cynthesizer would automatically produce a new FSM and datapath to meet the desired latency without the user rewriting the algorithm.

Also note that this example is extremely simplified. In reality, more than one operation will often be chained within a single clock cycle, depending on the relationships between the required latency, the clock period, the propagation delay through the adders and multipliers, and their relative sizes. For instance, if the clock cycle were long enough and the target process technology were fast enough, the design could be scheduled in a single cycle using four adders and two multipliers.
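The trade-off between functional units and latency in this example can be reproduced with a toy scheduler. The following is a minimal sketch in plain C++, assuming one operation per functional unit per cycle and no chaining; it is a generic greedy list scheduler, not Cynthesizer's scheduling algorithm, and the names `Op`, `example_expression`, and `schedule_cycles` are made up for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Kind { Add, Mul };

struct Op {
    Kind kind;
    std::vector<std::size_t> deps;  // indices of ops whose results are consumed
};

// Dataflow graph of X = (A + B + C) * (D + E + F) * G.
std::vector<Op> example_expression() {
    return {
        {Kind::Add, {}},      // 0: A + B
        {Kind::Add, {0}},     // 1: (A + B) + C
        {Kind::Add, {}},      // 2: D + E
        {Kind::Add, {2}},     // 3: (D + E) + F
        {Kind::Mul, {1, 3}},  // 4: product of the two sums
        {Kind::Mul, {4}},     // 5: times G
    };
}

// Greedy list scheduling: in each cycle, start every ready operation for
// which a unit is still free. Returns the number of cycles used, or -1 if
// the graph cannot be scheduled (for example, no unit of a needed kind).
int schedule_cycles(const std::vector<Op>& ops, int adders, int multipliers) {
    std::vector<int> done_cycle(ops.size(), 0);  // 0 means not yet scheduled
    std::size_t remaining = ops.size();
    int cycle = 0;
    while (remaining > 0) {
        ++cycle;
        int free_add = adders;
        int free_mul = multipliers;
        bool progressed = false;
        for (std::size_t i = 0; i < ops.size(); ++i) {
            if (done_cycle[i] != 0) continue;
            bool ready = true;
            for (std::size_t d : ops[i].deps) {
                // A result is usable only in a cycle after it was produced.
                if (done_cycle[d] == 0 || done_cycle[d] >= cycle) ready = false;
            }
            int& free_units = (ops[i].kind == Kind::Add) ? free_add : free_mul;
            if (ready && free_units > 0) {
                --free_units;
                done_cycle[i] = cycle;
                --remaining;
                progressed = true;
            }
        }
        if (!progressed) return -1;
    }
    return cycle;
}
```

In this model, two adders and one multiplier give a 4-cycle computation, and one adder with one multiplier gives 6 cycles. Four adders and two multipliers still take 4 cycles here because the model does not chain operations within a cycle, which is the simplification the text points out.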

5.9 Loops

Unlike RTL, where loops are seldom used, looping constructs are common in high-level design. These include loops with non-constant bounds, where the loop termination condition depends on the state of the design and the input data, as well as simple for-loops with constant bounds.

5.9.1 Supported Loop and Loop Termination Statements

Cynthesizer supports loops of all forms in the SystemC input code. All the C++ loop statements may be used:

• “for” loops

• “while” loops

• “do/while” loops

The "continue" and "break" statements may be freely used for loop termination if desired.

5.9.2 Directives for Loop Control

Loops can be handled in three ways, depending on the parallelism desired by the user: the default looping implementation, unrolling, and pipelining.

5.9.3 Default Loop Implementation

The default behavior is for Cynthesizer to implement a loop as a looping structure in the finite-state machine that is built in the synthesized RTL. In this case there will be at least one cycle per iteration of the loop. This introduces the minimum parallelism, with a single instance of the needed hardware being reused for each iteration of the loop.

5.9.4 Unrolling

Unrolling a loop creates additional copies of the hardware that implements the loop body. These copies can operate in parallel, performing the computation of several iterations of the loop at the same time.

Loop unrolling is controlled using the CYN_UNROLL directive. The simplest form of the directive

CYN_UNROLL( ON, "tag" );

specifies that the loop be completely unrolled. As a convenience, ALL can be specified to completely unroll an entire loop nest:

CYN_UNROLL( ALL, "tag" );

For example, the following would result in four multipliers being used:

for ( int i = 0; i < 4; i++ )
{
    CYN_UNROLL( ON, "example loop" );
    array[i] = array[i] * 10;
}

as if it had been written as follows:

array[0] = array[0] * 10;
array[1] = array[1] * 10;
array[2] = array[2] * 10;
array[3] = array[3] * 10;

Loops can also be partially unrolled, creating parallel hardware for fewer than the total number of iterations of the loop, using a directive of the form:

CYN_UNROLL( CONSERVATIVE, N, "tag" );

So, the following loop

for ( int i = 0; i < 4; i++ )
{
    CYN_UNROLL( CONSERVATIVE, 2, "example loop" );
    array[i] = array[i] * 10;
}

would be implemented as if it had been written as follows:

for ( int i = 0; i < 4; i = i + 2 )
{
    array[i] = array[i] * 10;
    array[i + 1] = array[i + 1] * 10;
}
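That unrolling changes only the structure of the hardware and not the result can be checked in ordinary software. The sketch below is plain C++ with no synthesis directives; the `Data` alias and the three function names are chosen for this illustration, and the array holds four elements as in the examples of this section.

```cpp
#include <array>
#include <cassert>

using Data = std::array<int, 4>;

// Default form: one loop, one multiplication per iteration.
Data rolled(Data a) {
    for (int i = 0; i < 4; i++) {
        a[i] = a[i] * 10;
    }
    return a;
}

// Fully unrolled form: four independent multiplications.
Data fully_unrolled(Data a) {
    a[0] = a[0] * 10;
    a[1] = a[1] * 10;
    a[2] = a[2] * 10;
    a[3] = a[3] * 10;
    return a;
}

// Unrolled by two: two multiplications per iteration, two iterations.
Data unrolled_by_two(Data a) {
    for (int i = 0; i < 4; i = i + 2) {
        a[i] = a[i] * 10;
        a[i + 1] = a[i + 1] * 10;
    }
    return a;
}
```

All three functions return the same array for the same input; what differs in the synthesized design is how many multipliers exist and how many loop iterations the state machine steps through.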

5.9.5 Pipelining

Cynthesizer can automatically perform loop pipelining. This can be applied to any loop within the design. Pipelining the implementation of an entire thread can be accomplished by applying the pipelining directive to the while(1) loop that constitutes the bulk of the thread behavior. Consider our earlier example scheduled with a computational latency of 4. Recall that this consumed two adders and one multiplier to produce a throughput of one value every six cycles.

We could pipeline this earlier example as follows:

while( true )
{
    CYN_INITIATE( CONSERVATIVE, 2, "main loop" );
    data_struct data = in_port.get();
    sc_uint<28> X = ( data.A + data.B + data.C ) * ( data.D + data.E + data.F ) * data.G;
    out_port.put(X);
}


This constrains the synthesis schedule to initiate a new iteration of the loop every two cycles. This would result in the following schedule:

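[Figure: pipelined schedule. The first iteration (in1, out1) and the second iteration (in2, out2) each read the inputs A through G and perform the same additions (+) and multiplications (*); the two iterations overlap, and the schedule spans cycles 1 through 8.]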

Note that the maximum resource utilization occurs beginning in cycle 4, where two adders and one multiplier are used. By pipelining the design, we are able to achieve a throughput of two values every eight cycles without using any additional multipliers or adders. This is a 50% increase in throughput with no increase in computing resources. Note again that this is done without any need to recode the algorithm.
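The cycle counts quoted above follow from a simple formula. The sketch below is an idealized model in plain C++, assuming every iteration takes the same number of cycles and a new one can start at a fixed interval; the function names and parameters are illustrative and not part of any tool.

```cpp
#include <cassert>

// Cycles to finish n iterations when each takes `latency` cycles and the
// next iteration starts only after the previous one has finished.
long sequential_cycles(long n, long latency) {
    return n * latency;
}

// Cycles to finish n iterations when a new iteration is initiated every
// `interval` cycles and each still takes `latency` cycles end to end.
long pipelined_cycles(long n, long latency, long interval) {
    if (n == 0) return 0;
    return latency + (n - 1) * interval;
}
```

With a 6-cycle iteration and an initiation interval of 2, two values take 8 cycles in the pipelined design compared with 12 cycles when the iterations run back to back.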

5.10 Verification

The key verification advantage of SystemC high-level synthesis using Cynthesizer

is that the designer is able to:

• Design at a high level

• Verify the algorithm and the interface protocols using high-speed behavioral simulation
