High Level Synthesis: from Algorithm to Digital Circuit- P13 ppt


void block_idct(short input[8][8], short output[8][8]) {
    short buffer[8][8];
    idct_row(input, buffer);
    idct_col(buffer, output);
}

Fig. 6.3 Pseudo-code for an IDCT block

• Loop pipelining allows multiple successive iterations of a loop to operate in parallel by starting an iteration before the previous one has completed. As a result, both the loop throughput and the overall loop latency can be improved.

• Hierarchical functional pipelining pipelines a function so that the same functional body can start processing new input data before it has finished with the current data set. Given a target throughput constraint (in terms of the number of cycles after which new data can be introduced), the pipelining can be applied hierarchically to the callee functions.

• Multi-function pipelining executes two or more communicating functions concurrently in a streamed manner. For example, Fig. 6.3 illustrates an 8 × 8 inverse discrete cosine transform (IDCT) algorithm. Multi-function pipelining will pipeline the execution of the row-based transform (idct_row) and the column-based transform (idct_col) and automatically insert a ping-pong memory buffer to hold the intermediate data produced and consumed by these two functions. With this pipeline, the overall throughput of the entire block_idct function can be significantly increased.

6.4.4 Interface Synthesis

With AutoPilot’s platform-based synthesis methodology, designers are not required to hard-code any target-specific interface timing behavior into the source code. Designers can simply use standard function parameters to expose the desired inputs and outputs to the external circuits. AutoPilot interface synthesis is responsible for converting the parameter reads and writes into the actual interface accesses. For example, based on the communication interfaces specified in the platform library, a store operation on a scalar pointer (e.g., *p = x) can be turned into a direct wire connection, a FIFO write, or even a bus transfer (both pipelined and burst-mode transfers are supported).

This capability is particularly convenient for C and C++ design entries. SystemC-based designs can benefit from this feature as well, although SystemC already provides users with an array of language constructs for specifying cycle-true and pin-accurate interface connections.


6.5 Experimental Results

We have used AutoPilot to synthesize several complex real-world designs for both FPGAs and ASICs across a wide range of applications, including multimedia image/video processing, digital signal processing, machine learning, financial engineering, and VLSI CAD algorithms.

In this section we report preliminary synthesis results on FPGAs to demonstrate AutoPilot in three important usage models: hardware synthesis, system-level design exploration, and reconfigurable accelerated computing.

6.5.1 Hardware Synthesis

6.5.1.1 MPEG-4 Simple Profile Decoder

We used AutoPilot to synthesize a real industrial design, the MPEG-4 simple profile decoder from Xilinx [9]. As shown in Fig. 6.4 (from [9]), the entire design contains several pipelined modules, which are interconnected by FIFOs or object FIFOs to form a block-level pipeline.

In our experiments, the same system-level architecture is used, while each submodule is synthesized by the AutoPilot system from a C language specification. Manual changes are needed only in a few places to convert dynamic pointers into synthesizable static pointers.

The synthesis results are reported in Table 6.1. The VHDL code that AutoPilot generates automatically has more than 10X as many lines as the original C specification. Targeting a Xilinx Virtex-II Pro FPGA (v2p30), the total resource usage is around 7K slices.

Fig. 6.4 Xilinx MPEG-4 simple profile decoder top-level block diagram

Table 6.1 MPEG-4 simple profile decoder synthesis results

Module                        C source file     C line#   VHDL line#   Slices
Motion Comp                   motion_comp.c         312        4,681      899
Parser/VLD                    bitstream.c           439        6,093    2,693
                              motion_decode.c       492       10,934
                              parser.c            1,095       12,036
                              texture_vld.c         504        6,089
Texture/IDCT                  texture_idct.c      1,819       11,537    2,032
Copy control/texture update   copy_control.c        287        2,815    1,407
                              texture_up.c          220        2,736

Table 6.2 Alternate HW/SW implementations for MPEG-4 decoder

(Columns: MicroBlazes | PowerPC | HW MotionComp; the table body is missing from the extracted text.)

It is worth mentioning that the final area can be significantly reduced with further code refinement, such as bitwidth annotations on the function parameters. The main purpose of this experiment is to demonstrate that AutoPilot can quickly synthesize complex vanilla C code into hardware and meet the performance target. We set the target clock period to 8 ns, and the Xilinx ISE v8.1 static timing analyzer reports positive slack for all the final modules. The final performance can be estimated for each module using the reported frequency and latency results. Overall, the throughput requirement of 30 frames per second is easily achieved for a 352 × 288 frame size (CIF format).

6.5.2 System-Level Design Exploration

AutoPilot can also facilitate quick system-level exploration for embedded designs. To demonstrate this advantage, we have explored three alternative implementations of the MPEG-4 simple profile decoder on a Xilinx Virtex-II Pro development board. The first design comprises seven MicroBlaze soft-core processors, each of which implements a sub-module of the MPEG-4 decoder. The second design uses a single PowerPC core on the Xilinx FPGA to execute the entire MPEG-4 C program. The third implementation is a hybrid hardware/software design which offloads the motion compensation block onto the FPGA fabric using AutoPilot synthesis.

As shown in Table 6.2, the PowerPC version is about 2.6X faster than the soft-core processor network. The speedup is primarily due to the higher clock frequency (up to 450 MHz) of the hard-core PowerPC. In addition, the computation workloads on the seven MicroBlazes are not evenly distributed, which degrades the performance of the processor pipeline.


According to profiling results, the motion compensation module accounts for approximately 16% of the total software decoding time. After we synthesize this block onto the FPGA for the third design, a 15% throughput increase can be observed, which implies that the latency of the time-consuming motion compensation process has been effectively hidden by the automatic synthesis. Interestingly, the resulting hardware block (around 900 slices) is smaller than a MicroBlaze processor. Performance/area tradeoffs of this kind can be easily achieved with the aid of AutoPilot synthesis.

6.5.3 FPGA-Based Accelerated Computing

One frontier of innovation in the High-Performance Computing (HPC) field is to harness FPGAs to accelerate domain-specific applications by one or more orders of magnitude over general-purpose microprocessors.

Automatic synthesis support for high-level programming languages (such as C, C++, and FORTRAN) is of paramount importance in allowing software designers to develop algorithms and implement them on FPGAs.

6.5.3.1 Lithographic Aerial Image Simulation

In this case study we use AutoPilot to accelerate a lithographic aerial image simulation application, which is an essential component in most DFM (Design for Manufacturability) flows. The lithography simulation itself is a very computationally demanding process and often requires clusters with hundreds of CPUs to achieve an acceptable turn-around time.

The kernel of the simulation engine is a nested loop illustrated in Fig. 6.5. Abundant data-level parallelism can be exposed by careful loop unrolling and array/memory partitioning. Loop pipelining and multi-function pipelining are also applied to further increase the performance.

for (x = 0; x < pixel_max; ++x) {
    for (y = 0; y < pixel_max; ++y) {
        // Initialize pixel intensities
        I[x][y] = 0;
        for (k = 0; k < K; ++k) {
            // Initialize partial sum
            I_k[x][y] = 0;
            // Core computation
            for (n = 0; n < 4 * N; ++n) {
                addr_x = 5 * x - rect_x[n] + c;
                addr_y = 5 * y - rect_y[n] + c;
                // Alternating sign (-1)^n
                I_k[x][y] += ((n % 2) ? -1 : 1) * kernel[k][addr_x][addr_y];
            }
            I[x][y] += I_k[x][y] * I_k[x][y];
        }
    }
}

Fig. 6.5 Pseudo-code for the simulation kernel

The whole algorithm is written in 2,226 lines of C code and synthesized by AutoPilot, which generates about 24K lines of VHDL code. The accelerator has been implemented on the XtremeData XD1000™ development system [3]. The development system uses a dual Opteron™ motherboard in which one of the Opteron processors is replaced by an XD1000 co-processor module. The XD1000 co-processor is built around an Altera Stratix II EP2S180 and is compatible with the Opteron Socket 940. The FPGA co-processor communicates with the host Opteron CPU via HyperTransport™ links.

We use Altera Quartus II v6.0 to implement the generated RTL on the Stratix II FPGA. Table 6.3 shows the resource usage of the synthesized accelerator, which consumes around 30% of the device resources in ALUT logic and memory bits. The final clock frequency is above 100 MHz.

To measure the performance speedup, we conduct experiments on a 200 µm × 200 µm chip layout specified in GDSII format. We divide the image into 1,000 nm × 1,000 nm regions and simulate each region with a kernel look-up table sized 2,000 nm by 2,000 nm. We also generate a number of layouts with different densities (N). The software implementation runs on an AMD Opteron 248 processor at 2.2 GHz with 4 GB of DDR memory. The program is compiled with GCC at the -O3 optimization level.

Table 6.3 Resource usage of the synthesized accelerator with 5 × 5 partitioning

               ALUTs     Memory bits   Fmax (MHz)
Accelerator    23,641    2,883,296     117.01

Fig. 6.6 Execution time comparison with and without the synthesized accelerator (horizontal axis: layout density N; curves: with accelerator, without accelerator)


Figure 6.6 shows the measured execution time and speedup for different layout densities N. Note that for a very small N, the speedup is degraded because the communication time dominates the computation time on the FPGA. For a moderate N, we can achieve a speedup of around 15X even with the communication overhead between the CPU and the hardware accelerator.

The acceleration on FPGA also provides significant power and energy savings. According to the Altera Quartus II PowerPlay analysis tool, the synthesized hardware block consumes 6,954 mW, roughly one tenth of the power consumption of the AMD Opteron processor (about 70 W). Combined with the 15X performance speedup, this gives an energy saving of about 150X over the CPU.

Acknowledgments The authors would like to thank Xilinx for providing the MPEG-4 decoder

example, XtremeData for lending the XD1000 development platform, and Yi Zou at UCLA for sharing the lithographic simulation result.

References

1. SystemC Synthesizable Subset (Draft 1.1.18), 2004. Open SystemC Initiative. http://www.systemc.org

2. IEEE 1666™-2005 Standard for SystemC, 2005. IEEE and OSCI. http://www.systemc.org

3. XD1000™ FPGA Coprocessor Module for Socket 940, 2006. XtremeData Inc. http://www.xtremedatainc.com

4. H100 Series FPGA Application Accelerators, 2007. Nallatech. http://www.nallatech.com

5. Cong, J., Fan, Y., Han, G., Jiang, W., and Zhang, Z. (2006). Platform-Based Behavior-Level and System-Level Synthesis. In Proc. IEEE International SOC Conference, pages 199–202.

6. Cong, J., Fan, Y., and Jiang, W. (2006). Platform-Based Resource Binding Using a Distributed Register-File Microarchitecture. In Proc. International Conference on Computer-Aided Design, pages 709–715.

7. Cong, J. and Zhang, Z. (2006). An Efficient and Versatile Scheduling Algorithm Based on SDC Formulation. In Proc. Design Automation Conference, pages 433–438.

8. Ghenassia, F. (2005). Transaction-Level Modeling with SystemC: TLM Concepts and Applications for Embedded Systems. Springer, Berlin Heidelberg New York.

9. Schumacher, P., Denolf, K., Chirila-Rus, A., Turney, R., Fedele, N., Vissers, K., and Bormans, J. (2005). A Scalable, Multi-Stream MPEG-4 Video Decoder for Conferencing and Surveillance Applications. In Proc. IEEE International Conference on Image Processing, pages II:886–889.

10. Wakabayashi, K. (2004). C-Based Behavioral Synthesis and Verification Analysis on Industrial Design Examples. In Proc. ASPDAC, pages 344–348.


“All-in-C” Behavioral Synthesis and Verification with CyberWorkBench

From C to Tape-Out with No Pain and A Lot of Gain

Kazutoshi Wakabayashi and Benjamin Carrion Schafer

Abstract This chapter introduces the benefits of a C language-based behavioral synthesis design methodology over traditional RTL-based methods for System LSI, or SoC, designs. A comprehensive C-based tool flow, based on CyberWorkBench™ (CWB) and developed during the last 20 years at NEC’s R&D laboratories, is introduced. This includes behavioral synthesis, formal verification, and hardware–software co-simulation of entire complex SoCs. First we introduce the “all-in-C” concept on which CWB is based.

Then we discuss behavioral synthesis for various types of circuits and examine its advantages using commercial ICs as examples. We show that entire SoCs are currently created using this flow in a fraction of the time taken by traditional approaches.

Behavioral IP, C-based configurable processor synthesis, and automatic architecture exploration are explained next. At the end we demonstrate a real-world example of a mobile phone SoC in which most of the modules are synthesized from C descriptions using CWB.

Keywords: Behavioral synthesis, Control and data intensive flows, All-in-C, Behavioral C level formal verification, Hardware-software co-simulation, Automatic system exploration, Behavioral IP, Configurable processor

P. Coussy and A. Morawiec (eds.), High-Level Synthesis.

7.1 Introduction

The design productivity gap problem is becoming more and more serious as VLSI systems become larger. In the mid-1980s, gate-level design shifted to register transfer level (RTL) design for designs that typically exceeded 100K gates (we assume a hundred thousand gates is the upper limit for hand-coded modules to be designed in several months).

Currently, circuits of several million gates are commonly used just for the random logic parts of a design, which equates to more than several hundred thousand lines of RTL code. It is therefore necessary to raise the design abstraction by one more level in order to cope with this increasing complexity. Behavioral synthesis is a logical way to go, as it allows “less detailed design description” and “higher reusability”.

A description at a higher level of abstraction requires less code and provides faster simulation times. For example, a one-million-gate circuit requires about 300K lines of RTL (Verilog or VHDL) code, but only around 40K lines of C code. The RTL simulation of 300K lines, as we observed in [1], is on average 10–100 times slower than the simulation of the 40K lines of equivalent behavioral code (it is important to note that, in order to benefit from the higher level of abstraction, the entire design needs to be modeled at the behavioral level).

It is sometimes claimed that behavioral synthesis is only useful for dataflow-intensive circuits, but not for control-dominated circuits. We believe that behavioral synthesis can and should be used for all hardware modules in order to truly benefit from it. We will demonstrate this with an example of a real complex SoC design in which all custom design modules, except the analog ones, have been designed using behavioral synthesis. NEC Electronics adopted behavioral synthesis as its standard design methodology in 2003 and has since taped out several hundred million dollars’ worth of “C-based” chips every year.

Since the benefits of behavioral synthesis are palpable through multiple commercial chip successes, behavioral synthesis, or high-level synthesis, is gaining acceptance within the design community, especially in Japanese industry. Various commercial chips for printers, mobile phones, set-top boxes and digital cameras are designed using behavioral synthesis these days. ANSI-C is the preferred programming language for behavioral synthesis because embedded software is often described in C, design tools such as compilers, debuggers, libraries and editors are easily available, and there is a large amount of legacy code.

In this paper, we first provide an overview of our C-based design flow, where we compare its efficiency and simulation performance against pure RTL as well as against co-simulation with embedded software. We show the advantages of C-based behavioral IPs over RTL IPs and how application-specific processors can benefit from them. We present a hardware architecture explorer at the behavioral level, which allows a fast and easy way to study the area, performance and power trade-offs of different designs automatically. Finally, we demonstrate on a real complex design how behavioral synthesis can be used for any hardware module (data- and control-intensive).

7.2 C-Based Design Flow

We have been developing a C-based behavioral synthesizer called “Cyber” since the late 1980s [2], and have been developing C-based verification tools, such as formal verification and simulation, around Cyber during the last 10 years [3]. All these tools are integrated into an IDE, where designers execute them on the C source code. We named this IDE tool suite “CyberWorkBench™”.


7.2.1 Basic Concept of CyberWorkBench

The main idea behind CyberWorkBench is an “all-in-C” approach. This is built around two principal ideas: (1) “all-modules-in-C” and (2) “all-processes-on-C”.

(1) All-modules-in-C means that all modules in a VLSI design, including control-intensive circuits and data-dominant circuits, should be described in behavioral C. Our system supports legacy RTL or gate-netlist blocks as black boxes, which are called as C functions, while allowing designers to create all new parts in C. Mixing in legacy blocks is not recommended, however, as the designer will need to use two different languages and the RTL parts will slow down the simulation.

(2) All-processes-on-C means that synthesis and verification (including debugging) tasks should be done on the C source code. As an example, we can compare this with a software compiler. With a software compiler, a designer does not have to debug the generated machine language (or assembly language) directly. Similarly, in behavioral synthesis, a designer should not have to debug the generated RTL code. Our CWB environment allows a designer to debug the original C source code, and the CWB model checker allows the designer to write properties or assertions directly on the C source code.

7.2.2 Design Flow Overview

CWB targets general LSI systems, which normally contain several CPUs or DSPs, dedicated hardware modules, and some pre-designed or fixed RTL- or gate-level IP modules, connected either directly or through buses.

Initially, each dedicated hardware module, such as an ECC encryption module, is described in behavioral C. Once its functionality is verified using the C simulator and debugger, the hardware module is synthesized with our behavioral synthesizer. Configurable processors are also synthesized from their C descriptions in our environment. Legacy RTL modules are described as functions and handled as black boxes. The CPU bus and bus interface circuits are automatically generated using a CPU bus library. After synthesizing and verifying each hardware module, our design environment allows designers to create a cycle-accurate simulation model for the entire system, including CPUs, DSPs and custom hardware modules. With this simulation model, designers can verify both the functionality and the performance of their hardware design as well as of the embedded software that runs on the CPU, DSP and/or generated configurable processors. Behavioral synthesis is quick enough to allow designers to repeatedly modify and synthesize the hardware modules and embedded software. The behavioral C source code can also be debugged with our formal verification tool, a property/assertion model checker. Global properties and in-context (immediate) assertions are described for/in the C source code. The equivalence between the behavioral C and the generated RTL can be verified both dynamically and statically, as described later. Currently, the architectural-level parallelization is left to the designer. The designer partitions the C source code into individual hardware modules and embedded software based on the performance results of the cycle simulation or FPGA emulation.

Fig. 7.1 CyberWorkBench™ design flow

7.2.2.1 Synthesis Flow

Our design flow is shown in Fig. 7.1. A hardware design in extended ANSI-C (called “BDL”, or “Cyber-C”) [4] or SystemC is synthesized into synthesizable RTL with our “Cyber” behavioral synthesizer [1], given a set of design constraints such as clock frequencies and the number and kind of functional units and memories. Usually, RTL is handled as a black box, but if necessary, the RTL can also be fed to the behavioral synthesizer. The behavioral synthesizer can insert extra registers to speed up the original RTL and generate new RTL with a smaller delay. It also generates cycle-accurate simulation models in C++ or SystemC. Behavioral synthesis can therefore be considered a Verilog, VHDL, C, C++, and SystemC unification step.

The “Library Characterizer” generates delay and area information for the functional units and memories on a particular technology or FPGA.

A behavioral IP library, called “Cyberware”, is also included in the synthesis environment. Any part of the behavioral IP can be encrypted for security purposes.
