The MONTIUM design methodology to map DSP applications onto the MONTIUM TP is divided into three steps:
1. The high-level description of the DSP application is analyzed and computationally intensive DSP kernels are identified (a plain-C sketch of such a kernel is given after this list).
2. The identified DSP kernels, or parts of the DSP kernels, are mapped onto one or multiple MONTIUM TPs that are available in a SoC. The DSP operations are programmed on the MONTIUM TP using MONTIUM C.
3. Depending on the layout of the SoC in which the MONTIUM processing tiles are applied, the MONTIUM processing tiles are configured for a particular DSP kernel or part of the DSP kernel. Furthermore, the channels in the NoC between the processing tiles are configured.
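To make step 1 concrete, the sketch below shows, in plain C, the kind of computationally intensive kernel (a FIR filter) that would typically be identified and then mapped onto a MONTIUM TP. It is illustrative only and does not use the actual MONTIUM C dialect or its intrinsics.

```c
/* Minimal plain-C sketch of a computationally intensive DSP kernel (a FIR
 * filter) of the kind identified in step 1 and mapped onto a MONTIUM TP in
 * step 2. Standard C for illustration; not MONTIUM C. */
#include <stddef.h>
#include <stdint.h>

/* y[n] = sum_{k=0}^{taps-1} h[k] * x[n-k], computed for n >= taps-1 */
void fir_kernel(const int16_t *x, const int16_t *h,
                int32_t *y, size_t len, size_t taps)
{
    for (size_t n = taps - 1; n < len; n++) {
        int32_t acc = 0;
        for (size_t k = 0; k < taps; k++)
            acc += (int32_t)h[k] * x[n - k];
        y[n] = acc;
    }
}
```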
11.3.1.3 ANNABELLE Heterogeneous System-on-Chip
In this section, the prototype ANNABELLE SoC is described according to the heterogeneous SoC template mentioned before. The SoC is intended to be used for digital radio broadcasting receivers (e.g., digital audio broadcasting, digital radio mondiale). Figure 11.6 shows the overall architecture of the ANNABELLE SoC. The ANNABELLE SoC consists of an ARM926 GPP with a five-layer AMBA AHB, four MONTIUM TPs, an NoC, a Viterbi decoder, two ADCs, two DDCs, a DMA controller, SRAM/SDRAM memory interfaces, and external bus interfaces.
The four MONTIUM TPs and the NoC are arranged in a reconfigurable subsystem, labelled "reconfigurable fabric." The reconfigurable fabric is connected to the AHB bus and serves as a slave to the AMBA system. A configurable clock controller generates the clocks for the individual MONTIUM TPs. Every individual MONTIUM TP has its own adjustable clock and runs at its own speed. A prototype chip of the ANNABELLE SoC has been produced using the Atmel 130 nm CMOS process [8].
The reconfigurable fabric that is integrated in the ANNABELLE SoC is shown in detail in Figure 11.7.
FIGURE 11.6
The overall architecture of the ANNABELLE SoC: an ARM926 GPP, clock controller, DMA controller, four MONTIUM TPs with CCUs, two DDCs, two ADCs, an SRAM/SDRAM controller, an IRQ controller, a Viterbi decoder, and an external bus interface on a five-layer AMBA advanced high-performance bus.
FIGURE 11.7
The ANNABELLE SoC reconfigurable fabric
The reconfigurable fabric acts as a reconfigurable coprocessor for the ARM926 processor. Computationally intensive DSP algorithms are typically offloaded from the ARM926 processor and processed on the coarse-grained reconfigurable MONTIUM TPs inside the reconfigurable fabric. The reconfigurable fabric contains four MONTIUM TPs, which are connected via a CCU to a circuit-switched NoC. The reconfigurable fabric is connected to the AMBA system through an AHB–NoC bridge interface. Configurations, generated at design time, can be loaded onto the MONTIUM TPs at run time. The reconfigurable fabric provides "block mode" and "streaming mode" computation services.
For ASIC synthesis, worst-case military conditions are assumed. In particular, the supply voltage is 1.1 V and the temperature is 125°C. Results obtained with the synthesis are as follows:
• The area of one MONTIUM core is 3.5 mm², of which 0.2 mm² is for the CCU and 3.3 mm² is for the MONTIUM TP (including memory).
• With Synopsys tooling, we estimated that the MONTIUM TP, within the ANNABELLE ASIC realization, can implement an FIR filter at about 100 MHz or an FFT at 50 MHz. The worst-case clock frequency of the ANNABELLE chip is 25 MHz.
• With the Synopsys PrimePower tool, we estimated the energy consumption using placed-and-routed netlists. The following section provides some of the results.
11.3.1.4 Average Power Consumption
To determine the average power consumption of the ANNABELLE as accurately as possible, we performed a number of power estimations on the placed-and-routed netlist using the Synopsys Power Compiler. Table 11.2 provides the dynamic power consumption in mW/MHz of various MONTIUM blocks for three well-known DSP algorithms. These figures show that the overhead of the sequencer and decoder is low: less than 16% of the total dynamic power consumption. Finally, Table 11.3 compares the energy consumption of the MONTIUM and the ARM926 on ANNABELLE. For the FIR-5 algorithm the memory is not used.
11.3.1.5 Locality of Reference
As mentioned above, locality of reference is an important design parameter. One of the reasons for the excellent energy figures of the MONTIUM is the use of locality of reference. To illustrate this, Table 11.4 gives the number of memory references local to the cores compared to the number of off-core communications. These figures are, as expected, algorithm dependent. Therefore, in this table we chose three well-known algorithms from the streaming DSP application domain: a 1024p FFT, a 200 tap FIR filter, and a part of a Turbo decoder (the SISO algorithm [17]).
TABLE 11.4
Internal and External Memory References per Execution of an Algorithm

                               Number of Memory References
Algorithm                      Internal     External     Internal/External
1024p FFT                      51200        4096         12.5
200 tap FIR                    405          2            202.5
SISO algorithm (N softbits)    18*N         3*N          6
TABLE 11.5
Reconfiguration of Algorithms on the MONTIUM

Reconfiguration      Updated Data          Size          Clock Cycles
1024p FFT to iFFT    Scaling factors       ≤150 bit      ≤10
                     Twiddle factors       16384 bit     512
200 tap FIR          Filter coefficients   ≤3200 bit     ≤80
The results show that for these algorithms 80%–99% of the memory references are local (within a tile).
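For example, using the 1024p FFT row of Table 11.4, the fraction of references that stay inside the tile is

\[
\frac{51200}{51200 + 4096} = \frac{51200}{55296} \approx 0.93,
\]

that is, roughly 93% of the FFT's memory references are local. The corresponding figures for the 200 tap FIR and the SISO algorithm are about 99.5% and 86%, consistent with the 80%–99% range quoted above.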
11.3.1.6 Partial Dynamic Reconfiguration
One of the advantages of a multicore SoC organization is that each individual core can be reconfigured while the other cores remain operational. In the MONTIUM, the configuration memory is organized as a RAM. This means that to reconfigure the MONTIUM, the entire configuration memory need not be rewritten; only the parts that change are updated. Furthermore, because the MONTIUM has a coarse-grained reconfigurable architecture, the configuration memory is relatively small: the MONTIUM has a configuration size of only 2.6 kB. Table 11.5 gives some examples of reconfigurations.
To reconfigure a MONTIUM from executing a 1024-point FFT to executing a 1024-point inverse FFT requires updating the scaling and twiddle factors. Updating these factors requires less than 522 clock cycles in total. Changing the coefficients of a 200 tap FIR filter requires less than 80 clock cycles.
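As an illustration of such a partial update, the sketch below rewrites only the twiddle and scaling factors of a tile whose configuration memory is mapped into the host address space. The base address, offsets, and word counts are assumptions for illustration, not the actual ANNABELLE register map.

```c
/* Sketch of partial dynamic reconfiguration of a MONTIUM-style tile.
 * Because the configuration memory is ordinary RAM, switching from a
 * 1024-point FFT to the inverse FFT only rewrites the scaling and twiddle
 * factors instead of the whole 2.6 kB configuration. All addresses and
 * sizes below are illustrative assumptions. */
#include <stddef.h>
#include <stdint.h>

#define CFG_BASE         0x40000000u  /* assumed tile configuration base     */
#define CFG_SCALING_OFF  0x0100u      /* assumed offset of the scaling table */
#define CFG_TWIDDLE_OFF  0x0200u      /* assumed offset of the twiddle table */

static void cfg_write(uint32_t offset, const uint32_t *data, size_t words)
{
    volatile uint32_t *dst = (volatile uint32_t *)(uintptr_t)(CFG_BASE + offset);
    for (size_t i = 0; i < words; i++)
        dst[i] = data[i];             /* one configuration word per write    */
}

/* Reconfigure a tile from a 1024-point FFT to a 1024-point inverse FFT:
 * 512 words of twiddle factors (16384 bit) and at most 10 words of scaling
 * factors are rewritten; everything else is left untouched. */
void fft_to_ifft(const uint32_t *ifft_twiddles, const uint32_t *ifft_scaling)
{
    cfg_write(CFG_TWIDDLE_OFF, ifft_twiddles, 512);
    cfg_write(CFG_SCALING_OFF, ifft_scaling, 10);
}
```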
11.3.2 Aspex Linedancer
The Linedancer [4] is an "associative" processor and an example of a homogeneous SoC. Associative processing is the property of instructions to execute only on those PEs where a certain value in their data register matches a value in the instruction. Associative processing is built around an intelligent memory concept: content addressable memory (CAM). Unlike standard computer memory (random access memory, or RAM), in which the user supplies a memory address and the RAM returns the data word stored at that address, a CAM is designed such that the user supplies a data word and the CAM searches its entire memory to see if that data word is stored anywhere in it. If the data word is found, the CAM returns a tag list of one or more storage addresses where the word was found. Each CAM line that contains a word can be seen as a processor element (PE), and each tag list element as a 1-bit condition register. Depending on this register, the aggregate associative processor can either instruct the PEs to continue processing on the indicated subset or return the involved words subsequently for further processing. Several implementations are possible, varying from bit serial to word parallel, but the latest implementations [4,5] can perform the involved lookups in parallel in a single clock cycle.
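The following plain-C sketch models this behaviour in software: an associative search sets the 1-bit tag of every PE whose stored word matches the supplied key, and a subsequent operation executes only on the tagged subset. Sizes and names are illustrative; the real hardware performs the compare on all lines in parallel.

```c
/* Software model of the associative (CAM) lookup described above. Each CAM
 * line corresponds to a PE; its 1-bit tag acts as the condition register.
 * The loop models what the hardware does in parallel in one cycle. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PE 4096                   /* one CAM line per processor element */

typedef struct {
    uint32_t word[NUM_PE];            /* content of each CAM line           */
    bool     tag[NUM_PE];             /* 1-bit condition register per PE    */
} assoc_array_t;

/* Associative search: set the tag of every PE whose word matches `key`.   */
static size_t cam_search(assoc_array_t *a, uint32_t key)
{
    size_t matches = 0;
    for (size_t pe = 0; pe < NUM_PE; pe++) {      /* parallel in hardware   */
        a->tag[pe] = (a->word[pe] == key);
        matches += a->tag[pe];
    }
    return matches;
}

/* Conditional (associative) execution: only tagged PEs run the operation.  */
static void cam_add_where_tagged(assoc_array_t *a, uint32_t delta)
{
    for (size_t pe = 0; pe < NUM_PE; pe++)
        if (a->tag[pe])
            a->word[pe] += delta;
}
```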
In general, the Linedancer belongs to the subclass of massively parallel SIMD architectures, with typically more than 512 processors. This SIMD subclass is perfectly suited to support data parallelism, for example, for signal, image, and video processing; text retrieval; and large databases. The associative functions furthermore allow the processor to function like an intelligent memory (CAM), permitting high-speed searching and data-dependent image processing operations (such as median filters and object recognition/labeling).
The so-called "ASProCore" of the Linedancer is designed around a very large number (up to 4,096) of simple PEs arranged in a line; see Figure 11.8. Application areas are diverse but have in common the simple processing of very large amounts of data, from samples in 1D streams to pixels in 2D or 3D images. To mention a few: software-defined radio (e.g., WiMAX), broadcast (video compression), medical imaging (3D reconstruction), and high-end printers, in particular for raster image processing (RIP).
FIGURE 11.8
The scalable architecture of Linedancer
FIGURE 11.9
The architecture of Linedancer’s associative string processor (ASProCore)
In the following sections, the associative processor (ASProCore) and the Linedancer family are introduced. At the end, we present the development tool chain and a brief conclusion on the Linedancer application domain.
11.3.2.1 ASProCore Architecture
Each PE has a 1-bit ALU, a 32–64 bit fully associative memory array, and 128 bit of extended memory. See Figure 11.9 for a detailed view of the ASProCore architecture. The processors are connected in a 1D network, actually a 4K bit shift register, between the indicated "left link port" (LLP) and "right link port" (RLP). The network allows data to be shared between PEs with minimum overhead. The ASProCore also has a separate word-serial, bit-parallel memory, the primary data store (PDS), for high-speed data input. The on-chip DMA engine automatically translates 2D and 3D images into the 1D array (passed through via the PDS). The 1D architecture allows for linear scaling of performance, memory, and communication, provided the application is expressed in a scalable manner. The Linedancer also features a single or dual RISC core (P1 and HD, respectively) for sequential processing and for controlling the ASProCore.
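A simple software model of this 1D organization is sketched below: each PE combines its own value with the values of its left and right neighbours to compute a 3-point median, one of the data-dependent image operations mentioned earlier. The array size and boundary handling are illustrative.

```c
/* Plain-C software model of the ASProCore's 1D PE line: every PE combines
 * its own value with the values of its two neighbours (the ends of the line
 * would receive data through the LLP and RLP in a multi-chip system; here
 * the boundary PEs simply reuse their own value). */
#include <stddef.h>
#include <stdint.h>

#define NUM_PE 4096

static uint8_t median3(uint8_t a, uint8_t b, uint8_t c)
{
    if (a > b) { uint8_t t = a; a = b; b = t; }   /* now a <= b            */
    if (b > c) { uint8_t t = b; b = c; c = t; }   /* now c is the maximum  */
    return (a > b) ? a : b;                       /* median of the three   */
}

/* One lock-step "instruction" of the line: in hardware all PEs execute this
 * simultaneously; the loop below models that parallelism sequentially.     */
void line_median3(const uint8_t in[NUM_PE], uint8_t out[NUM_PE])
{
    for (size_t pe = 0; pe < NUM_PE; pe++) {
        uint8_t left  = (pe > 0)          ? in[pe - 1] : in[pe];
        uint8_t right = (pe < NUM_PE - 1) ? in[pe + 1] : in[pe];
        out[pe] = median3(left, in[pe], right);
    }
}
```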
11.3.2.2 Linedancer Hardware Architecture
The current Linedancers, the P1 and the HD, have been realized in a 0.13 μm CMOS process. Both have one or two 32-bit SPARC cores with 128 kB of internal program memory. System clock frequencies are 300, 350, or 400 MHz. The Linedancer-P1 integrates an associative processor (ASProCore, with 4K PEs), a single SPARC core with a 4 kB instruction cache, and a DMA controller capable of transferring 64 bit at 66 MHz over a PCI interface, as shown in Figure 11.10. It further hosts 128 kB of internal data memory. The chip consumes 3.5 W typical at 300 MHz.
FIGURE 11.10
The Linedancer-P1 layout
The Linedancer-HD integrates two associative processors (2× 2K PEs), two SPARC cores, each with an 8 kB instruction cache and a 4 kB data cache, four internal DMA engines, and an external data channel capable of transferring 64 bit at 133 MHz over a PCI-X interface, as shown in Figure 11.11. The ASProCore has been extended with a chordal ring inter-PE communication network that allows for faster 2D- and 3D-image processing. It further hosts four external DDR2 DRAM interfaces, eight dedicated streaming data I/O ports (up to 3.2 GB/s), and 1088 kB of internal data memory. The chip consumes 4.5 W typical at 300 MHz.
11.3.2.3 Design Methodology
The software development environment for the Linedancer consists of a compiler, linker, and debugger. The Linedancer is programmed in C, with some parallel extensions to support the ASProCore processing array. The toolchain is based on the GNU compiler framework, with dedicated pre- and postprocessing tools to compile and optimise the parallel extensions to C.
Associative SIMD processing adds an extra dimension to massively parallel processing, enabling new views on problem modeling and the subsequent implementation (for example, in searching/sorting and data-dependent image processing). The Linedancer's 1D architecture scales better than the 2D arrays used in multi-ALU architectures such as PACT's XPP [6] or Tilera's 2D multicore array [7]. Because of the large size of the array, however, power consumption is relatively high compared to the MONTIUM processor and prevents application in handheld devices.
FIGURE 11.11
The Linedancer-HD layout

11.3.3 PACT-XPP
11.3.3.1 Architecture
The XPP architecture is based on a hierarchical array of coarse-grained, adaptive computing elements, called processing array elements (PAEs). The PAEs are clustered in processing array clusters (PACs). All PAEs in the XPP architecture are connected through a packet-oriented communication network. Figure 11.12 shows the hierarchical structure of the XPP array and the PAEs clustered in a PAC.
Different PAEs are identified in the XPP array: "ALU-PAE," "RAM-PAE," and "FNC-PAE." The ALU-PAE contains a multiplier and is used for DSP operations. The RAM-PAE contains a RAM to store data. The FNC-PAE is a unique sequential VLIW-like processor core. The FNC-PAEs are dedicated to the control flow and sequential sections of applications. Every PAC contains ALU-PAEs, RAM-PAEs, and FNC-PAEs. The PAEs operate according to a data flow principle: a PAE starts processing data as soon as all required input packets are available. If a packet cannot be processed, the pipeline stalls until the packet is received.
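The firing rule can be illustrated with a small software model, sketched below: a PAE with two input queues consumes one packet from each and produces a result only when both inputs hold data and the output queue has room; otherwise it stalls for that cycle. The queue depth and the ALU operation are illustrative and do not reflect PACT's implementation.

```c
/* Minimal software model of the data-flow firing rule described above. */
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_DEPTH 4

typedef struct {
    int32_t data[QUEUE_DEPTH];
    int     count;                               /* packets in the queue */
} packet_queue_t;

static bool has_packet(const packet_queue_t *q) { return q->count > 0; }

static int32_t pop(packet_queue_t *q)
{
    int32_t v = q->data[0];
    for (int i = 1; i < q->count; i++)
        q->data[i - 1] = q->data[i];
    q->count--;
    return v;
}

static bool push(packet_queue_t *q, int32_t v)
{
    if (q->count == QUEUE_DEPTH) return false;   /* downstream back-pressure */
    q->data[q->count++] = v;
    return true;
}

/* One evaluation step of an ALU-PAE with two inputs and one output: it
 * fires only when both input packets are present and the output queue has
 * room; otherwise the pipeline stalls for this cycle. */
bool alu_pae_step(packet_queue_t *in_a, packet_queue_t *in_b, packet_queue_t *out)
{
    if (!has_packet(in_a) || !has_packet(in_b) || out->count == QUEUE_DEPTH)
        return false;                            /* stall */
    return push(out, pop(in_a) * pop(in_b));     /* example ALU operation */
}
```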
FIGURE 11.12
The structure of an XPP array composed of four PACs. (From Baumgarte, V. et al., J. Supercomput., 26(2), 167, September 2003.)
Each PAC is controlled by a configuration manager (CM). The CM is responsible for writing configuration data into the configurable objects of the PAC. Multi-PAC XPP arrays contain additional CMs for concurrent configuration data handling, arranged in a hierarchical tree of CMs. The top CM, called the supervising CM (SCM), has an external interface (not shown in Figure 11.12) that connects it to an external configuration memory.
11.3.3.2 Design Methodology
DSP algorithms are directly mapped onto the XPP array according to their data flow graphs. The flow graph nodes define the functionality and operations of the PAEs, whereas the edges define the connections between the PAEs. The XPP array is programmed using the native mapping language (NML); see [20]. In NML descriptions, the PAEs are explicitly allocated and the connections between the PAEs are specified. Optionally, the allocated PAEs are placed onto the XPP array. NML also includes statements to support configuration handling. Configuration handling is an explicit part of the application description.
A vectorizing C compiler is available to translate C functions to NML modules. The vectorizing compiler for the XPP array analyzes the code for data dependencies, vectorizes those code sections automatically, and generates highly parallel code for the XPP array. The vectorizing C compiler is typically used to program "regular" DSP operations that are mapped on ALU-PAEs and RAM-PAEs of the XPP array. Furthermore, a coarse-grained parallelization into several FNC-PAE threads is very useful when "irregular" DSP operations exist in an application. This allows running even irregular, control-dominated code in parallel on several FNC-PAEs. The FNC-PAE C compiler is similar to a conventional RISC compiler, extended with VLIW features to take advantage of ILP within the DSP algorithms.
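The distinction between "regular" and "irregular" code can be illustrated in plain C (not NML): the first function below is a fixed-length multiply-accumulate loop with no data-dependent control flow, the kind of code a vectorizing compiler can map onto ALU- and RAM-PAEs, whereas the second branches on the data itself and is better suited to sequential execution on an FNC-PAE.

```c
/* Illustration of "regular" vs. "irregular" DSP code, in plain C. */
#include <stddef.h>
#include <stdint.h>

/* Regular: fixed-length multiply-accumulate over arrays, easily vectorized. */
int32_t dot_product(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Irregular: data-dependent control flow (run-length encoding). */
size_t run_length_encode(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        uint8_t v = in[i];
        size_t run = 1;
        while (i + run < n && in[i + run] == v && run < 255)
            run++;
        out[o++] = v;
        out[o++] = (uint8_t)run;
        i += run;
    }
    return o;                         /* number of output bytes */
}
```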
11.3.4 Tilera
The Tile64 [7] is a TP based on the mesh architecture that was originally developed for the RAW machine [26]. The chip consists of a grid of processor tiles arranged in a network (see Figure 11.13), where each tile consists of a GPP, a cache, and a nonblocking router that the tile uses to communicate with the other tiles on the chip.
The Tilera processor architecture incorporates a 2D array of homogeneous, general-purpose cores. Next to each processor there is a switch that connects the core to the iMesh on-chip network. The combination of a core and a switch forms the basic building block of the Tilera processor: the tile. Each core is a fully functional processor capable of running complete operating systems and off-the-shelf C code. Each core is optimized to provide a high performance/power ratio, running at speeds between 600 MHz and 1 GHz, with power consumption as low as 170 mW in a typical application. Each core supports standard processor features such as
• Full access to memory and I/O
• Virtual memory mapping and protection (MMU/TLB)
• Hierarchical cache with separate L1-I and L1-D
• Multilevel interrupt support
• Three-way VLIW pipeline to issue three instructions per cycle
The cache subsystem on each tile consists of a high-performance, two-level, non-blocking cache hierarchy. Each processor/tile has a split level 1 cache (L1 instruction and L1 data) and a level 2 cache, keeping the design fast and power efficient. When there is a miss in the level 2 cache of a specific processor, the level 2 caches of the other processors are searched for the data before external memory is consulted. In this way, a large level 3 cache is emulated.
This promotes on-chip access and avoids the bottleneck of off-chip global memory. Multicore coherent caching allows a page of shared memory, cached on a specific tile, to be accessed via load/store references from other tiles. Since one tile effectively prefetches for the others, this technique can yield significant performance improvements.
To fully exploit the available compute power of large numbers of processors, a high-bandwidth, low-latency interconnect is essential. The network (iMesh) provides the high-speed data transfer needed to minimize system bottlenecks and to scale applications. iMesh consists of five distinct mesh networks: two networks are completely managed by hardware and are used to move data to and from the tiles and memory in the event of cache misses or DMA transfers; the three remaining networks are available for application use, enabling communication between cores and between cores and I/O devices. A number of high-level abstractions are supplied for accessing the hardware (e.g., socket-like streaming channels and message-passing interfaces). The iMesh network enables communication without interrupting applications running on the tiles. It facilitates data transfer between tiles, contains all of the control and datapath for each of the network connections, and implements buffering and flow control within all the networks.

11.3.4.1 Design Methodology
The TILE64 processor is programmable in ANSI standard C and C++. Tiles can be grouped into clusters to apply the appropriate amount of processing power to each application, and parallelism can be explicitly specified.
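Since the Tilera-specific libraries are not covered here, the sketch below uses standard POSIX threads (which the tiles can run, as each core runs a complete operating system) to spread a simple data-parallel job over a cluster of cores; the cluster size and the work function are illustrative assumptions.

```c
/* Sketch of explicitly specified parallelism in plain C: one worker thread
 * per tile in an assumed cluster, each processing its own slice of the data. */
#include <pthread.h>
#include <stddef.h>

#define CLUSTER_SIZE 8                 /* assumed number of tiles in a cluster */

typedef struct {
    const float *in;
    float       *out;
    size_t       begin, end;           /* half-open range of elements */
} slice_t;

static void *scale_slice(void *arg)    /* per-tile worker */
{
    slice_t *s = (slice_t *)arg;
    for (size_t i = s->begin; i < s->end; i++)
        s->out[i] = 2.0f * s->in[i];
    return NULL;
}

void scale_parallel(const float *in, float *out, size_t n)
{
    pthread_t tid[CLUSTER_SIZE];
    slice_t   job[CLUSTER_SIZE];
    size_t    chunk = (n + CLUSTER_SIZE - 1) / CLUSTER_SIZE;

    for (int t = 0; t < CLUSTER_SIZE; t++) {
        job[t].in    = in;
        job[t].out   = out;
        job[t].begin = (size_t)t * chunk;
        job[t].end   = (job[t].begin + chunk < n) ? job[t].begin + chunk : n;
        pthread_create(&tid[t], NULL, scale_slice, &job[t]);
    }
    for (int t = 0; t < CLUSTER_SIZE; t++)
        pthread_join(tid[t], NULL);
}
```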
11.4 Conclusion
In this chapter, we addressed reconfigurable multicore architectures for streaming DSP applications. Streaming DSP applications express computation as a data flow graph with streams of data items (the edges) flowing between computation kernels (the nodes). Typical examples of streaming DSP applications are wireless baseband processing, multimedia processing, medical image processing, and sensor processing. These application domains require flexible and energy-efficient architectures, which can be realized with a multicore architecture. The most important criteria for designing such a multicore architecture are predictability and composability, energy efficiency, programmability, and dependability. Two other important criteria are performance and flexibility. Different types of processing cores have been discussed, from ASICs and reconfigurable hardware to DSPs and GPPs. ASICs have high performance but suffer from poor flexibility, while DSPs and GPPs offer flexibility but modest performance. Reconfigurable hardware combines the best of both worlds. These different processing cores are, together with memory and I/O blocks, assembled into MP-SoCs. MP-SoCs can be classified into two groups: homogeneous and heterogeneous. In homogeneous MP-SoCs, multiple cores of a single type are combined, whereas in a heterogeneous MP-SoC, multiple cores of different types are combined.
We also discussed four different architectures: the MONTIUM/ANNABELLE SoC, the Aspex Linedancer, the PACT-XPP, and the Tilera processor. The MONTIUM, a coarse-grained, run-time reconfigurable core, has been used as one of the building blocks of the ANNABELLE SoC. The ANNABELLE SoC can be classified as a heterogeneous MP-SoC. The Aspex Linedancer is a homogeneous MP-SoC where a single instruction is executed by multiple processors simultaneously (SIMD). The PACT-XPP is an array processor where multiple ALUs are combined in a 2D structure. The Tilera processor is an example of a homogeneous MIMD MP-SoC.
References
1. The International Technology Roadmap for Semiconductors, ITRS Roadmap 2003 Website, 2003. http://public.itrs.net/Files/2003ITRS/Home2003.htm
2. A coarse-grained reconfigurable architecture template and its compilation techniques. PhD thesis, Katholieke Universiteit Leuven, Leuven, Belgium, January 2005.
3. Nvidia G80, architecture and GPU analysis, 2007.
4. Aspex Semiconductor: Technology Website, 2008. http://www.aspex-semi.com/q/technology.shtml
5. Mimagic 6+ Enables Exciting Multimedia for Feature Phones Website, 2008. http://www.neomagic.com/product/MiMagig6+_Product_Brief.pdf
6. PACT. http://www.pactxpp.com/main/index.php, 2008.
7. Tilera Corporation. http://www.tilera.com/, 2008.
8. Atmel Corporation. ATC13 Summary. http://www.atmel.com, 2007.
9. A. Banerjee, P.T. Wolkotte, R.D. Mullins, S.W. Moore, and G.J.M. Smit. An energy and performance exploration of network-on-chip architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 17(3):319–329, March 2009.
10. V. Baumgarte, G. Ehlers, F. May, A. Nückel, M. Vorbach, and M. Weinhardt. PACT XPP—A self-reconfigurable data processing architecture. Journal of Supercomputing, 26(2):167–184, September 2003.
11. M.D. van de Burgwal, G.J.M. Smit, G.K. Rauwerda, and P.M. Heysters. Hydra: An energy-efficient and reconfigurable network interface. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'06), Las Vegas, NV, pp. 171–177, June 2006.
12. G. Burns, P. Gruijters, J. Huisken, and A. van Wel. Reconfigurable accelerator enabling efficient SDR for low-cost consumer devices. In SDR Technical Forum, Orlando, FL, November 2003.
13. A.P. Chandrakasan, S. Sheng, and R.W. Brodersen. Low-power CMOS digital design. IEEE Journal of Solid-State Circuits, 27(4):473–484, April 1992.
14. W.J. Dally, U.J. Kapasi, B. Khailany, J.H. Ahn, and A. Das. Stream processors: Programmability and efficiency. Queue, 2(1):52–62, 2004.
15. European Telecommunication Standard Institute (ETSI). Broadband Radio Access Networks (BRAN); HIPERLAN Type 2; Physical (PHY) Layer, ETSI TS 101 475 v1.2.2 edition, February 2001.
16. Y. Guo. Mapping applications to a coarse-grained reconfigurable architecture. PhD thesis, University of Twente, Enschede, the Netherlands, September 2006.
17. P.M. Heysters, L.T. Smit, G.J.M. Smit, and P.J.M. Havinga. Max-log-MAP mapping on an FPFA. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'02), Las Vegas, NV, pp. 90–96, June 2002. CSREA Press, Las Vegas, NV.
18. P.M. Heysters. Coarse-grained reconfigurable processors – flexibility meets efficiency. PhD thesis, University of Twente, Enschede, the Netherlands, September 2004.
19. R.P. Kleihorst, A.A. Abbo, A. van der Avoird, M.J.R. Op de Beeck, L. Sevat, P. Wielage, R. van Veen, and H. van Herten. Xetal: A low-power high-performance smart camera processor. IEEE International Symposium on Circuits and Systems, ISCAS 2001, 5:215–218, 2001.
20. PACT XPP Technologies. http://www.pactcorp.com, 2007.
21. D.C. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey et al. Overview of the architecture, circuit design, and physical implementation of a first-generation Cell processor. IEEE Journal of Solid-State Circuits, 41(1):179–196, January 2006.
22. G.K. Rauwerda, P.M. Heysters, and G.J.M. Smit. Towards software defined radios using coarse-grained reconfigurable hardware. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(1):3–13, January 2008.
23. Recore Systems. http://www.recoresystems.com, 2007.
24. G.J.M. Smit, A.B.J. Kokkeler, P.T. Wolkotte, and M.D. van de Burgwal. Multi-core architectures and streaming applications. In I. Mandoiu and A. Kennings (editors), Proceedings of the Tenth International Workshop on System-Level Interconnect Prediction (SLIP 2008), New York, pp. 35–42, April 2008. ACM Press, New York.
25. S.R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan et al. An 80-tile sub-100-W teraflops processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits, 43(1):29–41, January 2008.
26. E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim et al. Baring it all to software: Raw machines. Computer, 30(9):86–93, September 1997.