The MONTIUM design methodology to map DSP applications onto the MONTIUM TP is divided into three steps:
1. The high-level description of the DSP application is analyzed and computationally intensive DSP kernels are identified (a plain-C sketch of such a kernel is given after this list).
2. The identified DSP kernels, or parts of the DSP kernels, are mapped onto one or multiple MONTIUM TPs that are available in a SoC. The DSP operations are programmed on the MONTIUM TP using MONTIUM C.
3. Depending on the layout of the SoC in which the MONTIUM processing tiles are applied, the MONTIUM processing tiles are configured for a particular DSP kernel or part of the DSP kernel. Furthermore, the channels in the NoC between the processing tiles are configured.
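To make step 1 concrete, the sketch below shows, in plain C, the kind of computationally intensive kernel (a FIR filter) that would typically be identified and then mapped onto a MONTIUM TP. It is illustrative only and does not use the actual MONTIUM C dialect or its intrinsics.

```c
/* Minimal plain-C sketch of a computationally intensive DSP kernel (a FIR
 * filter) of the kind identified in step 1 and mapped onto a MONTIUM TP in
 * step 2. Standard C for illustration; not MONTIUM C. */
#include <stddef.h>
#include <stdint.h>

/* y[n] = sum_{k=0}^{taps-1} h[k] * x[n-k], computed for n >= taps-1 */
void fir_kernel(const int16_t *x, const int16_t *h,
                int32_t *y, size_t len, size_t taps)
{
    for (size_t n = taps - 1; n < len; n++) {
        int32_t acc = 0;
        for (size_t k = 0; k < taps; k++)
            acc += (int32_t)h[k] * x[n - k];
        y[n] = acc;
    }
}
```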
11.3.1.3 ANNABELLE Heterogeneous System-on-Chip
In this section, the prototype ANNABELLE SoC is described according to the heterogeneous SoC template mentioned before. The SoC is intended to be used for digital radio broadcasting receivers (e.g., digital audio broadcasting, digital radio mondiale). Figure 11.6 shows the overall architecture of the ANNABELLE SoC. The ANNABELLE SoC consists of an ARM926 GPP with a five-layer AMBA AHB, four MONTIUM TPs, an NoC, a Viterbi decoder, two ADCs, two DDCs, a DMA controller, SRAM/SDRAM memory interfaces, and external bus interfaces.
The four MONTIUM TPs and the NoC are arranged in a reconfigurable subsystem, labelled "reconfigurable fabric." The reconfigurable fabric is connected to the AHB bus and serves as a slave to the AMBA system. A configurable clock controller generates the clocks for the individual MONTIUM TPs. Every individual MONTIUM TP has its own adjustable clock and runs at its own speed. A prototype chip of the ANNABELLE SoC has been produced using the Atmel 130 nm CMOS process [8].
The reconfigurable fabric that is integrated in the ANNABELLE SoC is shown in detail in Figure 11.7.
FIGURE 11.6
The overall architecture of the ANNABELLE SoC: an ARM926 GPP, clock controller, DMA controller, four MONTIUM TPs with CCUs, two DDCs, two ADCs, an SRAM/SDRAM controller, an IRQ controller, a Viterbi decoder, and an external bus interface on a five-layer AMBA advanced high-performance bus.
FIGURE 11.7
The ANNABELLE SoC reconfigurable fabric
The reconfigurable fabric acts as a reconfigurable coprocessor for the ARM926 processor. Computationally intensive DSP algorithms are typically offloaded from the ARM926 processor and processed on the coarse-grained reconfigurable MONTIUM TPs inside the reconfigurable fabric. The reconfigurable fabric contains four MONTIUM TPs, which are connected via a CCU to a circuit-switched NoC. The reconfigurable fabric is connected to the AMBA system through an AHB–NoC bridge interface. Configurations, generated at design time, can be loaded onto the MONTIUM TPs at run time. The reconfigurable fabric provides "block mode" and "streaming mode" computation services.
For ASIC synthesis, worst-case military conditions are assumed. In particular, the supply voltage is 1.1 V and the temperature is 125°C. Results obtained with the synthesis are as follows:
• The area of one MONTIUM core is 3.5 mm², of which 0.2 mm² is for the CCU and 3.3 mm² is for the MONTIUM TP (including memory).
• With Synopsys tooling, we estimated that the MONTIUM TP, within the ANNABELLE ASIC realization, can implement an FIR filter at about 100 MHz or an FFT at 50 MHz. The worst-case clock frequency of the ANNABELLE chip is 25 MHz.
• With the Synopsys PrimePower tool, we estimated the energy consumption using placed-and-routed netlists. The following section provides some of the results.
11.3.1.4 Average Power Consumption
To determine the average power consumption of the ANNABELLE as accurately as possible, we performed a number of power estimations on the placed-and-routed netlist using the Synopsys Power Compiler. Table 11.2 provides the dynamic power consumption in mW/MHz of various MONTIUM blocks for three well-known DSP algorithms. These figures show that the overhead of the sequencer and decoder is low: less than 16% of the total dynamic power consumption. Finally, Table 11.3 compares the energy consumption of the MONTIUM and the ARM926 on ANNABELLE. For the FIR-5 algorithm the memory is not used.
11.3.1.5 Locality of Reference
As mentioned above, locality of reference is an important design parameter. One of the reasons for the excellent energy figures of the MONTIUM is the use of locality of reference. To illustrate this, Table 11.4 gives the number of memory references local to the cores compared to the number of off-core communications. These figures are, as expected, algorithm dependent. Therefore, in this table we chose three well-known algorithms from the streaming DSP application domain: a 1024p FFT, a 200 tap FIR filter, and a part of a Turbo decoder (the SISO algorithm [17]).
TABLE 11.4
Internal and External Memory References per Execution of an Algorithm

                               Number of Memory References
Algorithm                      Internal     External     Internal/External
1024p FFT                      51200        4096         12.5
200 tap FIR                    405          2            202.5
SISO algorithm (N softbits)    18*N         3*N          6
TABLE 11.5
Reconfiguration of Algorithms on the MONTIUM

Reconfiguration      Updated Data          Size          Clock Cycles
1024p FFT to iFFT    Scaling factors       ≤150 bit      ≤10
                     Twiddle factors       16384 bit     512
200 tap FIR          Filter coefficients   ≤3200 bit     ≤80
The results show that for these algorithms 80%–99% of the memory references are local (within a tile).
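For example, using the 1024p FFT row of Table 11.4, the fraction of references that stay inside the tile is

\[
\frac{51200}{51200 + 4096} = \frac{51200}{55296} \approx 0.93,
\]

that is, roughly 93% of the FFT's memory references are local. The corresponding figures for the 200 tap FIR and the SISO algorithm are about 99.5% and 86%, consistent with the 80%–99% range quoted above.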
11.3.1.6 Partial Dynamic Reconfiguration
One of the advantages of a multicore SoC organization is that each individual core can be reconfigured while the other cores remain operational. In the MONTIUM, the configuration memory is organized as a RAM. This means that to reconfigure the MONTIUM, the entire configuration memory need not be rewritten; only the parts that change are updated. Furthermore, because the MONTIUM has a coarse-grained reconfigurable architecture, the configuration memory is relatively small: the MONTIUM has a configuration size of only 2.6 kB. Table 11.5 gives some examples of reconfigurations.
To reconfigure a MONTIUM from executing a 1024-point FFT to executing a 1024-point inverse FFT requires updating the scaling and twiddle factors. Updating these factors requires less than 522 clock cycles in total. Changing the coefficients of a 200 tap FIR filter requires less than 80 clock cycles.
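As an illustration of such a partial update, the sketch below rewrites only the twiddle and scaling factors of a tile whose configuration memory is mapped into the host address space. The base address, offsets, and word counts are assumptions for illustration, not the actual ANNABELLE register map.

```c
/* Sketch of partial dynamic reconfiguration of a MONTIUM-style tile.
 * Because the configuration memory is ordinary RAM, switching from a
 * 1024-point FFT to the inverse FFT only rewrites the scaling and twiddle
 * factors instead of the whole 2.6 kB configuration. All addresses and
 * sizes below are illustrative assumptions. */
#include <stddef.h>
#include <stdint.h>

#define CFG_BASE         0x40000000u  /* assumed tile configuration base     */
#define CFG_SCALING_OFF  0x0100u      /* assumed offset of the scaling table */
#define CFG_TWIDDLE_OFF  0x0200u      /* assumed offset of the twiddle table */

static void cfg_write(uint32_t offset, const uint32_t *data, size_t words)
{
    volatile uint32_t *dst = (volatile uint32_t *)(uintptr_t)(CFG_BASE + offset);
    for (size_t i = 0; i < words; i++)
        dst[i] = data[i];             /* one configuration word per write    */
}

/* Reconfigure a tile from a 1024-point FFT to a 1024-point inverse FFT:
 * 512 words of twiddle factors (16384 bit) and at most 10 words of scaling
 * factors are rewritten; everything else is left untouched. */
void fft_to_ifft(const uint32_t *ifft_twiddles, const uint32_t *ifft_scaling)
{
    cfg_write(CFG_TWIDDLE_OFF, ifft_twiddles, 512);
    cfg_write(CFG_SCALING_OFF, ifft_scaling, 10);
}
```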
11.3.2 Aspex Linedancer
The Linedancer [4] is an "associative" processor and an example of a homogeneous SoC. Associative processing is the property of instructions to execute only on those PEs where a certain value in their data register matches a value in the instruction. Associative processing is built around an intelligent memory concept: content addressable memory (CAM). Unlike standard computer memory (random access memory, or RAM), in which the user supplies a memory address and the RAM returns the data word stored at that address, a CAM is designed such that the user supplies a data word and the CAM searches its entire memory to see if that data word is stored anywhere in it. If the data word is found, the CAM returns a tag list of one or more storage addresses where the word was found. Each CAM line that contains a word can be seen as a processor element (PE), and each tag list element as a 1-bit condition register. Depending on this register, the aggregate associative processor can either instruct the PEs to continue processing on the indicated subset or return the involved words subsequently for further processing. Several implementations are possible, varying from bit serial to word parallel, but the latest implementations [4,5] can perform the involved lookups in parallel in a single clock cycle.
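The following plain-C sketch models this behaviour in software: an associative search sets the 1-bit tag of every PE whose stored word matches the supplied key, and a subsequent operation executes only on the tagged subset. Sizes and names are illustrative; the real hardware performs the compare on all lines in parallel.

```c
/* Software model of the associative (CAM) lookup described above. Each CAM
 * line corresponds to a PE; its 1-bit tag acts as the condition register.
 * The loop models what the hardware does in parallel in one cycle. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PE 4096                   /* one CAM line per processor element */

typedef struct {
    uint32_t word[NUM_PE];            /* content of each CAM line           */
    bool     tag[NUM_PE];             /* 1-bit condition register per PE    */
} assoc_array_t;

/* Associative search: set the tag of every PE whose word matches `key`.   */
static size_t cam_search(assoc_array_t *a, uint32_t key)
{
    size_t matches = 0;
    for (size_t pe = 0; pe < NUM_PE; pe++) {      /* parallel in hardware   */
        a->tag[pe] = (a->word[pe] == key);
        matches += a->tag[pe];
    }
    return matches;
}

/* Conditional (associative) execution: only tagged PEs run the operation.  */
static void cam_add_where_tagged(assoc_array_t *a, uint32_t delta)
{
    for (size_t pe = 0; pe < NUM_PE; pe++)
        if (a->tag[pe])
            a->word[pe] += delta;
}
```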
In general, the Linedancer belongs to the subclass of massively parallel SIMD architectures, with typically more than 512 processors. This SIMD subclass is perfectly suited to support data parallelism, for example, for signal, image, and video processing; text retrieval; and large databases. The associative functions furthermore allow the processor to function like an intelligent memory (CAM), permitting high-speed searching and data-dependent image processing operations (such as median filters and object recognition/labeling).
The so-called "ASProCore" of the Linedancer is designed around a very large number (up to 4,096) of simple PEs arranged in a line; see Figure 11.8. Application areas are diverse but have in common the simple processing of very large amounts of data, from samples in 1D streams to pixels in 2D or 3D images. To mention a few: software-defined radio (e.g., WiMAX), broadcast (video compression), medical imaging (3D reconstruction), and high-end printers, in particular for raster image processing (RIP).
FIGURE 11.8
The scalable architecture of Linedancer
FIGURE 11.9
The architecture of Linedancer’s associative string processor (ASProCore)
In the following sections, the associative processor (ASProCore) and the Linedancer family are introduced. At the end, we present the development tool chain and a brief conclusion on the Linedancer application domain.
11.3.2.1 ASProCore Architecture
Each PE has a 1-bit ALU, a 32–64 bit fully associative memory array, and 128 bit of extended memory. See Figure 11.9 for a detailed view of the ASProCore architecture. The processors are connected in a 1D network, actually a 4K bit shift register, between the indicated "left link port" (LLP) and "right link port" (RLP). The network allows data to be shared between PEs with minimum overhead. The ASProCore also has a separate word-serial, bit-parallel memory, the primary data store (PDS), for high-speed data input. The on-chip DMA engine automatically translates 2D and 3D images into the 1D array (passed through via the PDS). The 1D architecture allows for linear scaling of performance, memory, and communication, provided the application is expressed in a scalable manner. The Linedancer also features a single or dual RISC core (P1 and HD, respectively) for sequential processing and for controlling the ASProCore.
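A simple software model of this 1D organization is sketched below: each PE combines its own value with the values of its left and right neighbours to compute a 3-point median, one of the data-dependent image operations mentioned earlier. The array size and boundary handling are illustrative.

```c
/* Plain-C software model of the ASProCore's 1D PE line: every PE combines
 * its own value with the values of its two neighbours (the ends of the line
 * would receive data through the LLP and RLP in a multi-chip system; here
 * the boundary PEs simply reuse their own value). */
#include <stddef.h>
#include <stdint.h>

#define NUM_PE 4096

static uint8_t median3(uint8_t a, uint8_t b, uint8_t c)
{
    if (a > b) { uint8_t t = a; a = b; b = t; }   /* now a <= b            */
    if (b > c) { uint8_t t = b; b = c; c = t; }   /* now c is the maximum  */
    return (a > b) ? a : b;                       /* median of the three   */
}

/* One lock-step "instruction" of the line: in hardware all PEs execute this
 * simultaneously; the loop below models that parallelism sequentially.     */
void line_median3(const uint8_t in[NUM_PE], uint8_t out[NUM_PE])
{
    for (size_t pe = 0; pe < NUM_PE; pe++) {
        uint8_t left  = (pe > 0)          ? in[pe - 1] : in[pe];
        uint8_t right = (pe < NUM_PE - 1) ? in[pe + 1] : in[pe];
        out[pe] = median3(left, in[pe], right);
    }
}
```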
11.3.2.2 Linedancer Hardware Architecture
The current Linedancers, the P1 and the HD, have been realized in a 0.13 μm CMOS process. Both have one or two 32-bit SPARC cores with 128 kB of internal program memory. System clock frequencies are 300, 350, or 400 MHz. The Linedancer-P1 integrates an associative processor (ASProCore, with 4K PEs), a single SPARC core with a 4 kB instruction cache, and a DMA controller capable of transferring 64 bit at 66 MHz over a PCI interface, as shown in Figure 11.10. It further hosts 128 kB of internal data memory. The chip consumes 3.5 W typical at 300 MHz.
FIGURE 11.10
The Linedancer-P1 layout
The Linedancer-HD integrates two associative processors (2× 2K PEs), two SPARC cores, each with an 8 kB instruction cache and a 4 kB data cache, four internal DMA engines, and an external data channel capable of transferring 64 bit at 133 MHz over a PCI-X interface, as shown in Figure 11.11. The ASProCore has been extended with a chordal ring inter-PE communication network that allows for faster 2D- and 3D-image processing. It further hosts four external DDR2 DRAM interfaces, eight dedicated streaming data I/O ports (up to 3.2 GB/s), and 1088 kB of internal data memory. The chip consumes 4.5 W typical at 300 MHz.
11.3.2.3 Design Methodology
The software development environment for the Linedancer consists of a compiler, linker, and debugger. The Linedancer is programmed in C, with some parallel extensions to support the ASProCore processing array. The toolchain is based on the GNU compiler framework, with dedicated pre- and postprocessing tools to compile and optimise the parallel extensions to C.
Associative SIMD processing adds an extra dimension to massively parallel processing, enabling new views on problem modeling and the subsequent implementation (for example, in searching/sorting and data-dependent image processing). The Linedancer's 1D architecture scales better than the 2D arrays used in multi-ALU architectures such as PACT's XPP [6] or Tilera's 2D multicore array [7]. Because of the large size of the array, however, power consumption is relatively high compared to the MONTIUM processor and prevents application in handheld devices.
FIGURE 11.11
The Linedancer-HD layout

11.3.3 PACT-XPP
11.3.3.1 Architecture
The XPP architecture is based on a hierarchical array of coarse-grained, adaptive computing elements, called processing array elements (PAEs). The PAEs are clustered in processing array clusters (PACs). All PAEs in the XPP architecture are connected through a packet-oriented communication network. Figure 11.12 shows the hierarchical structure of the XPP array and the PAEs clustered in a PAC.
Different PAEs are identified in the XPP array: "ALU-PAE," "RAM-PAE," and "FNC-PAE." The ALU-PAE contains a multiplier and is used for DSP operations. The RAM-PAE contains a RAM to store data. The FNC-PAE is a unique sequential VLIW-like processor core. The FNC-PAEs are dedicated to the control flow and sequential sections of applications. Every PAC contains ALU-PAEs, RAM-PAEs, and FNC-PAEs. The PAEs operate according to a data flow principle: a PAE starts processing data as soon as all required input packets are available. If a packet cannot be processed, the pipeline stalls until the packet is received.
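The firing rule can be illustrated with a small software model, sketched below: a PAE with two input queues consumes one packet from each and produces a result only when both inputs hold data and the output queue has room; otherwise it stalls for that cycle. The queue depth and the ALU operation are illustrative and do not reflect PACT's implementation.

```c
/* Minimal software model of the data-flow firing rule described above. */
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_DEPTH 4

typedef struct {
    int32_t data[QUEUE_DEPTH];
    int     count;                               /* packets in the queue */
} packet_queue_t;

static bool has_packet(const packet_queue_t *q) { return q->count > 0; }

static int32_t pop(packet_queue_t *q)
{
    int32_t v = q->data[0];
    for (int i = 1; i < q->count; i++)
        q->data[i - 1] = q->data[i];
    q->count--;
    return v;
}

static bool push(packet_queue_t *q, int32_t v)
{
    if (q->count == QUEUE_DEPTH) return false;   /* downstream back-pressure */
    q->data[q->count++] = v;
    return true;
}

/* One evaluation step of an ALU-PAE with two inputs and one output: it
 * fires only when both input packets are present and the output queue has
 * room; otherwise the pipeline stalls for this cycle. */
bool alu_pae_step(packet_queue_t *in_a, packet_queue_t *in_b, packet_queue_t *out)
{
    if (!has_packet(in_a) || !has_packet(in_b) || out->count == QUEUE_DEPTH)
        return false;                            /* stall */
    return push(out, pop(in_a) * pop(in_b));     /* example ALU operation */
}
```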
FIGURE 11.12
The structure of an XPP array composed of four PACs. (From Baumgarte, V. et al., J. Supercomput., 26(2), 167, September 2003.)
Each PAC is controlled by a configuration manager (CM). The CM is responsible for writing configuration data into the configurable objects of the PAC. Multi-PAC XPP arrays contain additional CMs for concurrent configuration data handling, arranged in a hierarchical tree of CMs. The top CM, called the supervising CM (SCM), has an external interface (not shown in Figure 11.12) that connects it to an external configuration memory.
11.3.3.2 Design Methodology
DSP algorithms are directly mapped onto the XPP array according to their data flow graphs. The flow graph nodes define the functionality and operations of the PAEs, whereas the edges define the connections between the PAEs. The XPP array is programmed using the native mapping language (NML); see [20]. In NML descriptions, the PAEs are explicitly allocated and the connections between the PAEs are specified. Optionally, the allocated PAEs are placed onto the XPP array. NML also includes statements to support configuration handling. Configuration handling is an explicit part of the application description.
A vectorizing C compiler is available to translate C functions to NML modules. The vectorizing compiler for the XPP array analyzes the code for data dependencies, vectorizes those code sections automatically, and generates highly parallel code for the XPP array. The vectorizing C compiler is typically used to program "regular" DSP operations that are mapped on ALU-PAEs and RAM-PAEs of the XPP array. Furthermore, a coarse-grained parallelization into several FNC-PAE threads is very useful when "irregular" DSP operations exist in an application. This allows running even irregular, control-dominated code in parallel on several FNC-PAEs. The FNC-PAE C compiler is similar to a conventional RISC compiler, extended with VLIW features to take advantage of ILP within the DSP algorithms.
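The distinction between "regular" and "irregular" code can be illustrated in plain C (not NML): the first function below is a fixed-length multiply-accumulate loop with no data-dependent control flow, the kind of code a vectorizing compiler can map onto ALU- and RAM-PAEs, whereas the second branches on the data itself and is better suited to sequential execution on an FNC-PAE.

```c
/* Illustration of "regular" vs. "irregular" DSP code, in plain C. */
#include <stddef.h>
#include <stdint.h>

/* Regular: fixed-length multiply-accumulate over arrays, easily vectorized. */
int32_t dot_product(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Irregular: data-dependent control flow (run-length encoding). */
size_t run_length_encode(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        uint8_t v = in[i];
        size_t run = 1;
        while (i + run < n && in[i + run] == v && run < 255)
            run++;
        out[o++] = v;
        out[o++] = (uint8_t)run;
        i += run;
    }
    return o;                         /* number of output bytes */
}
```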
11.3.4 Tilera
The Tile64 [7] is a TP based on the mesh architecture that was originally developed for the RAW machine [26]. The chip consists of a grid of processor tiles arranged in a network (see Figure 11.13), where each tile consists of a GPP, a cache, and a nonblocking router that the tile uses to communicate with the other tiles on the chip.
The Tilera processor architecture incorporates a 2D array of homogeneous, general-purpose cores. Next to each processor there is a switch that connects the core to the iMesh on-chip network. The combination of a core and a switch forms the basic building block of the Tilera processor: the tile. Each core is a fully functional processor capable of running complete operating systems and off-the-shelf C code. Each core is optimized to provide a high performance/power ratio, running at speeds between 600 MHz and 1 GHz, with power consumption as low as 170 mW in a typical application. Each core supports standard processor features such as
• Full access to memory and I/O
• Virtual memory mapping and protection (MMU/TLB)
• Hierarchical cache with separate L1-I and L1-D
• Multilevel interrupt support
• Three-way VLIW pipeline to issue three instructions per cycle
The cache subsystem on each tile consists of a high-performance, two-level, non-blocking cache hierarchy. Each processor/tile has a split level 1 cache (L1 instruction and L1 data) and a level 2 cache, keeping the design fast and power efficient. When there is a miss in the level 2 cache of a specific processor, the level 2 caches of the other processors are searched for the data before external memory is consulted. In this way, a large level 3 cache is emulated.
This promotes on-chip access and avoids the bottleneck of off-chip global memory. Multicore coherent caching allows a page of shared memory, cached on a specific tile, to be accessed via load/store references from other tiles. Since one tile effectively prefetches for the others, this technique can yield significant performance improvements.
To fully exploit the available compute power of large numbers of processors, a high-bandwidth, low-latency interconnect is essential. The network (iMesh) provides the high-speed data transfer needed to minimize system bottlenecks and to scale applications. iMesh consists of five distinct mesh networks: two networks are completely managed by hardware and are used to move data to and from the tiles and memory in the event of cache misses or DMA transfers; the three remaining networks are available for application use, enabling communication between cores and between cores and I/O devices. A number of high-level abstractions are supplied for accessing the hardware (e.g., socket-like streaming channels and message-passing interfaces). The iMesh network enables communication without interrupting applications running on the tiles. It facilitates data transfer between tiles, contains all of the control and datapath for each of the network connections, and implements buffering and flow control within all the networks.

11.3.4.1 Design Methodology
The TILE64 processor is programmable in ANSI standard C and C++. Tiles can be grouped into clusters to apply the appropriate amount of processing power to each application, and parallelism can be explicitly specified.
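Since the Tilera-specific libraries are not covered here, the sketch below uses standard POSIX threads (which the tiles can run, as each core runs a complete operating system) to spread a simple data-parallel job over a cluster of cores; the cluster size and the work function are illustrative assumptions.

```c
/* Sketch of explicitly specified parallelism in plain C: one worker thread
 * per tile in an assumed cluster, each processing its own slice of the data. */
#include <pthread.h>
#include <stddef.h>

#define CLUSTER_SIZE 8                 /* assumed number of tiles in a cluster */

typedef struct {
    const float *in;
    float       *out;
    size_t       begin, end;           /* half-open range of elements */
} slice_t;

static void *scale_slice(void *arg)    /* per-tile worker */
{
    slice_t *s = (slice_t *)arg;
    for (size_t i = s->begin; i < s->end; i++)
        s->out[i] = 2.0f * s->in[i];
    return NULL;
}

void scale_parallel(const float *in, float *out, size_t n)
{
    pthread_t tid[CLUSTER_SIZE];
    slice_t   job[CLUSTER_SIZE];
    size_t    chunk = (n + CLUSTER_SIZE - 1) / CLUSTER_SIZE;

    for (int t = 0; t < CLUSTER_SIZE; t++) {
        job[t].in    = in;
        job[t].out   = out;
        job[t].begin = (size_t)t * chunk;
        job[t].end   = (job[t].begin + chunk < n) ? job[t].begin + chunk : n;
        pthread_create(&tid[t], NULL, scale_slice, &job[t]);
    }
    for (int t = 0; t < CLUSTER_SIZE; t++)
        pthread_join(tid[t], NULL);
}
```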
11.4 Conclusion
In this chapter, we addressed reconfigurable multicore architectures for streaming DSP applications. Streaming DSP applications express computation as a data flow graph with streams of data items (the edges) flowing between computation kernels (the nodes). Typical examples of streaming DSP applications are wireless baseband processing, multimedia processing, medical image processing, and sensor processing. These application domains require flexible and energy-efficient architectures, which can be realized with a multicore architecture. The most important criteria for designing such a multicore architecture are predictability and composability, energy efficiency, programmability, and dependability. Two other important criteria are performance and flexibility. Different types of processing cores have been discussed, from ASICs and reconfigurable hardware to DSPs and GPPs. ASICs have high performance but suffer from poor flexibility, while DSPs and GPPs offer flexibility but modest performance. Reconfigurable hardware combines the best of both worlds. These different processing cores are, together with memory and I/O blocks, assembled into MP-SoCs. MP-SoCs can be classified into two groups: homogeneous and heterogeneous. In homogeneous MP-SoCs, multiple cores of a single type are combined, whereas in a heterogeneous MP-SoC, multiple cores of different types are combined.
We also discussed four different architectures: the MONTIUM/ANNABELLE SoC, the Aspex Linedancer, the PACT-XPP, and the Tilera processor. The MONTIUM, a coarse-grained, run-time reconfigurable core, has been used as one of the building blocks of the ANNABELLE SoC. The ANNABELLE SoC can be classified as a heterogeneous MP-SoC. The Aspex Linedancer is a homogeneous MP-SoC where a single instruction is executed by multiple processors simultaneously (SIMD). The PACT-XPP is an array processor where multiple ALUs are combined in a 2D structure. The Tilera processor is an example of a homogeneous MIMD MP-SoC.
References
1. The International Technology Roadmap for Semiconductors, ITRS Roadmap 2003 Website, 2003. http://public.itrs.net/Files/2003ITRS/Home2003.htm
2. A coarse-grained reconfigurable architecture template and its compilation techniques. PhD thesis, Katholieke Universiteit Leuven, Leuven, Belgium, January 2005.
3. Nvidia G80, architecture and GPU analysis, 2007.
4. Aspex Semiconductor: Technology Website, 2008. http://www.aspex-semi.com/q/technology.shtml
5. Mimagic 6+ Enables Exciting Multimedia for Feature Phones Website, 2008. http://www.neomagic.com/product/MiMagig6+_Product_Brief.pdf
6. PACT. http://www.pactxpp.com/main/index.php, 2008.
7. Tilera Corporation. http://www.tilera.com/, 2008.
8. Atmel Corporation. ATC13 Summary. http://www.atmel.com, 2007.
9. A. Banerjee, P.T. Wolkotte, R.D. Mullins, S.W. Moore, and G.J.M. Smit. An energy and performance exploration of network-on-chip architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 17(3):319–329, March 2009.
10. V. Baumgarte, G. Ehlers, F. May, A. Nückel, M. Vorbach, and M. Weinhardt. PACT XPP—A self-reconfigurable data processing architecture. Journal of Supercomputing, 26(2):167–184, September 2003.
11. M.D. van de Burgwal, G.J.M. Smit, G.K. Rauwerda, and P.M. Heysters. Hydra: An energy-efficient and reconfigurable network interface. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'06), Las Vegas, NV, pp. 171–177, June 2006.
12. G. Burns, P. Gruijters, J. Huisken, and A. van Wel. Reconfigurable accelerator enabling efficient SDR for low-cost consumer devices. In SDR Technical Forum, Orlando, FL, November 2003.
13. A.P. Chandrakasan, S. Sheng, and R.W. Brodersen. Low-power CMOS digital design. IEEE Journal of Solid-State Circuits, 27(4):473–484, April 1992.
14. W.J. Dally, U.J. Kapasi, B. Khailany, J.H. Ahn, and A. Das. Stream processors: Programmability and efficiency. Queue, 2(1):52–62, 2004.
15. European Telecommunication Standard Institute (ETSI). Broadband Radio Access Networks (BRAN); HIPERLAN Type 2; Physical (PHY) Layer, ETSI TS 101 475 v1.2.2 edition, February 2001.
16. Y. Guo. Mapping applications to a coarse-grained reconfigurable architecture. PhD thesis, University of Twente, Enschede, the Netherlands, September 2006.
17. P.M. Heysters, L.T. Smit, G.J.M. Smit, and P.J.M. Havinga. Max-log-MAP mapping on an FPFA. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'02), Las Vegas, NV, pp. 90–96, June 2002. CSREA Press, Las Vegas, NV.
18. P.M. Heysters. Coarse-grained reconfigurable processors – flexibility meets efficiency. PhD thesis, University of Twente, Enschede, the Netherlands, September 2004.
19. R.P. Kleihorst, A.A. Abbo, A. van der Avoird, M.J.R. Op de Beeck, L. Sevat, P. Wielage, R. van Veen, and H. van Herten. Xetal: A low-power high-performance smart camera processor. IEEE International Symposium on Circuits and Systems, ISCAS 2001, 5:215–218, 2001.
20. PACT XPP Technologies. http://www.pactcorp.com, 2007.
21. D.C. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey et al. Overview of the architecture, circuit design, and physical implementation of a first-generation Cell processor. IEEE Journal of Solid-State Circuits, 41(1):179–196, January 2006.
22. G.K. Rauwerda, P.M. Heysters, and G.J.M. Smit. Towards software defined radios using coarse-grained reconfigurable hardware. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(1):3–13, January 2008.
23. Recore Systems. http://www.recoresystems.com, 2007.
24. G.J.M. Smit, A.B.J. Kokkeler, P.T. Wolkotte, and M.D. van de Burgwal. Multi-core architectures and streaming applications. In I. Mandoiu and A. Kennings (editors), Proceedings of the Tenth International Workshop on System-Level Interconnect Prediction (SLIP 2008), New York, pp. 35–42, April 2008. ACM Press, New York.
25. S.R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan et al. An 80-tile sub-100-W teraflops processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits, 43(1):29–41, January 2008.
26. E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim et al. Baring it all to software: Raw machines. Computer, 30(9):86–93, September 1997.