The advent of the multicore processor era has impacted many areas of computer architecture design, from memory management and thread scheduling to inter-processor communication, debugging, power management, and more. Multicore Technology: Architecture, Reconfiguration, and Modeling gives a holistic overview of the field of multicore architectures to guide readers interested in further research and development. Featuring contributions by researchers from renowned institutes around the world, this book explores a broad range of topics. It brings together a variety of perspectives on multicore embedded systems and identifies the key technical challenges that are being faced.
In five parts, the book covers:
• Architecture and design flow solutions, including the MORA
framework for field programmable gate array (FPGA) programming, a
synchronous data flow (SDF)-based design flow, and an asymmetric
multi-processor system-on-chip (MPSoC) framework called SESAM.
• Work being done on parallelism and optimization, including an extension to atomic verifiable operation (AVOp) streams to support loops and a mechanism for accelerated critical sections (ACS) to reduce performance degradation due to critical sections.
• Tools for memory systems, including a multicore design space
exploration tool called TMbox and techniques for more efficient
shared memory architectures and scheduling.
• Network-on-chip (NoC) issues, with coverage of interconnects;
routing topologies, router architecture, switching techniques, flow
control, traffic patterns, and routing algorithms; a comparison
between mesh- and tree-based NoCs in 3D systems-on-chip;
and a proposed performance evaluation method.
A comprehensive survey of state-of-the-art research in multicore
processor architectures, this book is also a valuable resource for
anyone developing software and hardware for multicore systems.
Embedded Multi-Core Systems
Series Editors
Fayez Gebali and Haytham El Miligi
University of Victoria, Victoria, British Columbia
Multicore Technology: Architecture, Reconfiguration, and Modeling,
edited by Muhammad Yasir Qadri and Stephen J. Sangwine
Autonomic Networking-On-Chip: Bio-Inspired Specification, Development,
and Verification, edited by Phan Cong-Vinh
Bioinformatics: High Performance Parallel Computer Architectures,
edited by Bertil Schmidt
Multi-Core Embedded Systems, Georgios Kornaros
CRC Press is an imprint of the
Taylor & Francis Group, an informa business
Boca Raton London New York
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2014 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20130624
International Standard Book Number-13: 978-1-4398-8064-7 (eBook - PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

List of Figures vii
1 MORA: High-Level FPGA Programming Using a Many-Core Framework
Wim Vanderbauwhede, Sai Rahul Chalamalasetti, and Martin Margala
2 Implementing Time-Constrained Applications on a Predictable Multiprocessor System-on-Chip
Sander Stuijk, Akash Kumar, Roel Jordans, and Henk Corporaal
3 SESAM: A Virtual Prototyping Solution to Design Multicore Architectures for Dynamic Applications
Nicolas Ventroux, Tanguy Sassolas, Alexandre Guerre, and Caaliph Andriamisaina
4 Verified Multicore Parallelism Using Atomic Verifiable Operations
Michal Dobrogost, Christopher Kumar Anand, and Wolfram Kahl
5 Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures
M. Aater Suleman and Onur Mutlu
III Memory Systems 169
6 TMbox: A Flexible and Reconfigurable Hybrid Transactional Memory System
Prasun Ghosal and Tuhin Subhra Das
Mohammad Ayoub Khan and Abdul Quaiyum Ansari
13 Network-on-Chip Performance Evaluation Using an Analytical Method
Sahar Foroutan, Abbas Sheibanyrad, and Frédéric Pétrot
List of Figures

1.1 MORA-C++ tool chain 7
1.2 MORA reconfigurable cell (RC) 8
1.3 Block diagram of the MORA PE 9
1.4 MORA control unit flow chart 10
1.5 MORA address generator 11
1.6 Reconfigurable cell floating-point core 28
1.7 Vector support for RC architecture 30
1.8 Shared memory access interface architecture 31
1.9 ADG diagram for the DCT small algorithm 34
1.10 Slice count for benchmark algorithms 35
1.11 Effect of vectorization on Slice/BRAM counts 36
1.12 Throughput versus number of lanes for 8-bit benchmarks 37
2.1 SDF3/MAMPS design flow 43
2.2 Example SDFG and implementation of actor P 45
2.3 MAMPS platform architecture 49
2.4 MAMPS scheduling code for a processing element 50
2.5 Overview of the SDF3 mapping flow 53
2.6 Dataflow model for interconnect communication in MAMPS 54
2.7 Two tile MAMPS platform in XPS 56
2.8 The SDF graph for the MJPEG decoder 57
2.9 Measured and guaranteed worst-case throughput 58
3.1 SESAM overview 65
3.2 SESAM infrastructure 67
3.3 Timed TLM versus approximate-timed TLM 68
3.4 Routing example for a mesh network 69
3.5 SESAM programming model 74
3.6 Structure of the debugging solution implemented in SESAM 78
3.7 PowerArchC: Power model generation flow 82
3.8 PowerArchC: Power-aware ISS architecture generation 83
3.9 DPM and DVFS techniques timing issues 84
3.10 Summary of buffer monitors and scheduling implications 86
3.11 SESAM exploration tool and environment 88
3.12 SESAM AGP toolchain 89
3.13 Example of automatic parallelization 90
3.14 Parallelization of SESAM simulations 92
3.15 SCMP architecture 93
3.16 Evaluation of SESAM accuracy 96
3.17 SESAM simulation speed 97
3.18 Network performance results 98
3.19 SCMP performance profiling with a variable number of PE 99
3.20 Power aware scheduling results with the WCDMA application 101
4.1 Locally sequential program 117
4.2 Φ is defined for other cores 119
4.3 Visualization of Φ dependency 119
4.4 Φ map at instruction SendSignal s3 → c3 120
4.5 Φ map at instruction WaitSignal s3 121
4.6 Example using the Loop AVOp 125
4.7 Example of an unrolled loop 125
4.8 Example with nested loops 126
4.9 Non-rewritable loop example 132
4.10 Motivating example for loop rewriting 134
4.11 Motivating example unrolled 135
4.12 Effects of an inner loop 138
4.13 Loop with rewriting verified without fully unrolling 139
4.14 Rewritable loop unrolled into a loop without a rewrite 139
4.15 Defining diagram for projected rewrite 145
4.16 Accessing global memory 146
5.1 Amdahl’s serial part, parallel part, and critical section in a multithreaded 15-puzzle kernel 156
5.2 Accelerated Critical Sections (ACS) 158
5.3 Source code and its execution: baseline and ACS 159
5.4 Execution time when number of threads is optimal for each application 162
5.5 Execution time when number of threads equals number of contexts 162
5.6 Speedup over a single small core 163
5.7 ACS versus TLR performance 165
6.1 An 8-core TMbox infrastructure 177
6.2 TMbox MIPS assembly for atomic{a++} 181
6.3 Cache state diagram 182
6.4 Eigenbench results on 1–16 cores 185
6.5 SSCA2 benchmark results on 1–16 cores 186
6.6 Intruder benchmark results on 1–16 cores 186
7.1 Hybrid migration/remote-access architecture 196
7.2 Efficient execution migration in a five-stage CPU core 197
7.3 Average memory latency costs 200
7.4 Parallel completion time under different DirCC protocols 206
7.5 Cache hierarchy miss rates at various cache sizes 207
7.6 The performance of DirCC (under a MOESI protocol) 210
7.7 Cache hierarchy miss rates for EM2 and RA designs 212
7.8 Non-local memory accesses in RA baseline 213
7.9 Per-benchmark core miss rates 214
7.10 Core miss rates handled by remote accesses 214
7.11 The performance of EM2 and RA variants relative to DirCC 215
7.12 Dynamic energy usage for all EM2 and RA variants 216
7.13 EM2 performance scales with network bandwidth 217
8.1 Accuracy of basic Estimate-M method on a dual-core system 230
8.2 Occupancy and estimation error 231
8.3 Two pairs of co-runners in dual-core systems 232
8.4 Cache occupancy for four co-runners in a quad-core system 233
8.5 Occupancy estimation for an over-committed quad-core system (Part 1) 234
8.6 Occupancy estimation for an over-committed quad-core system (Part 2) 235
8.7 Fine-grained occupancy estimation in an over-committed quad-core system 235
8.8 Effect of memory bandwidth contention on the MPKC miss-rate curve for the SPEC CPU2000 mcf workload 237
8.9 Miss-ratio curves (MRCs) for various SPEC CPU workloads, obtained online by CAFÉ versus offline by page coloring 240
8.10 MRC for mcf with different co-runners 241
8.11 Vtime compensation 246
8.12 Cache divvying occupancy prediction 249
8.13 Co-runner placement 250
9.1 Remote debugging scenario – software view 260
9.2 Debugging multiple cores through IEEE 1149.1 (JTAG) 264
9.3 Debugging a single-core SoC through In-Circuit Emulation 265
9.4 Debugging through trace generation and ICE 267
9.5 Example: Creating a tracepoint in the GDB debugger 268
9.6 Trace-based debugging scenario 275
9.7 Trace compression scheme 275
9.8 Finite context method 277
9.9 Huffman tree for prefix encoding 279
10.1 Side view of multipath interconnect 287
10.2 Network-on-chip concept 289
10.3 Reduction of interconnect length from 2D ICs to 3D ICs 292
10.4 Schematic representation of TSV first, middle, and last processes 293
10.5 Schematic of photonic interconnect using micro ring resonators 293
10.6 A simple schematic of a micro ring resonator 294
10.7 A photonic switch and a non-blocking photonic router 295
10.8 Torus topology 296
10.9 Concentrated mesh topology and wireless routers 297
11.1 Factors affecting the performance of an NoC 301
11.2 Mesh topology 302
11.3 Torus 302
11.4 Folded torus 303
11.5 Octagon 303
11.6 Star 303
11.7 Binary tree 303
11.8 Butterfly 304
11.9 Butterfly fat tree 304
11.10 Honeycomb 304
11.11 Mesh-of-tree 305
11.12 Diametric 2D mesh 305
11.13 Diametric 2D mesh of tree 306
11.14 A 9 × 9 structural diametrical 2D mesh 307
11.15 A 9 × 9 star type topology 307
11.16 Custom mesh topology 308
11.17 3D irregular mesh 310
11.18 Dragonfly topology 312
11.19 Wireless mesh 314
11.20 MORFIC (mesh overlaid with RF interconnect) 316
11.21 Hybrid ring 316
11.22 Hybrid star 316
11.23 Hybrid tree 317
11.24 Hybrid irregular topology 317
11.25 A typical router architecture 317
11.26 Router data flow 318
11.27 Different routing policies 322
11.28 West first turn 326
11.29 North last turn 326
11.30 Negative first turn 326
12.1 Classification of interconnection networks 336
12.2 Basic network topologies 337
12.3 Diameter in a connected graph 338
12.4 3-D mesh and torus topologies 340
12.5 Binary tree 342
12.6 Proposed topology with different levels (l = 1, 2, and 3) 343
12.7 Ring based tree topology 343
12.8 Layout of the proposed topology 347
12.9 Number of nodes in level l 348
12.10 Degree and diameter analysis of the proposed topology 349
12.11 3-D tree mesh 350
13.1 Operational layered concept of an NoC-based SoC architecture 361
13.2 The relation between NoC layers and levels of abstraction from a performance evaluation viewpoint 363
13.3 A generic design flow for an NoC-based system 364
13.4 Performance requirements versus performance analysis 368
13.5 Optimization loop: architectural exploration and mapping exploration 373
13.6 At each router of the path, disrupting packets appear probabilistically in front of the tagged packet 390
13.7 Dependency trees corresponding to the latency (a) ‘core to south’ and (b) ‘core to east’ of r3,4 in a 6 × 5 2D mesh NoC with the x-first routing algorithm 392
13.8 Router delay model related to a 2D mesh NoC 393
13.9 The average number of accumulated flits in the output buffer at the arrival of Pi when there is no header contention 397
13.10 The order of delay component computation in one iteration 397
13.11 Buffer occupancy caused by Pj at time instant (a) t and (b) t + 3 when Pj is transferred and Pi can be written into the buffer 398
13.12 Iterative computation for inputs {1, 2, 3, 4} of router r 400
13.13 Latency/load curves for the path r2,4 → r4,2 with buffer lengths in flits as indicated and uniform traffic (path latency excludes the source queue waiting time) 402
13.14 Latency/load curves for the path r2,4 → r4,2 with buffer lengths in flits as indicated and localized traffic 403
13.15 Analytical method for different buffer lengths and 0.01% offered load steps 404
13.16 The average utilization of buffer r3,4 → r4,4 under two traffic distributions 405
List of Tables

1.1 Utilization Results of Single Precision Floating-Point Core on Virtex 4 LX200 27
1.2 Latency of Shared Memory Interface Modules 32
1.3 Benchmark Implementation Results (No Vectorization) 35
1.4 Benchmark Throughput Results for a Single DMA Channel without RC Vectorization 35
1.5 Benchmark Throughput Results with Multiple DMA Channels and RC Vectorization 37
1.6 DCT Benchmark Throughput Comparison 39
2.1 Designer Effort 59
3.1 Hardware Abstraction Layer of SESAM 76
3.2 Basic Remote Protocol Support Commands 79
3.3 Additional Remote Protocol Commands for Fast Debugging 79
5.1 Best Number of Threads for Each Configuration 164
6.1 LUT Occupation of Components of the Honeycomb Core 176
6.2 HTM Instructions for TMbox 179
6.3 TM Benchmarks Used 184
7.1 Various Parameter Settings for the Analytical Cost Model for the ocean contiguous Benchmark 201
7.2 System Configurations Used 202
7.3 Area and Energy Estimates 208
7.4 Synthetic Benchmark Settings 210
9.1 Example of CAE with 16-Bit Addresses 274
9.2 Address Encoding Scheme 278
9.3 Example of Differential Address Encoding – 16-Bit Addresses 278
11.1 Relative Comparison of 2D Irregular Topologies 309
11.2 Comparison of Optical Network Topologies 313
11.3 The Wavelength Assignment of 4-WRON 330
12.1 Classification of NoC Topology 335
12.2 Analysis of Network Parameters for Base Module 347
13.1 Characteristics of Analytical Methods 387
13.2 Parameters of the Analytical Performance Evaluation Method 389
13.3 Comparing Simulation and Analytical Tool Runtimes 406
Preface

Multicore processor architectures are now mainstream even in applications such as mobile or portable telephones. For decades, computer architecture evolved through increases in the size and complexity of processors, and reductions in their cost and energy consumption. Eventually, however, there came a point where further increases in the complexity of a single processor were less desirable than providing multiple cores on the same chip. The advent of the multicore era has altered many concepts relating to almost all of the areas of computer architecture design, including core design, memory management, thread scheduling, application support, inter-processor communication, debugging, power management, and many more. This book provides a point of entry into the field of multicore architectures, covering some of the most researched aspects.
What to look for in it
This book is targeted not only to give readers a holistic overview of the field but also to guide them to further avenues of research by covering the state of the art in this area. The book includes contributions from renowned institutes across the globe, with authors from the following institutes contributing to the book (ordered alphabetically):
Barcelona Supercomputing Center, Spain
Bengal Engineering and Science University, Shibpur, India
Boston University, Boston, USA
CEA LIST, Embedded Computing Lab, France
Eindhoven University of Technology, The Netherlands
Google Inc., USA
Jamia Millia Islamia (Central University), New Delhi, India
Laboratoire TIMA, Grenoble, France
Massachusetts Institute of Technology, USA
McGill University, Montreal, Canada
McMaster University, Canada
National University of Singapore, Singapore
University of Glasgow, Glasgow, UK
University of Massachusetts, Lowell, MA, USA
University of Texas at Austin, USA
VMware Inc., USA
The book is divided into five parts: Architecture and Design Flow, Parallelism and Optimization, Memory Systems, Debugging, and Networks-on-Chip. The contents of each part are discussed in the following.
Architecture and Design Flow
This part contains three chapters.
Chapter 1, MORA: High-Level FPGA Programming Using a Many-Core Framework, presents an overview of the MORA framework, a high-level programmable multicore FPGA system based on a dataflow network of Processors-in-Memory. The MORA framework is targeted to simplify dataflow-based FPGA programming in C++ using a dedicated Application Programmer's Interface (API). The authors demonstrate an image processing application implemented using over a thousand cores.
Chapter 2, Implementing Time-Constrained Applications on a Predictable Multiprocessor System-on-Chip, presents a Synchronous Data Flow (SDF)-based design flow that instantiates different architectures using a template. The proposed design flow can generate an implementation of an application on an MPSoC while providing throughput guarantees to the application. Therefore the platform presented supports fast design space exploration for real-time embedded systems and is also extendable to heterogeneous applications.

Chapter 3, SESAM: A Virtual Prototyping Solution to Design Multicore Architectures for Dynamic Applications, presents an asymmetric MPSoC framework called SESAM. The MPSoC exploration environment can be used for a complete MPSoC design flow. It can help the design and sizing of complex architectures, as well as the exploration of application parallelism on multicore platforms, to optimize the area and performance efficiency of embedded systems. SESAM can integrate various instruction set simulators at the functional or cycle-accurate level, as well as different networks-on-chip, DMA, a memory management unit, caches, memories, and different control solutions to schedule and dispatch tasks. The framework also supports the energy modeling of the MPSoC design.
Parallelism and Optimization
This part contains two chapters.
Chapter 4, Verified Multicore Parallelism Using Atomic Verifiable Operations, presents an extension to Atomic Verifiable Operation (AVOp) streams to support loops. An AVOp is the basic instruction in the Domain Specific Language (DSL) proposed by the authors. AVOp streams allow performance to be maximized by introducing an algorithm for scheduling across different threads of execution so as to minimize contention in a synchronous operation. The authors also present a verification algorithm that guarantees hazard avoidance for any possible execution order. This framework enables a programmer to express complex communication patterns and hide communication latencies in an approach similar to software pipelining of loops.
Chapter 5, Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures, presents a mechanism for Accelerated Critical Sections (ACS) to reduce performance degradation due to critical sections. Critical sections are those sections of code that access mutually shared data among the cores. The principle of Mutual Exclusion dictates that threads cannot be allowed to update shared data concurrently; thus, accesses to shared data are encapsulated inside critical sections. This in effect can serialize threads, and reduce performance and scalability. Therefore, in order to avoid this performance loss, the authors propose acceleration of critical sections on a high-performance core of an Asymmetric Chip Multiprocessor (ACMP).
Memory Systems
Chapter 6, TMbox: A Flexible and Reconfigurable Hybrid Transactional Memory System, presents a multicore design space exploration tool called TMbox. This flexible experimental systems platform is based on an FPGA and offers a scalable and high-performance multiprocessor System-on-Chip (SoC) implementation that is configurable for integrating various Instruction Set Architecture (ISA) options and hardware organizations. Furthermore, the proposed platform is capable of executing operating systems and has extensive support for Hybrid Transactional Memory Systems.

Chapter 7, EM2: A Scalable Shared Memory Architecture for Large-scale Multicores, presents a technique to provide deadlock-free migration-based coherent shared memory to the Non-Uniform Cache Access (NUCA) family of architectures. Using the proposed Execution Migration Machine (EM2), the authors claim to achieve performance comparable to directory-based architectures without using directories. Furthermore, the proposed scheme is both energy and area efficient.
Chapter 8, CAFÉ: Cache-Aware Fair and Efficient Scheduling for CMPs, introduces an efficient online technique for generating Miss Ratio Curves (MRCs) and other cache utility curves that uses hardware performance counters available on commodity processors. Based on these monitoring and inference techniques, the authors also introduce methods to improve the fairness and efficiency of CMP scheduling decisions.
Debugging

In Chapter 9, Software Debugging Infrastructure for Multi-Core Systems-on-Chip, the authors present an overview of the existing multithreaded software debugging schemes and discuss challenges that are being faced by the designers of multi-core systems-on-chip. The authors conclude that traditional debugging methods are not suitable for debugging concurrently executing multi-threaded software. Furthermore, the use of trace generation to complement traditional debugging methods is gaining traction and is expected to take an increased role in the debugging of future multi-threaded software. However, for trace generation based schemes, the transfer of massive amounts of trace data off the chip for analysis is one of the major problems. The authors present an instruction-address trace compression scheme that aims to mitigate this problem.
Network-On-Chip
This part contains four chapters. Chapter 10, On Chip Interconnects for Multi-Core Architectures, presents a detailed study of the state of the art in interconnects used in multi-core architectures. The technologies discussed by the authors include Three Dimensional, Photonic, Wireless, RF Waveguide, and Carbon Nanotube based Interconnects.

Chapter 11, Routing in Multi-Core NoCs, presents an overview and survey of routing topologies, router architecture, switching techniques, flow control, traffic patterns, routing algorithms, and challenges faced by the existing architectures for on-chip networks.

Chapter 12, Efficient Topologies for 3-D Networks-on-Chip, presents a comparison between mesh- and tree-based NoCs in a 3D SoC. The authors conclude that for 3D SoCs both mesh- and tree-based NoCs are capable of achieving better performance compared to traditional 2D implementations. However, the proposed tree-based topologies show significant performance gains in terms of network diameter, degree, and number of nodes, and achieve significant reductions in energy dissipation and area overhead without any change in throughput and latency.

Finally, Chapter 13, Network-on-Chip Performance Evaluation Using an Analytical Method, presents an analytical performance evaluation method for NoCs that permits an architectural exploration of the network layer for a given application. Additionally, for a given network architecture, the method allows examination of the performance of different mappings of the application on the NoC. The proposed method is based on the computation of probabilities and contention delays between packets competing for shared resources, and provides a comprehensive delay analysis of the network layer.
We thank the team at CRC Press: Nora Konopka, Publisher, for supporting our proposal for this book; Kari Budyk and Michele Dimont, for keeping us on track and for assisting us promptly and courteously with our many questions and queries; and Shashi Kumar, for helping us around our LaTeX difficulties. We also thank our wives, Dr. Nadia N. Qadri and Dr. Elizabeth Shirley, who, although they have never met, have shared an experience that we inflicted on them, as we worked long hours editing this book, when we should have been spending time with them. We have often read such thanks (or indeed apologies) in other books, but now we understand why we must acknowledge their contribution to this book. Nadia and Elizabeth, thank you both for your support and patience.
Muhammad Yasir Qadri
Islamabad, Pakistan

Stephen J. Sangwine
Colchester, United Kingdom
December 2012
Muhammad Yasir Qadri was born in Pakistan in 1979. He graduated from Mehran University of Engineering and Technology in Electronic Engineering. He obtained his PhD in Electronic Systems Engineering from the School of Computer Science and Electronic Engineering, University of Essex, UK. His area of specialization is energy/performance optimization in reconfigurable MPSoC architectures. Before his time at Essex, he was actively involved in the development of high-end embedded systems for commercial applications. He is an Approved PhD Supervisor by the Higher Education Commission of Pakistan, and is currently working as a Visiting Faculty Member at HITEC University, Taxila, Pakistan.
Stephen J. Sangwine was born in London in 1956. He received a BSc degree in Electronic Engineering from the University of Southampton, Southampton, UK, in 1979, and his PhD from the University of Reading, Reading, UK, in 1991, for work on digital circuit fault diagnosis. He was a Lecturer in the Department of Engineering at the University of Reading from 1985–2000, and since 2001 has been a Senior Lecturer at the University of Essex, Colchester, UK. His interests include color image processing and vector signal processing using hypercomplex algebras, and digital hardware design and test.
Christopher Kumar Anand
Department of Computing and Software
McMaster University
Hamilton, Ontario, Canada
Abdul Quaiyum Ansari
Department of Electrical Engineering
Jamia Millia Islamia
New Delhi, India
Oriol Arcas
Barcelona Supercomputing Center
Universitat Politècnica de Catalunya
Barcelona, Spain
Sai Rahul Chalamalasetti
Department of Electrical and
Computer Engineering
University of Massachusetts
Lowell, MA, USA
Myong Hyon Cho
Massachusetts Institute of
Technology
Cambridge, MA, USA
Henk Corporaal
Department of Electrical Engineering
Eindhoven University of Technology
Eindhoven, The Netherlands
Adrián Cristal
Barcelona Supercomputing Center
CSIC — Spanish National Research Council
Barcelona, Spain

Tuhin Subhra Das
Department of Information Technology
Bengal Engineering and Science University
Shibpur, India

Srinivas Devadas
Massachusetts Institute of Technology
Cambridge, MA, USA

Michal Dobrogost
Department of Computing and Software
McMaster University
Hamilton, Ontario, Canada

Sahar Foroutan
TIMA Laboratory, SLS Team
Grenoble, France

Prasun Ghosal
Department of Information Technology
Bengal Engineering and Science University
Shibpur, India
Department of Electrical Engineering
Eindhoven University of Technology
Eindhoven, The Netherlands
Wolfram Kahl
Department of Computing and
Software
McMaster University
Hamilton, Ontario, Canada
Mohammad Ayoub Khan
Center for Development of Advanced Computing
Bojan Mihajlović
Department of Electrical and Computer Engineering
McGill University
Montreal, Canada

Onur Mutlu
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA, USA

Frédéric Pétrot
TIMA Laboratory, SLS Team
Grenoble, France
Soumyajit Poddar
School of VLSI Technology
Bengal Engineering and Science University
Shibpur, India

Tanguy Sassolas
Embedded Computing Lab
CEA LIST
Gif-sur-Yvette, France

Hamed Sheibanyrad
TIMA Laboratory, SLS Team
Grenoble, France

Keun Sup Shim
Massachusetts Institute of Technology
Cambridge, MA, USA

Satnam Singh
Google, Inc.
Mountain View, CA, USA
Nehir Sonmez
Barcelona Supercomputing Center
Universitat Politècnica de Catalunya
Barcelona, Spain
Sander Stuijk
Department of Electrical Engineering
Eindhoven University of Technology
Eindhoven, The Netherlands
Barcelona Supercomputing Center
Universitat Politècnica de Catalunya
Gif-sur-Yvette, France

Carl A. Waldspurger
(Formerly at) VMware Inc.
Palo Alto, CA, USA

Richard West
Department of Computer Science
Boston University
Boston, MA, USA

Puneet Zaroo
VMware Inc.
Palo Alto, CA, USA

Xiao Zhang
Google, Inc.
Mountain View, CA, USA

Željko Žilić
Department of Electrical and Computer Engineering
McGill University
Montreal, Canada
Abbreviations

AML Average Memory Latency
API Application Programming Interface
ASIC Application-Specific Integrated Circuit
BRAM Block RAM
CABA Cycle Accurate Bit Accurate
CMP Chip Multi-Processor
CPI Cycles Per Instruction
CPU Central Processing Unit
DCT Discrete Cosine Transform
DDR Double Data Rate
DMA Direct Memory Access
DRAM Dynamic RAM
DSP Digital Signal Processor
DWT Discrete Wavelet Transform
ECC Error-Correcting Code
FPGA Field Programmable Gate Array
FPU Floating-Point Unit
GALS Globally Asynchronous Locally Synchronous
GHz Gigahertz
GPU Graphics Processing Unit
HAL Hardware Abstraction Layer
HDL Hardware Description Language
ISA Instruction Set Architecture
ISS Instruction Set Simulator
ITRS International Technology Roadmap for Semiconductors
MRC Miss-Ratio Curve
NI Network Interface
NoC Network-on-Chip
NUCA Non-Uniform Cache Access
NUMA Non-Uniform Memory Access
OS Operating System
OSI Open System Interconnection
PIM Processor In Memory
QoS Quality of Service
RAM Random Access Memory
RISC Reduced Instruction Set Computer
RTL Register Transfer Level
SDAR Sampled Data Address Register
SoC System-on-Chip
SPEC Standard Performance Evaluation Corporation
SRAM Static RAM
SDRAM Synchronous Dynamic RAM
TDM Time Division Multiplexing
TLB Translation Look-Aside Buffer
TLM Transaction Level Modeling
VC Virtual Channel
Vtime Virtual Time
Part I

Architecture and Design Flow
Wim Vanderbauwhede
School of Computing Science, University of Glasgow, Glasgow, UK
Sai Rahul Chalamalasetti and Martin Margala
Department of Electrical and Computer Engineering, University of sachusetts, Lowell, MA, USA
Mas-CONTENTS
1.1 Overview of the State of the Art in High-Level FPGA
Programming 41.2 Introduction to the MORA Framework 61.2.1 MORA Concept 61.2.2 MORA Tool Chain 61.3 The MORA Reconfigurable Cell 61.3.1 Processing Element 81.3.2 Control Unit and Address Generator 81.3.3 Asynchronous Handshake 91.3.4 Execution Model 121.4 The MORA Intermediate Representation 121.4.1 Expression Language 131.4.2 Coordination Language 151.4.3 Generation Language 161.4.4 Assembler 171.5 MORA-C++ API 181.5.1 Key Features 191.5.2 MORA-C++ by Example 191.5.3 MORA-C++ Compilation 221.5.4 Floating-Point Compiler (FloPoCo) Integration 271.6 Hardware Infrastructure for the MORA Framework 291.6.1 Direct Memory Access (DMA) Channel Multiplexing 291.6.2 Vectorized RC Support 291.6.3 Shared Memory Access 301.7 Results 331.7.1 Thousand-Core Implementation 33
3
Trang 311.7.2 Results 341.7.3 Comparison with Other DCT Implementations 381.8 Conclusion and Future Work 40
This chapter presents an overview of the current state of the MORA work, a high-level programmable multicore FPGA system based on a dataflownetwork of Processors-in-Memory The aim of the MORA framework is tosimplify dataflow-based FPGA programming while still delivering excellentperformance, by providing a streaming dataflow framework that can be pro-grammed in C++ using a dedicated Application Programmer’s Interface(API) Many of the restrictions common to most other C-to-gates tools donot apply to MORA because of the adoption of processors rather than LUTs
frame-as the smallest unit of the design MORA’s processors are unique frame-as theyare specialised in terms of instruction set, data path width, and memory sizefor the particular section of the program that runs on them As a result, wehave demonstrated an image processing application implemented using over
a thousand cores
The chapter starts with the background and rationale for this work andthe state of the art The subsequent sections discuss in detail the hardwareand software aspects of the MORA framework: architecture, hardware in-frastructure, and tool chain for the FPGA; design of the MORA-C++ API,the Intermediate Representation, compiler, and assembler The final sectionspresent and discuss benchmark results for several streaming data processingalgorithms to demonstrate the performance of the current system, and outlineavenues for future research
FPGA Programming
Media processing architectures and algorithms have come to play a major role
in modern consumer electronics, with applications ranging from basic nication devices to high-level processing machines Therefore architectures andalgorithms that provide adaptability and flexibility at a very low cost havebecome increasingly popular for implementing contemporary multimedia ap-plications Reconfigurable or adaptable architectures are widely being seen
commu-as viable alternatives to extravagantly powerful General Purpose Processors(GPP) as well as tailor-made but costly Application Specific Integrated Cir-
Trang 32cuits (ASICS) Over the last few years, FPGA devices have grown in size andcomplexity As a result, many applications that were previously restricted toASIC implementations can now be deployed on reconfigurable platforms Re-configurable devices such as FPGAs offer the potential of very short designcycles and reduced time to market.
However, with the ever increasing size and complexity of modern timedia processing algorithms, mapping them onto FPGAs using HardwareDescription Languages (HDLs) like VHDL or Verilog provided by many FPGAvendors has become increasingly difficult To overcome this problem severalgroups in academia as well as industry have engaged in developing high-levellanguage support for FPGA programming The most common approaches fallinto three main categories: HLL-to-gates, system builders, and soft processors.The HLL-to-gates design flow starts from a program written in a High-Level Language (HLL, typically a dialect of C) with additional keywordsand/or pragmas, and converts these programs into a Hardware DescriptionLanguage (HDL) such as Verilog or VHDL Examples of commercial tools inthis category are Handel-C (Sullivan, Wilson, and Chappell 2004), Impulse-
mul-C (Santambrogio et al 2007), Xilinx’ AutoESL (mul-Cong 2008), and Maxeler’s/MaxCompiler/ (Howes et al 2006) Academic solutions include Streams-C(Gokhale et al 2000), Trident (Tripp et al 2005), and ROCCC (Buyukkurt,Guo, and Najjar 2006) Despite the advantage of a shorter learning curve forprogrammers to understand these languages, a significant disadvantage of thisC-based coding style is that it is customized to suit Von Neumann processorarchitectures, which cannot fully extract parallelism out of FPGAs
By system builders we mean solutions that will generate complex IP coresfrom a high-level description, often using a wizard Examples are Xilinx’CoreGen and Altera’s Mega wizard These tools greatly enhance productivitybut are limited to creating designs using parameterized predefined IP cores.Graphical tools such as MATLAB-Simulink and NI LabVIEW also fall intothis category
Finally, soft processors have increasingly been seen as strong players inthis category Each FPGA vendor provides its own soft cores such as Microb-laze and Picoblaze from Xilinx and Nios from Altera However, the traditionalarchitectures with shared memory access and mutual memory access are farfrom ideal to exploit the inherent parallelism inherent in FPGAs for mediaprocessing applications To address this problem, different processor architec-tures are needed One such architecture has been proposed and commercialized
by Mitrionics: the ‘Mitrion Virtual Processor’ (MVP) is a massively parallelprocessor that can be customized for the specific programs that run on it (Kin-dratenko, Brunner, and Myers 2007) Other alternatives are processor arrayssuch as proposed by Craven, Patterson, and Athanas (2006), which are based
on the OpenFire processor or the MOLEN reconfigurable processor compiler(Panainte, Bertels, and Vassiliadis 2007)
Trang 331.2 Introduction to the MORA Framework
In this section we introduce the MORA framework We discuss the applicationdomain and rationale for the framework and introduce the main concepts andbuilding blocks of the MORA framework, the MORA system abstraction, andthe adopted approach to high-level FPGA programming
1.2.1 MORA Concept
The MORA framework (Vanderbauwhede et al 2009, 2010) is targeted at theimplementation of high performance streaming algorithms It allows the ap-plication developer to write an FPGA application using a C++ API (MORA-C++) which essentially implements a Communicating Sequential Processes(CSP) paradigm (Hoare 1978) The toolchain converts the program into acompile-time generated network (a directed dataflow graph) where every node
is implemented on a compile-time configurable Processor-in-Memory (PIM),called the MORA Reconfigurable Cell (RC) (Chalamalasetti et al 2009) Each
RC is tailored (in terms of instruction set and memory size) to the specificcode implementing the process As a result, the MORA framework funda-mentally differs from other HLL-to-gates languages as it allows memory-basedconstructs and algorithms (e.g., stack machines and pointer-based data struc-tures)
1.2.2 MORA Tool Chain
Figure 1.1 shows the MORA tool chain, which consists of the MORA-C++compiler, the MORA assembler, and the FPGA back-end (currently targetingthe SGI RC-100 platform)
From the MORA-C++ source code, the compiler emits MORA diate Representation (IR) language, which is transformed into Verilog andimplemented using the Xilinx ISE tools The assembler can also transformthe IR language into a cycle-approximate SystemC model, allowing fast sim-ulation of the design Combined with the ability to compile the source codeusing g++, this provides a powerful development system allowing rapid designiterations
The MORA architecture consists of a compile-time generated network of configurable Cells Although MORA supports access to shared memory, the
Trang 34Re-FIGURE 1.1
MORA-C++ tool chain
memory architecture is distributed: storage for data is partitioned among RCs
by providing each RC with internal data memory As each RC is a in-Memory (PIM), computations can be performed close to memory, therebyreducing memory access time This architecture results in the high memoryaccess bandwidth needed to efficiently support massively parallel computa-tions required in multimedia applications while maintaining generality andscalability of the system
Processor-The external data exchange is managed by an I/O Controller which canaccess the internal memory of the input and output RCs through standardmemory interface instructions The internal data flow is controlled in a dis-tributed manner through a handshaking mechanism between the RCs.Each RC (Figure 1.2) consists of a Processing Element (PE) for arith-metic and logical operations with configurable word size, a small (order of
1 KB) dual-port data memory implemented using FPGA block RAMs, and aControl Unit (CU) with a small instruction memory The PE forms the maincomputational unit of the RC; the control unit synchronises the instructionqueue within the RC and also controls inter-cell communication, allowing each
RC to work with a certain degree of independence The architectures of the
PE and CU are discussed in the following sub-sections
Trang 358 Multicore Technology: Architecture, Reconfiguration, and Modeling
RC requires additional functionality, we modified the PE to include signedarithmetic, logic, shifting, and comparison operations
Figure 1.3 shows the organization of the modified PE It includes the brid (signed/unsigned) arithmetic data path along with additional blocks forshifting and comparison operations The PE is designed by using preprocessorparameters, so that, depending on the instructions assigned to the RC, therequired modules of the PE are instantiated This parameterized approach re-sults in a dramatic improvement in resource utilization The arithmetic datapath is organized to provide single-cycle addition, subtraction, and multipli-cation operations The PE also provides two sets of registers at the input andoutput to enable accumulation-style operations, as often required for mediaprocessing applications Output is available at the registers every clock cycle
hy-1.3.2 Control Unit and Address Generator
The control unit provides the handshaking signals between memory and datapath, and ensures that the two units work in perfect synchronization witheach other The unit consists of a small instruction memory (compile-timeconfigurable, typically 10–100 instructions), three address generators (one foreach operand), instruction decoders, and instruction counters The wide in-struction word (the actual size depends on the size of the local memory, e.g.,
www.Ebook777.com
Trang 36Arithmetic Block Logic_Block
S2 S0
S2 S0 S5
Result[2*N-1:0]
S7
FIGURE 1.3
Block diagram of the MORA PE
92 bits for a 512-word memory) encodes the operation and base addressesfor an instruction’s operands, and the output data set and address offsets fortraversing through memory, as well as the number of times a specific opera-tion is to be performed The overall flow of the control unit is as shown inFigure 1.4
The address generator is shown in Figure 1.5 It accepts four data fields:base address, step, skip, and subset The base address is initiallyloaded into the address generator, and, depending on the values of step,skip, and subset, the address of the next memory location to fetch thedata is calculated The three fields allow the controller to move anywherethroughout the available data memory The address generation algorithm can
be written as shown in Algorithm 1.1 The address generator thus generatesthe range of addresses on which a given instruction is to be performed This
is a key feature for media processing applications which frequently involveoperations on matrices and vectors of data
1.3.3 Asynchronous Handshake
To minimize the impact of communication networks on the power tion of the array, each RC is equipped with a simple and resource efficientcommunication technique As every RC can in principle operate at a differ-ent clock speed, an asynchronous handshake mechanism was implemented As
Trang 37consump-FIGURE 1.4
MORA control unit flow chart
MORA is a streaming architecture, a two-way communication mechanism isrequired, one to communicate with the upstream RCs and another to commu-nicate with the downstream RCs Altogether, a total of four communicationI/O signals are used by each RC to communicate with the other RCs efficiently
in streaming fashion They are described as follows:
• rc rdy up is an output signal signifying that the RC is idle and ready toaccept data from upstream RCs
• rc rdy down is an input signal signifying that the downstream RCs areidle and ready to accept new data
Trang 38Algorithm 1.1: Address Generation Algorithm
address = base address
Trang 39• data rdy down is an output signal asserted when all the data transfers tothe downstream RCs are completed.
• data rdy up is an input signal to the RC corresponding to thedata rdy down signal from upstream RCs
Each RC can accept inputs either from two output ports of a single RC or fromtwo individual ports of different RCs The output of each RC can be routed
to, at most, four different RCs In order to support multiple RC connections
to a single cell, a two-bit vector is used for data rdy up (data rdy up[1:0]) and
a four-bit vector for rc rdy down (rc rdy down[3:0])
1.3.4 Execution Model
The RC has two operating modes: processing and loading When the RC isoperating in processing mode, it can either write the processed data backinto internal memory or write to a downstream RC For a formal description
of MORA’s execution model we refer the reader to Vanderbauwhede et al.(2009) Each RC has two execution modes while processing input data One
is sequential execution used for normal instructions (ADD, SUB, etc.) withwrite back option The second is pipelined execution for accumulation andinstructions with write out option Instructions with sequential execution takethree clock cycles to complete, with each clock cycle corresponding to reading,executing, and writing data to the RAM A prefetching technique is usedfor reading instructions from the instruction memory; this involves reading
a new instruction word while performing the last operation of the previousinstruction This approach enables the design to save one clock cycle for everynew instruction
For pipelined operations the controller utilizes the pipelining stage betweenthe RAM and PE This style of implementation allows the accumulation andwrite out operations to complete in n + 2 clock cycles The latency of 2 clockcycles results from reading and execution of the first set of operands Thesingle-cycle execution for instruction with write out option makes the RCvery efficient for streaming algorithms
The aim of the MORA Intermediate Representation (IR) language is to serve
as a compilation target for high-level languages such as MORA-C++ whilst
at the same time providing a means of programming the MORA processorarray at a low level
The language consists of three components: a coordination component
Trang 40MORA: High-Level FPGA Programming Using a Many-Core Framework 13which permits expression of the interconnection of the RCs in a hierarchi-cal fashion, an expression component which corresponds to the conventionalassembly languages for microprocessors and digital signal processors (DSPs),and a generation component which allows compile-time generation of coordi-nation and expression instances.
1.4.1 Expression Language
The MORA expression language is an imperative language with a very regularsyntax similar to other assembly languages: every line contains an instructionwhich consists of an operator followed by a list of operands The main differ-ences with other assembly languages are
• Typed operators: the type indicates the word size on which the operation
is performed, e.g., bit, byte, short
• Typed operands: operands are tuples indicating not only the address spacebut also the data type, i.e., word, row, column, or matrix
• Virtual registers and address banks: MORA has direct memory accessand no registers (an alternative view is that every memory location is aregister) Operations take the RAM addresses as operands; however, ‘vir-tual’ registers indicate where the result of an operation should be directed(RAM bank A/B, output L/R/both)
1.4.1.1 Instruction Structure
An instruction is of the general form:
instr ::= op nops dest opnd +
num ::=−(MEMSZ/2−1) (MEMSZ/2−1)
We illustrate these characteristics with an example The instruction for amultiply-accumulate of the first row of an 8× 8 matrix of 16-bit integers withthe first column of an 8× 8 matrix in the data memory reads in full:
www.Ebook777.com