The advent of the multicore processor era has impacted many areas of computer architecture design, from memory management and thread scheduling to inter-processor communication, debugging, power management, and more. Multicore Technology: Architecture, Reconfiguration, and Modeling gives a holistic overview of the field of multicore architectures to guide readers interested in further research and development. Featuring contributions by researchers from renowned institutes around the world, this book explores a broad range of topics. It brings together a variety of perspectives on multicore embedded systems and identifies the key technical challenges that are being faced.
In five parts, the book covers:
• Architecture and design flow solutions, including the MORA
framework for field programmable gate array (FPGA) programming, a
synchronous data flow (SDF)-based design flow, and an asymmetric
multi-processor system-on-chip (MPSoC) framework called SESAM.
• Work being done on parallelism and optimization, including an extension to atomic verifiable operation (AVOp) streams to support loops and a mechanism for accelerated critical sections (ACS) to reduce performance degradation due to critical sections.
• Tools for memory systems, including a multicore design space
exploration tool called TMbox and techniques for more efficient
shared memory architectures and scheduling.
• Network-on-chip (NoC) issues, with coverage of interconnects;
routing topologies, router architecture, switching techniques, flow
control, traffic patterns, and routing algorithms; a comparison
between mesh- and tree-based NoCs in 3D systems-on-chip;
and a proposed performance evaluation method.
A comprehensive survey of state-of-the-art research in multicore
processor architectures, this book is also a valuable resource for
anyone developing software and hardware for multicore systems.
Embedded Multi-Core Systems
Series Editors
Fayez Gebali and Haytham El Miligi
University of Victoria, Victoria, British Columbia
Multicore Technology: Architecture, Reconfiguration, and Modeling,
edited by Muhammad Yasir Qadri and Stephen J. Sangwine
Autonomic Networking-On-Chip: Bio-Inspired Specification, Development,
and Verification, edited by Phan Cong-Vinh
Bioinformatics: High Performance Parallel Computer Architectures,
edited by Bertil Schmidt
Multi-Core Embedded Systems, Georgios Kornaros
CRC Press is an imprint of the
Taylor & Francis Group, an informa business
Boca Raton London New York
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2014 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20130624
International Standard Book Number-13: 978-1-4398-8064-7 (eBook - PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

List of Figures vii
1 MORA: High-Level FPGA Programming Using a Many-Core Framework
Wim Vanderbauwhede, Sai Rahul Chalamalasetti, and Martin Margala
2 Implementing Time-Constrained Applications on a Predictable Multiprocessor System-on-Chip
Sander Stuijk, Akash Kumar, Roel Jordans, and Henk Corporaal
3 SESAM: A Virtual Prototyping Solution to Design Multicore Architectures for Dynamic Applications
Nicolas Ventroux, Tanguy Sassolas, Alexandre Guerre, and Caaliph Andriamisaina
4 Verified Multicore Parallelism Using Atomic Verifiable Operations
Michal Dobrogost, Christopher Kumar Anand, and Wolfram Kahl
5 Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures
M. Aater Suleman and Onur Mutlu
III Memory Systems 169
6 TMbox: A Flexible and Reconfigurable Hybrid Transactional Memory System
Prasun Ghosal and Tuhin Subhra Das
Mohammad Ayoub Khan and Abdul Quaiyum Ansari
13 Network-on-Chip Performance Evaluation Using an Analytical Method
Sahar Foroutan, Abbas Sheibanyrad, and Frédéric Pétrot
List of Figures

1.1 MORA-C++ tool chain 7
1.2 MORA reconfigurable cell (RC) 8
1.3 Block diagram of the MORA PE 9
1.4 MORA control unit flow chart 10
1.5 MORA address generator 11
1.6 Reconfigurable cell floating-point core 28
1.7 Vector support for RC architecture 30
1.8 Shared memory access interface architecture 31
1.9 ADG diagram for the DCT small algorithm 34
1.10 Slice count for benchmark algorithms 35
1.11 Effect of vectorization on Slice/BRAM counts 36
1.12 Throughput versus number of lanes for 8-bit benchmarks 37
2.1 SDF3/MAMPS design flow 43
2.2 Example SDFG and implementation of actor P 45
2.3 MAMPS platform architecture 49
2.4 MAMPS scheduling code for a processing element 50
2.5 Overview of the SDF3 mapping flow 53
2.6 Dataflow model for interconnect communication in MAMPS 54
2.7 Two tile MAMPS platform in XPS 56
2.8 The SDF graph for the MJPEG decoder 57
2.9 Measured and guaranteed worst-case throughput 58
3.1 SESAM overview 65
3.2 SESAM infrastructure 67
3.3 Timed TLM versus approximate-timed TLM 68
3.4 Routing example for a mesh network 69
3.5 SESAM programming model 74
3.6 Structure of the debugging solution implemented in SESAM 78
3.7 PowerArchC: Power model generation flow 82
3.8 PowerArchC: Power-aware ISS architecture generation 83
3.9 DPM and DVFS techniques timing issues 84
3.10 Summary of buffer monitors and scheduling implications 86
3.11 SESAM exploration tool and environment 88
3.12 SESAM AGP toolchain 89
3.13 Example of automatic parallelization 90
3.14 Parallelization of SESAM simulations 92
3.15 SCMP architecture 93
3.16 Evaluation of SESAM accuracy 96
3.17 SESAM simulation speed 97
3.18 Network performance results 98
3.19 SCMP performance profiling with a variable number of PE 99
3.20 Power aware scheduling results with the WCDMA application 101
4.1 Locally sequential program 117
4.2 Φ is defined for other cores 119
4.3 Visualization of Φ dependency 119
4.4 Φ map at instruction SendSignal s3 → c3 120
4.5 Φ map at instruction WaitSignal s3 121
4.6 Example using the Loop AVOp 125
4.7 Example of an unrolled loop 125
4.8 Example with nested loops 126
4.9 Non-rewritable loop example 132
4.10 Motivating example for loop rewriting 134
4.11 Motivating example unrolled 135
4.12 Effects of an inner loop 138
4.13 Loop with rewriting verified without fully unrolling 139
4.14 Rewritable loop unrolled into a loop without a rewrite 139
4.15 Defining diagram for projected rewrite 145
4.16 Accessing global memory 146
5.1 Amdahl’s serial part, parallel part, and critical section in a multithreaded 15-puzzle kernel 156
5.2 Accelerated Critical Sections (ACS) 158
5.3 Source code and its execution: baseline and ACS 159
5.4 Execution time when number of threads is optimal for each application 162
5.5 Execution time when number of threads equals number of contexts 162
5.6 Speedup over a single small core 163
5.7 ACS versus TLR performance 165
6.1 An 8-core TMbox infrastructure 177
6.2 TMbox MIPS assembly for atomic{a++} 181
6.3 Cache state diagram 182
6.4 Eigenbench results on 1–16 cores 185
6.5 SSCA2 benchmark results on 1–16 cores 186
6.6 Intruder benchmark results on 1–16 cores 186
7.1 Hybrid migration/remote-access architecture 196
7.2 Efficient execution migration in a five-stage CPU core 197
7.3 Average memory latency costs 200
7.4 Parallel completion time under different DirCC protocols 206
7.5 Cache hierarchy miss rates at various cache sizes 207
7.6 The performance of DirCC (under a MOESI protocol) 210
7.7 Cache hierarchy miss rates for EM2 and RA designs 212
7.8 Non-local memory accesses in RA baseline 213
7.9 Per-benchmark core miss rates 214
7.10 Core miss rates handled by remote accesses 214
7.11 The performance of EM2 and RA variants relative to DirCC 215
7.12 Dynamic energy usage for all EM2 and RA variants 216
7.13 EM2 performance scales with network bandwidth 217
8.1 Accuracy of basic Estimate-M method on a dual-core system 230
8.2 Occupancy and estimation error 231
8.3 Two pairs of co-runners in dual-core systems 232
8.4 Cache occupancy for four co-runners in a quad-core system 233
8.5 Occupancy estimation for an over-committed quad-core system (Part 1) 234
8.6 Occupancy estimation for an over-committed quad-core system (Part 2) 235
8.7 Fine-grained occupancy estimation in an over-committed quad-core system 235
8.8 Effect of memory bandwidth contention on the MPKC miss-rate curve for the SPEC CPU2000 mcf workload 237
8.9 Miss-ratio curves (MRCs) for various SPEC CPU workloads, obtained online by CAFÉ versus offline by page coloring 240
8.10 MRC for mcf with different co-runners 241
8.11 Vtime compensation 246
8.12 Cache divvying occupancy prediction 249
8.13 Co-runner placement 250
9.1 Remote debugging scenario – software view 260
9.2 Debugging multiple cores through IEEE 1149.1 (JTAG) 264
9.3 Debugging a single-core SoC through In-Circuit Emulation 265
9.4 Debugging through trace generation and ICE 267
9.5 Example: Creating a tracepoint in the GDB debugger 268
9.6 Trace-based debugging scenario 275
9.7 Trace compression scheme 275
9.8 Finite context method 277
9.9 Huffman tree for prefix encoding 279
10.1 Side view of multipath interconnect 287
10.2 Network-on-chip concept 289
10.3 Reduction of interconnect length from 2D ICs to 3D ICs 292
10.4 Schematic representation of TSV first, middle, and last processes 293
10.5 Schematic of photonic interconnect using micro ring resonators 293
10.6 A simple schematic of a micro ring resonator 294
10.7 A photonic switch and a non-blocking photonic router 295
10.8 Torus topology 296
10.9 Concentrated mesh topology and wireless routers 297
11.1 Factors affecting the performance of an NoC 301
11.2 Mesh topology 302
11.3 Torus 302
11.4 Folded torus 303
11.5 Octagon 303
11.6 Star 303
11.7 Binary tree 303
11.8 Butterfly 304
11.9 Butterfly fat tree 304
11.10 Honeycomb 304
11.11 Mesh-of-tree 305
11.12 Diametric 2D mesh 305
11.13 Diametric 2D mesh of tree 306
11.14 A 9 × 9 structural diametrical 2D mesh 307
11.15 A 9 × 9 star type topology 307
11.16 Custom mesh topology 308
11.17 3D irregular mesh 310
11.18 Dragonfly topology 312
11.19 Wireless mesh 314
11.20 MORFIC (mesh overlaid with RF interconnect) 316
11.21 Hybrid ring 316
11.22 Hybrid star 316
11.23 Hybrid tree 317
11.24 Hybrid irregular topology 317
11.25 A typical router architecture 317
11.26 Router data flow 318
11.27 Different routing policies 322
11.28 West first turn 326
11.29 North last turn 326
11.30 Negative first turn 326
12.1 Classification of interconnection networks 336
12.2 Basic network topologies 337
12.3 Diameter in a connected graph 338
12.4 3-D mesh and torus topologies 340
12.5 Binary tree 342
12.6 Proposed topology with different levels (l = 1, 2, and 3) 343
12.7 Ring based tree topology 343
12.8 Layout of the proposed topology 347
12.9 Number of nodes in level l 348
12.10 Degree and diameter analysis of the proposed topology 349
12.11 3-D tree mesh 350
13.1 Operational layered concept of an NoC-based SoC architecture 361
13.2 The relation between NoC layers and levels of abstraction from a performance evaluation viewpoint 363
13.3 A generic design flow for an NoC-based system 364
13.4 Performance requirements versus performance analysis 368
13.5 Optimization loop: architectural exploration and mapping exploration 373
13.6 At each router of the path, disrupting packets appear probabilistically in front of the tagged packet 390
13.7 Dependency trees corresponding to the latency (a) ‘core to south’ and (b) ‘core to east’ of r3,4 in a 6 × 5 2D mesh NoC with the x-first routing algorithm 392
13.8 Router delay model related to a 2D mesh NoC 393
13.9 The average number of accumulated flits in the output buffer at the arrival of Pi when there is no header contention 397
13.10 The order of delay component computation in one iteration 397
13.11 Buffer occupancy caused by Pj at time instant (a) t and (b) t + 3 when Pj is transferred and Pi can be written into the buffer 398
13.12 Iterative computation for inputs {1, 2, 3, 4} of router r 400
13.13 Latency/load curves for the path r2,4 → r4,2 with buffer lengths in flits as indicated and uniform traffic (path latency excludes the source queue waiting time) 402
13.14 Latency/load curves for the path r2,4 → r4,2 with buffer lengths in flits as indicated and localized traffic 403
13.15 Analytical method for different buffer lengths and 0.01% offered load steps 404
13.16 The average utilization of buffer r3,4 → r4,4 under two traffic distributions 405
List of Tables

1.1 Utilization Results of Single Precision Floating-Point Core on Virtex 4 LX200 27
1.2 Latency of Shared Memory Interface Modules 32
1.3 Benchmark Implementation Results (No Vectorization) 35
1.4 Benchmark Throughput Results for a Single DMA Channel without RC Vectorization 35
1.5 Benchmark Throughput Results with Multiple DMA Channels and RC Vectorization 37
1.6 DCT Benchmark Throughput Comparison 39
2.1 Designer Effort 59
3.1 Hardware Abstraction Layer of SESAM 76
3.2 Basic Remote Protocol Support Commands 79
3.3 Additional Remote Protocol Commands for Fast Debugging 79
5.1 Best Number of Threads for Each Configuration 164
6.1 LUT Occupation of Components of the Honeycomb Core 176
6.2 HTM Instructions for TMbox 179
6.3 TM Benchmarks Used 184
7.1 Various Parameter Settings for the Analytical Cost Model for the ocean contiguous Benchmark 201
7.2 System Configurations Used 202
7.3 Area and Energy Estimates 208
7.4 Synthetic Benchmark Settings 210
9.1 Example of CAE with 16-Bit Addresses 274
9.2 Address Encoding Scheme 278
9.3 Example of Differential Address Encoding – 16-Bit Addresses 278
11.1 Relative Comparison of 2D Irregular Topologies 309
11.2 Comparison of Optical Network Topologies 313
11.3 The Wavelength Assignment of 4-WRON 330
12.1 Classification of NoC Topology 335
12.2 Analysis of Network Parameters for Base Module 347
13.1 Characteristics of Analytical Methods 387
13.2 Parameters of the Analytical Performance Evaluation Method 389
13.3 Comparing Simulation and Analytical Tool Runtimes 406
Preface

Multicore processor architectures are now mainstream even in applications such as mobile or portable telephones. For decades, computer architecture evolved through increases in the size and complexity of processors, and reductions in their cost and energy consumption. Eventually, however, there came a point where further increases in the complexity of a single processor were less desirable than providing multiple cores on the same chip. The advent of the multicore era has altered many concepts relating to almost all of the areas of computer architecture design, including core design, memory management, thread scheduling, application support, inter-processor communication, debugging, power management, and many more. This book provides a point of entry into the field of multicore architectures, covering some of the most researched aspects.
What to look for in it
This book is targeted not only to give readers a holistic overview of the field but also to guide them to further avenues of research by covering the state of the art in this area. The book includes contributions from renowned institutes across the globe, with authors from the following institutes contributing to the book (ordered alphabetically):
Barcelona Supercomputing Center, Spain
Bengal Engineering and Science University, Shibpur, India
Boston University, Boston, USA
CEA LIST, Embedded Computing Lab, France
Eindhoven University of Technology, The Netherlands
Google Inc., USA
Jamia Millia Islamia (Central University), New Delhi, India
Laboratoire TIMA, Grenoble, France
Massachusetts Institute of Technology, USA
McGill University, Montreal, Canada
McMaster University, Canada
National University of Singapore, Singapore
University of Glasgow, Glasgow, UK
University of Massachusetts, Lowell, MA, USA
University of Texas at Austin, USA
VMware Inc., USA
The book is divided into five parts: Architecture and Design Flow, Parallelism and Optimization, Memory Systems, Debugging, and Networks-on-Chip. The contents of each part are discussed in the following.
Architecture and Design Flow
This part contains three chapters.
Chapter 1, MORA: High-Level FPGA Programming Using a Many-Core Framework, presents an overview of the MORA framework, a high-level programmable multicore FPGA system based on a dataflow network of Processors-in-Memory. The MORA framework is targeted to simplify dataflow-based FPGA programming in C++ using a dedicated Application Programmer's Interface (API). The authors demonstrate an image processing application implemented using over a thousand cores.
Chapter 2, Implementing Time-Constrained Applications on a Predictable Multiprocessor System-on-Chip, presents a Synchronous Data Flow (SDF)-based design flow that instantiates different architectures using a template. The proposed design flow can generate an implementation of an application on an MPSoC while providing throughput guarantees to the application. Therefore the platform presented supports fast design space exploration for real-time embedded systems and is also extendable to heterogeneous applications.

Chapter 3, SESAM: A Virtual Prototyping Solution to Design Multicore Architectures for Dynamic Applications, presents an asymmetric MPSoC framework called SESAM. The MPSoC exploration environment can be used for a complete MPSoC design flow. It can help the design and sizing of complex architectures, as well as the exploration of application parallelism on multicore platforms, to optimize the area and performance efficiency of embedded systems. SESAM can integrate various instruction set simulators at the functional or cycle-accurate level, as well as different networks-on-chip, DMA, a memory management unit, caches, memories, and different control solutions to schedule and dispatch tasks. The framework also supports the energy modeling of the MPSoC design.
Parallelism and Optimization
This part contains two chapters.
Chapter 4, Verified Multicore Parallelism Using Atomic Verifiable Operations, presents an extension to Atomic Verifiable Operation (AVOp) streams to support loops. An AVOp is the basic instruction in the Domain Specific Language (DSL) proposed by the authors. AVOp streams allow performance to be maximized by introducing an algorithm for scheduling across different threads of execution so as to minimize contention in a synchronous operation. The authors also present a verification algorithm that guarantees hazard avoidance for any possible execution order. This framework enables a programmer to express complex communication patterns and hide communication latencies in an approach similar to software pipelining of loops.
Chapter 5, Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures, presents a mechanism for Accelerated Critical Sections (ACS) to reduce performance degradation due to critical sections. Critical sections are those sections of code that access mutually shared data among the cores. The principle of Mutual Exclusion dictates that threads cannot be allowed to update shared data concurrently; thus, accesses to shared data are encapsulated inside critical sections. This in effect can serialize threads, and reduce performance and scalability. Therefore, in order to avoid this performance loss, the authors propose acceleration of critical sections on a high-performance core of an Asymmetric Chip Multiprocessor (ACMP).
Memory Systems
Chapter 6, TMbox: A Flexible and Reconfigurable Hybrid Transactional Memory System, presents a multicore design space exploration tool called TMbox. This flexible experimental systems platform is based on an FPGA and offers a scalable and high-performance multiprocessor System-on-Chip (SoC) implementation that is configurable for integrating various Instruction Set Architecture (ISA) options and hardware organizations. Furthermore, the proposed platform is capable of executing operating systems and has extensive support for Hybrid Transactional Memory Systems.

Chapter 7, EM2: A Scalable Shared Memory Architecture for Large-scale Multicores, presents a technique to provide deadlock-free migration-based coherent shared memory to the Non-Uniform Cache Access (NUCA) family of architectures. Using the proposed Execution Migration Machine (EM2), the authors claim to achieve performance comparable to directory-based architectures without using directories. Furthermore, the proposed scheme is both energy and area efficient.
Chapter 8, CAFÉ: Cache-Aware Fair and Efficient Scheduling for CMPs, introduces an efficient online technique for generating Miss Ratio Curves (MRCs) and other cache utility curves that uses hardware performance counters available on commodity processors. Based on these monitoring and inference techniques, the authors also introduce methods to improve the fairness and efficiency of CMP scheduling decisions.
Debugging

In Chapter 9, Software Debugging Infrastructure for Multi-Core Systems-on-Chip, the authors present an overview of the existing multithreaded software debugging schemes and discuss challenges that are being faced by the designers of multi-core systems-on-chip. The authors conclude that traditional debugging methods are not suitable for debugging concurrently executing multi-threaded software. Furthermore, the use of trace generation to complement traditional debugging methods is gaining traction and is expected to take an increased role in the debugging of future multi-threaded software. However, for trace generation based schemes, the transfer of massive amounts of trace data off the chip for analysis is one of the major problems. The authors present an instruction-address trace compression scheme that aims to mitigate this problem.
Network-On-Chip
This part contains four chapters. Chapter 10, On Chip Interconnects for Multi-Core Architectures, presents a detailed study of the state of the art in interconnects used in multi-core architectures. The technologies discussed by the authors include Three Dimensional, Photonic, Wireless, RF Waveguide, and Carbon Nanotube based Interconnects.

Chapter 11, Routing in Multi-Core NoCs, presents an overview and survey of routing topologies, router architecture, switching techniques, flow control, traffic patterns, routing algorithms, and challenges faced by the existing architectures for on-chip networks.

Chapter 12, Efficient Topologies for 3-D Networks-on-Chip, presents a comparison between mesh- and tree-based NoCs in a 3D SoC. The authors conclude that for 3D SoCs both mesh- and tree-based NoCs are capable of achieving better performance compared to traditional 2D implementations. However, the proposed tree-based topologies show significant performance gains in terms of network diameter, degree, and number of nodes, and achieve significant reductions in energy dissipation and area overhead without any change in throughput and latency.

Finally, Chapter 13, Network-on-Chip Performance Evaluation Using an Analytical Method, presents an analytical performance evaluation method for NoCs that permits an architectural exploration of the network layer for a given application. Additionally, for a given network architecture, the method allows examination of the performance of different mappings of the application on the NoC. The proposed method is based on the computation of probabilities and contention delays between packets competing for shared resources, and provides a comprehensive delay analysis of the network layer.
We thank the team at CRC Press: Nora Konopka, Publisher, for supporting our proposal for this book; Kari Budyk and Michele Dimont, for keeping us on track and for assisting us promptly and courteously with our many questions and queries; and Shashi Kumar, for helping us around our LaTeX difficulties. We also thank our wives, Dr. Nadia N. Qadri and Dr. Elizabeth Shirley, who, although they have never met, have shared an experience that we inflicted on them, as we worked long hours editing this book, when we should have been spending time with them. We have often read such thanks (or indeed apologies) in other books, but now we understand why we must acknowledge their contribution to this book. Nadia and Elizabeth, thank you both for your support and patience.
Muhammad Yasir Qadri
Islamabad, Pakistan

Stephen J. Sangwine
Colchester, United Kingdom
December 2012
Muhammad Yasir Qadri was born in Pakistan in 1979. He graduated from Mehran University of Engineering and Technology in Electronic Engineering. He obtained his PhD in Electronic Systems Engineering from the School of Computer Science and Electronic Engineering, University of Essex, UK. His area of specialization is energy/performance optimization in reconfigurable MPSoC architectures. Before his time at Essex, he was actively involved in the development of high-end embedded systems for commercial applications. He is an Approved PhD Supervisor by the Higher Education Commission of Pakistan, and is currently working as a Visiting Faculty Member at HITEC University, Taxila, Pakistan.
Stephen J. Sangwine was born in London in 1956. He received a BSc degree in Electronic Engineering from the University of Southampton, Southampton, UK, in 1979, and his PhD from the University of Reading, Reading, UK, in 1991, for work on digital circuit fault diagnosis. He was a Lecturer in the Department of Engineering at the University of Reading from 1985–2000, and since 2001 has been a Senior Lecturer at the University of Essex, Colchester, UK. His interests include color image processing and vector signal processing using hypercomplex algebras, and digital hardware design and test.
Christopher Kumar Anand
Department of Computing and Software
McMaster University
Hamilton, Ontario, Canada
Abdul Quaiyum Ansari
Department of Electrical Engineering
Jamia Millia Islamia
New Delhi, India
Oriol Arcas
Barcelona Supercomputing Center
Universitat Politècnica de Catalunya
Barcelona, Spain
Sai Rahul Chalamalasetti
Department of Electrical and
Computer Engineering
University of Massachusetts
Lowell, MA, USA
Myong Hyon Cho
Massachusetts Institute of
Technology
Cambridge, MA, USA
Henk Corporaal
Department of Electrical Engineering
Eindhoven University of Technology
Eindhoven, The Netherlands
Adrián Cristal
Barcelona Supercomputing Center
CSIC — Spanish National Research Council
Barcelona, Spain

Tuhin Subhra Das
Department of Information Technology
Bengal Engineering and Science University
Shibpur, India

Srinivas Devadas
Massachusetts Institute of Technology
Cambridge, MA, USA

Michal Dobrogost
Department of Computing and Software
McMaster University
Hamilton, Ontario, Canada

Sahar Foroutan
TIMA Laboratory, SLS Team
Grenoble, France

Prasun Ghosal
Department of Information Technology
Bengal Engineering and Science University
Shibpur, India
Department of Electrical Engineering
Eindhoven University of Technology
Eindhoven, The Netherlands
Wolfram Kahl
Department of Computing and
Software
McMaster University
Hamilton, Ontario, Canada
Mohammad Ayoub Khan
Center for Development of Advanced Computing
Bojan Mihajlović
Department of Electrical and Computer Engineering
McGill University
Montreal, Canada

Onur Mutlu
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA, USA

Frédéric Pétrot
TIMA Laboratory, SLS Team
Grenoble, France
Soumyajit Poddar
School of VLSI Technology
Bengal Engineering and Science University
Shibpur, India

Tanguy Sassolas
Embedded Computing Lab
CEA LIST
Gif-sur-Yvette, France

Hamed Sheibanyrad
TIMA Laboratory, SLS Team
Grenoble, France

Keun Sup Shim
Massachusetts Institute of Technology
Cambridge, MA, USA

Satnam Singh
Google, Inc.
Mountain View, CA, USA
Nehir Sonmez
Barcelona Supercomputing Center
Universitat Politècnica de Catalunya
Barcelona, Spain
Sander Stuijk
Department of Electrical Engineering
Eindhoven University of Technology
Eindhoven, The Netherlands
Barcelona Supercomputing Center
Universitat Politècnica de Catalunya
Gif-sur-Yvette, France

Carl A. Waldspurger
(Formerly at) VMware Inc.
Palo Alto, CA, USA

Richard West
Department of Computer Science
Boston University
Boston, MA, USA

Puneet Zaroo
VMware Inc.
Palo Alto, CA, USA

Xiao Zhang
Google, Inc.
Mountain View, CA, USA

Željko Žilić
Department of Electrical and Computer Engineering
McGill University
Montreal, Canada
Abbreviations

AML Average Memory Latency
API Application Programming Interface
ASIC Application-Specific Integrated Circuit
BRAM Block RAM
CABA Cycle Accurate Bit Accurate
CMP Chip Multi-Processor
CPI Cycles Per Instruction
CPU Central Processing Unit
DCT Discrete Cosine Transform
DDR Double Data Rate
DMA Direct Memory Access
DRAM Dynamic RAM
DSP Digital Signal Processor
DWT Discrete Wavelet Transform
ECC Error-Correcting Code
FPGA Field Programmable Gate Array
FPU Floating-Point Unit
GALS Globally Asynchronous Locally Synchronous
GHz Gigahertz
GPU Graphics Processing Unit
HAL Hardware Abstraction Layer
HDL Hardware Description Language
ISA Instruction Set Architecture
ISS Instruction Set Simulator
ITRS International Technology Roadmap for Semiconductors
MRC Miss-Ratio Curve
NI Network Interface
NoC Network-on-Chip
NUCA Non-Uniform Cache Access
NUMA Non-Uniform Memory Access
OS Operating System
OSI Open System Interconnection
PIM Processor In Memory
QoS Quality of Service
RAM Random Access Memory
RISC Reduced Instruction Set Computer
RTL Register Transfer Level
SDAR Sampled Data Address Register
SoC System-on-Chip
SPEC Standard Performance Evaluation Corporation
SRAM Static RAM
SDRAM Synchronous Dynamic RAM
TDM Time Division Multiplexing
TLB Translation Look-Aside Buffer
TLM Transaction Level Modeling
VC Virtual Channel
Vtime Virtual Time
Part I

Architecture and Design Flow
Wim Vanderbauwhede
School of Computing Science, University of Glasgow, Glasgow, UK
Sai Rahul Chalamalasetti and Martin Margala
Department of Electrical and Computer Engineering, University of sachusetts, Lowell, MA, USA
Mas-CONTENTS
1.1 Overview of the State of the Art in High-Level FPGA
Programming 41.2 Introduction to the MORA Framework 61.2.1 MORA Concept 61.2.2 MORA Tool Chain 61.3 The MORA Reconfigurable Cell 61.3.1 Processing Element 81.3.2 Control Unit and Address Generator 81.3.3 Asynchronous Handshake 91.3.4 Execution Model 121.4 The MORA Intermediate Representation 121.4.1 Expression Language 131.4.2 Coordination Language 151.4.3 Generation Language 161.4.4 Assembler 171.5 MORA-C++ API 181.5.1 Key Features 191.5.2 MORA-C++ by Example 191.5.3 MORA-C++ Compilation 221.5.4 Floating-Point Compiler (FloPoCo) Integration 271.6 Hardware Infrastructure for the MORA Framework 291.6.1 Direct Memory Access (DMA) Channel Multiplexing 291.6.2 Vectorized RC Support 291.6.3 Shared Memory Access 301.7 Results 331.7.1 Thousand-Core Implementation 33
3
Trang 311.7.2 Results 341.7.3 Comparison with Other DCT Implementations 381.8 Conclusion and Future Work 40
This chapter presents an overview of the current state of the MORA work, a high-level programmable multicore FPGA system based on a dataflownetwork of Processors-in-Memory The aim of the MORA framework is tosimplify dataflow-based FPGA programming while still delivering excellentperformance, by providing a streaming dataflow framework that can be pro-grammed in C++ using a dedicated Application Programmer’s Interface(API) Many of the restrictions common to most other C-to-gates tools donot apply to MORA because of the adoption of processors rather than LUTs
frame-as the smallest unit of the design MORA’s processors are unique frame-as theyare specialised in terms of instruction set, data path width, and memory sizefor the particular section of the program that runs on them As a result, wehave demonstrated an image processing application implemented using over
a thousand cores
The chapter starts with the background and rationale for this work andthe state of the art The subsequent sections discuss in detail the hardwareand software aspects of the MORA framework: architecture, hardware in-frastructure, and tool chain for the FPGA; design of the MORA-C++ API,the Intermediate Representation, compiler, and assembler The final sectionspresent and discuss benchmark results for several streaming data processingalgorithms to demonstrate the performance of the current system, and outlineavenues for future research
FPGA Programming
Media processing architectures and algorithms have come to play a major role
in modern consumer electronics, with applications ranging from basic nication devices to high-level processing machines Therefore architectures andalgorithms that provide adaptability and flexibility at a very low cost havebecome increasingly popular for implementing contemporary multimedia ap-plications Reconfigurable or adaptable architectures are widely being seen
commu-as viable alternatives to extravagantly powerful General Purpose Processors(GPP) as well as tailor-made but costly Application Specific Integrated Cir-
Trang 32cuits (ASICS) Over the last few years, FPGA devices have grown in size andcomplexity As a result, many applications that were previously restricted toASIC implementations can now be deployed on reconfigurable platforms Re-configurable devices such as FPGAs offer the potential of very short designcycles and reduced time to market.
However, with the ever increasing size and complexity of modern timedia processing algorithms, mapping them onto FPGAs using HardwareDescription Languages (HDLs) like VHDL or Verilog provided by many FPGAvendors has become increasingly difficult To overcome this problem severalgroups in academia as well as industry have engaged in developing high-levellanguage support for FPGA programming The most common approaches fallinto three main categories: HLL-to-gates, system builders, and soft processors.The HLL-to-gates design flow starts from a program written in a High-Level Language (HLL, typically a dialect of C) with additional keywordsand/or pragmas, and converts these programs into a Hardware DescriptionLanguage (HDL) such as Verilog or VHDL Examples of commercial tools inthis category are Handel-C (Sullivan, Wilson, and Chappell 2004), Impulse-
mul-C (Santambrogio et al 2007), Xilinx’ AutoESL (mul-Cong 2008), and Maxeler’s/MaxCompiler/ (Howes et al 2006) Academic solutions include Streams-C(Gokhale et al 2000), Trident (Tripp et al 2005), and ROCCC (Buyukkurt,Guo, and Najjar 2006) Despite the advantage of a shorter learning curve forprogrammers to understand these languages, a significant disadvantage of thisC-based coding style is that it is customized to suit Von Neumann processorarchitectures, which cannot fully extract parallelism out of FPGAs
By system builders we mean solutions that will generate complex IP coresfrom a high-level description, often using a wizard Examples are Xilinx’CoreGen and Altera’s Mega wizard These tools greatly enhance productivitybut are limited to creating designs using parameterized predefined IP cores.Graphical tools such as MATLAB-Simulink and NI LabVIEW also fall intothis category
Finally, soft processors have increasingly been seen as strong players inthis category Each FPGA vendor provides its own soft cores such as Microb-laze and Picoblaze from Xilinx and Nios from Altera However, the traditionalarchitectures with shared memory access and mutual memory access are farfrom ideal to exploit the inherent parallelism inherent in FPGAs for mediaprocessing applications To address this problem, different processor architec-tures are needed One such architecture has been proposed and commercialized
by Mitrionics: the ‘Mitrion Virtual Processor’ (MVP) is a massively parallelprocessor that can be customized for the specific programs that run on it (Kin-dratenko, Brunner, and Myers 2007) Other alternatives are processor arrayssuch as proposed by Craven, Patterson, and Athanas (2006), which are based
on the OpenFire processor or the MOLEN reconfigurable processor compiler(Panainte, Bertels, and Vassiliadis 2007)
Trang 331.2 Introduction to the MORA Framework
In this section we introduce the MORA framework We discuss the applicationdomain and rationale for the framework and introduce the main concepts andbuilding blocks of the MORA framework, the MORA system abstraction, andthe adopted approach to high-level FPGA programming
1.2.1 MORA Concept
The MORA framework (Vanderbauwhede et al 2009, 2010) is targeted at theimplementation of high performance streaming algorithms It allows the ap-plication developer to write an FPGA application using a C++ API (MORA-C++) which essentially implements a Communicating Sequential Processes(CSP) paradigm (Hoare 1978) The toolchain converts the program into acompile-time generated network (a directed dataflow graph) where every node
is implemented on a compile-time configurable Processor-in-Memory (PIM),called the MORA Reconfigurable Cell (RC) (Chalamalasetti et al 2009) Each
RC is tailored (in terms of instruction set and memory size) to the specificcode implementing the process As a result, the MORA framework funda-mentally differs from other HLL-to-gates languages as it allows memory-basedconstructs and algorithms (e.g., stack machines and pointer-based data struc-tures)
1.2.2 MORA Tool Chain
Figure 1.1 shows the MORA tool chain, which consists of the MORA-C++compiler, the MORA assembler, and the FPGA back-end (currently targetingthe SGI RC-100 platform)
From the MORA-C++ source code, the compiler emits MORA diate Representation (IR) language, which is transformed into Verilog andimplemented using the Xilinx ISE tools The assembler can also transformthe IR language into a cycle-approximate SystemC model, allowing fast sim-ulation of the design Combined with the ability to compile the source codeusing g++, this provides a powerful development system allowing rapid designiterations
The MORA architecture consists of a compile-time generated network of configurable Cells Although MORA supports access to shared memory, the
Trang 34Re-FIGURE 1.1
MORA-C++ tool chain
memory architecture is distributed: storage for data is partitioned among RCs
by providing each RC with internal data memory As each RC is a in-Memory (PIM), computations can be performed close to memory, therebyreducing memory access time This architecture results in the high memoryaccess bandwidth needed to efficiently support massively parallel computa-tions required in multimedia applications while maintaining generality andscalability of the system
Processor-The external data exchange is managed by an I/O Controller which canaccess the internal memory of the input and output RCs through standardmemory interface instructions The internal data flow is controlled in a dis-tributed manner through a handshaking mechanism between the RCs.Each RC (Figure 1.2) consists of a Processing Element (PE) for arith-metic and logical operations with configurable word size, a small (order of
1 KB) dual-port data memory implemented using FPGA block RAMs, and aControl Unit (CU) with a small instruction memory The PE forms the maincomputational unit of the RC; the control unit synchronises the instructionqueue within the RC and also controls inter-cell communication, allowing each
RC to work with a certain degree of independence The architectures of the
PE and CU are discussed in the following sub-sections
Trang 358 Multicore Technology: Architecture, Reconfiguration, and Modeling
RC requires additional functionality, we modified the PE to include signedarithmetic, logic, shifting, and comparison operations
Figure 1.3 shows the organization of the modified PE It includes the brid (signed/unsigned) arithmetic data path along with additional blocks forshifting and comparison operations The PE is designed by using preprocessorparameters, so that, depending on the instructions assigned to the RC, therequired modules of the PE are instantiated This parameterized approach re-sults in a dramatic improvement in resource utilization The arithmetic datapath is organized to provide single-cycle addition, subtraction, and multipli-cation operations The PE also provides two sets of registers at the input andoutput to enable accumulation-style operations, as often required for mediaprocessing applications Output is available at the registers every clock cycle
hy-1.3.2 Control Unit and Address Generator
The control unit provides the handshaking signals between memory and datapath, and ensures that the two units work in perfect synchronization witheach other The unit consists of a small instruction memory (compile-timeconfigurable, typically 10–100 instructions), three address generators (one foreach operand), instruction decoders, and instruction counters The wide in-struction word (the actual size depends on the size of the local memory, e.g.,
www.Ebook777.com
Trang 36Arithmetic Block Logic_Block
S2 S0
S2 S0 S5
Result[2*N-1:0]
S7
FIGURE 1.3
Block diagram of the MORA PE
92 bits for a 512-word memory) encodes the operation and base addressesfor an instruction’s operands, and the output data set and address offsets fortraversing through memory, as well as the number of times a specific opera-tion is to be performed The overall flow of the control unit is as shown inFigure 1.4
The address generator is shown in Figure 1.5 It accepts four data fields:base address, step, skip, and subset The base address is initiallyloaded into the address generator, and, depending on the values of step,skip, and subset, the address of the next memory location to fetch thedata is calculated The three fields allow the controller to move anywherethroughout the available data memory The address generation algorithm can
be written as shown in Algorithm 1.1 The address generator thus generatesthe range of addresses on which a given instruction is to be performed This
is a key feature for media processing applications which frequently involveoperations on matrices and vectors of data
1.3.3 Asynchronous Handshake
To minimize the impact of communication networks on the power tion of the array, each RC is equipped with a simple and resource efficientcommunication technique As every RC can in principle operate at a differ-ent clock speed, an asynchronous handshake mechanism was implemented As
Trang 37consump-FIGURE 1.4
MORA control unit flow chart
MORA is a streaming architecture, a two-way communication mechanism isrequired, one to communicate with the upstream RCs and another to commu-nicate with the downstream RCs Altogether, a total of four communicationI/O signals are used by each RC to communicate with the other RCs efficiently
in streaming fashion They are described as follows:
• rc rdy up is an output signal signifying that the RC is idle and ready toaccept data from upstream RCs
• rc rdy down is an input signal signifying that the downstream RCs areidle and ready to accept new data
Trang 38Algorithm 1.1: Address Generation Algorithm
address = base address
Trang 39• data rdy down is an output signal asserted when all the data transfers tothe downstream RCs are completed.
• data rdy up is an input signal to the RC corresponding to thedata rdy down signal from upstream RCs
Each RC can accept inputs either from two output ports of a single RC or fromtwo individual ports of different RCs The output of each RC can be routed
to, at most, four different RCs In order to support multiple RC connections
to a single cell, a two-bit vector is used for data rdy up (data rdy up[1:0]) and
a four-bit vector for rc rdy down (rc rdy down[3:0])
1.3.4 Execution Model
The RC has two operating modes: processing and loading When the RC isoperating in processing mode, it can either write the processed data backinto internal memory or write to a downstream RC For a formal description
of MORA’s execution model we refer the reader to Vanderbauwhede et al.(2009) Each RC has two execution modes while processing input data One
is sequential execution used for normal instructions (ADD, SUB, etc.) withwrite back option The second is pipelined execution for accumulation andinstructions with write out option Instructions with sequential execution takethree clock cycles to complete, with each clock cycle corresponding to reading,executing, and writing data to the RAM A prefetching technique is usedfor reading instructions from the instruction memory; this involves reading
a new instruction word while performing the last operation of the previousinstruction This approach enables the design to save one clock cycle for everynew instruction
For pipelined operations the controller utilizes the pipelining stage betweenthe RAM and PE This style of implementation allows the accumulation andwrite out operations to complete in n + 2 clock cycles The latency of 2 clockcycles results from reading and execution of the first set of operands Thesingle-cycle execution for instruction with write out option makes the RCvery efficient for streaming algorithms
The aim of the MORA Intermediate Representation (IR) language is to serve
as a compilation target for high-level languages such as MORA-C++ whilst
at the same time providing a means of programming the MORA processorarray at a low level
The language consists of three components: a coordination component
Trang 40MORA: High-Level FPGA Programming Using a Many-Core Framework 13which permits expression of the interconnection of the RCs in a hierarchi-cal fashion, an expression component which corresponds to the conventionalassembly languages for microprocessors and digital signal processors (DSPs),and a generation component which allows compile-time generation of coordi-nation and expression instances.
1.4.1 Expression Language
The MORA expression language is an imperative language with a very regularsyntax similar to other assembly languages: every line contains an instructionwhich consists of an operator followed by a list of operands The main differ-ences with other assembly languages are
• Typed operators: the type indicates the word size on which the operation
is performed, e.g., bit, byte, short
• Typed operands: operands are tuples indicating not only the address spacebut also the data type, i.e., word, row, column, or matrix
• Virtual registers and address banks: MORA has direct memory accessand no registers (an alternative view is that every memory location is aregister) Operations take the RAM addresses as operands; however, ‘vir-tual’ registers indicate where the result of an operation should be directed(RAM bank A/B, output L/R/both)
1.4.1.1 Instruction Structure
An instruction is of the general form:
instr ::= op nops dest opnd +
num ::=−(MEMSZ/2−1) (MEMSZ/2−1)
We illustrate these characteristics with an example The instruction for amultiply-accumulate of the first row of an 8× 8 matrix of 16-bit integers withthe first column of an 8× 8 matrix in the data memory reads in full:
www.Ebook777.com