Pipelined Processor Farms: Structured Design for Embedded Parallel Systems
This text is printed on acid-free paper.
Copyright © 2001 by John Wiley & Sons, Inc. All rights reserved.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, (212) 850-6011, fax (212) 850-6008, E-Mail: PERMREQ@WILEY.COM.

For ordering and customer service, call 1-800-CALL-WILEY.

Library of Congress Cataloging-in-Publication Data is available.

ISBN 0-471-22438-3

This title is also available in print as ISBN 0-471-38860-2.

Printed in the United States of America.
Foreword
Parallel systems are typically difficult to construct, to analyse, and to optimize. One way forward is to focus on stylized forms. This is the approach taken here, for Pipelined Processor Farms (PPF). The target domain is that of embedded systems with continuous flow of data, often with real-time constraints.

This volume brings together the results of ten years' study and development of the PPF approach and is the first comprehensive treatment beyond the original research papers. The overall methodology is illustrated throughout by a range of examples drawn from real applications. These show both the scope for practical application and the range of choices for parallelism, both in the pipelining and in the processor farms at each pipeline stage. Freedom to choose the numbers of processors for each stage is then a key factor for balancing the system and for optimizing performance characteristics such as system throughput and latency. Designs may also be optimized in other ways, e.g. for cost, or tuned for alternative choices of processor, including future ones, providing a high degree of future-proofing for PPF designs.
An important aspect is the ability to do "what if" analysis, assisted in part by a prototype toolkit, and founded on validation of predicted performance against real applications.

As the exposition proceeds, the reader will get an emerging understanding of designs being crafted quantitatively for desired performance characteristics. This in turn feeds into larger-scale issues and trade-offs between requirements, functionality, benefits, performance, and cost. The essence for me is captured by the phrase "engineering in the performance dimension".
CHRIS WADSWORTH
TECHNICAL CO-ORDINATOR
EPSRC PROGRAMME ON
PORTABLE SOFTWARE TOOLS FOR PARALLEL ARCHITECTURES
Preface
In the 1980s, the advent of the transputer led to widespread investigation of the potential of parallel computing in embedded applications. Application areas included signal processing, control, robotics, real-time systems, image processing, pattern analysis and computer vision. It quickly became apparent that although the transputer provided an effective parallel hardware component, and its associated language Occam provided useful low-level software tools, there was also a need for higher-level tools together with a systematic design methodology that addressed the additional design parameters introduced by parallelism.

Our work at that time was concerned with implementing real-time document processing systems which included significant computer vision problems requiring multiple processors to meet throughput and latency constraints. Reviews of similar work highlighted the fact that processor farms were often favored as an effective practical parallel implementation architecture, and that many applications embodied an inherent pipeline processing structure. After analyzing a number of our own systems and those reported by others, we concluded that a combination of the pipeline structure with a generalized processor farm implementation at each pipeline stage offered a flexible general-purpose architecture for soft real-time systems. We embarked upon a major project, PSTESPA (Portable Software Tools for Embedded Signal Processing Applications), to investigate the scope of the Pipeline Processor Farm (PPF) design model, both in terms of its application potential and the supporting software tools it required. Because the project focused mostly upon high-level
design issues, its outcome largely remains valid despite seismic changes within the parallel computing industry.

By the end of our PSTESPA project, notwithstanding its successful outcome, the goalposts of parallel systems had moved, and it was becoming apparent that many of the ambitious and idealistic goals of general-purpose parallel computing had been tempered by the pragmatic reality of market forces. Companies such as Inmos, Meiko, Parsys and Parsytec (producing transputer-based machines), and ICL, AMT, MasPar and Thinking Machines (producing SIMD machines), found that the market for parallel applications was too fragmented to support high-volume sales of large-scale parallel machines based upon specialized processing elements, and that application development was slow and difficult with limited supporting software tools. Shared-memory machines produced by major uniprocessor manufacturers such as IBM, DEC, Intel and Silicon Graphics, and distributed Networks of Workstations (NOWs), had however established a foothold in the market, because they are based around high-volume commercial off-the-shelf (COTS) processors, and achieved penetration in markets such as database and fileserving where parallelism could be supported within the operating system.

In our own application field of embedded systems, NOWs and shared-memory machines have a significant part to play in supporting the parallel logic development process, but implementation is now increasingly geared towards hardware-software co-design. Co-design tools may currently be based around heterogeneous computing elements ranging from conventional RISC and DSP processors at one end of the spectrum, through embedded processor cores such as ARM, to FPGAs and ASICs at the other. Historically, such tools have been developed bottom-up, and therefore currently betray a strong hardware design ethos, and a correspondingly weak high-level software design model. Our current research (also funded by EPSRC) is investigating how to extend the PPF design methodology to address this rapidly developing embedded applications market using a software component-based approach, which we believe can provide a valuable method of unifying current disparate low-level hardware-software co-design models. Such solutions will surely become essential as complex multimedia embedded applications become widespread in consumer, commercial and industrial markets over the next decade.
ANDY DOWNTON

Colchester, October 2000
Acknowledgments
Although this book has only two named authors, many others have contributed to its content, both by carrying out experimental work and by collaborating in writing the journal and conference papers from which the book is derived. Much of the early applications work was funded by BT Laboratories through the support of Mike Whybray.

Many people at BT contributed to this work through the provision of H.261 image coding software, and (later) other application codes for speech recognition and microphone beam forming. Other software applications, including those for model-based coding, H.263, and Eigenfaces, were also investigated in collaboration with BT. In addition to Mike Whybray, many others at BT Laboratories provided valuable support for work there, including Pat Mulroy, Mike Nilsson, Bill Welsh, Mark Shackleton, John Talintyre, Simon Ringland and Alwyn Lewis. BT also donated equipment, including a Meiko CS2 and Texas TMS320C40 DSP systems, to support our activities.
As a result of these early studies, funding was obtained from the EPSRC (the UK Engineering and Physical Sciences Research Council) to investigate the emergent PPF design methodology under a directed program on Portable Software Tools for Parallel Architectures (PSTPA). This project - PSTESPA (Parallel Software Tools for Embedded Signal Processing Applications) - enabled us not only to generalise the earlier work, but also to start investigating and prototyping software tools to support the PPF design process. Chris Wadsworth from Rutherford Appleton Laboratories was the technical coordinator of this program, and has our heartfelt thanks for the support and guidance he provided over a period of nearly four years. Adrian Clark, with extensive previous experience of parallel image processing libraries, acted as a consultant on the PSTESPA project, and Martin Fleury was appointed as our first research fellow, distinguishing himself so much that before the end of the project he had been appointed to the Department's academic staff. Several other research fellows also worked alongside Martin during the project: Herkole Sava, Nilufer Sarvan, Richard Durrant and Graeme Sweeney, and all contributed considerably to its successful outcome, as is evidenced by their co-authorship of many of the publications which were generated.
Publication of this book is possible not only because of the contributions of the many collaborators listed above, but also through the kind permission of the publishers of our journal papers, who have permitted us to revise our original publications to present a complete and coherent picture of our work here. We particularly wish to acknowledge the following sources of tables, figures and text extracts which are reproduced from previous publications:

The Institution of Electrical Engineers (IEE), for permission to reprint:

portions of A C Downton, R W S Tregidgo, and A Cuhadar, Top-down structured parallelization of embedded image processing applications, IEE Proceedings Part I (Vision, Image, and Signal Processing), 141(6):438-445, 1994, as text in Chapter 1, as Figures 1.1 and A.1-A.4, and as Table A.1;
portions of M Fleury, A C Downton, and A F Clark, Scheduling schemes for data farming, IEE Proceedings Part E (Computers and Digital Techniques), in press at the time of writing, as text in Chapter 6, as Figures 6.1-6.9, and as Tables 6.1 and 6.2;
portions of A C Downton, Generalised approach to parallelising image sequence coding algorithms, IEE Proceedings I (Vision, Image, and Signal Processing), 141(6):438-445, 1994, as text in Section 8.1, as Figures 8.6-8.12, and as Tables 8.1 and 8.2;
portions of H P Sava, M Fleury, A C Downton, and A F Clark, Parallel pipeline implementation of wavelet transforms, IEE Proceedings Part I (Vision, Image, and Signal Processing), 144(6):355-359, 1997, as text in Section 9.2, and as Figures 9.6-9.10;
portions of M Fleury, A C Downton, and A F Clark, Scheduling schemes for data farming, IEE Proceedings Part E (Computers and Digital Techniques), 146(5):227-234, 1999, as text in Section 11.9, as Figures 11.11-11.17, and as Table 11.6;
portions of M Fleury, H Sava, A C Downton, and A F Clark, Design of a clock synchronization sub-system for parallel embedded systems, IEE Proceedings Part E (Computers and Digital Techniques), 144(2):65-73, 1997, as text in Chapter 12, as Figures 12.1-12.4, and as Tables 12.1 and 12.2.
Elsevier Science, for inclusion of the following:

portions reprinted from Microprocessors and Microsystems, 21, A Cuhadar, A C Downton, and M Fleury, A structured parallel design for embedded vision systems: A case study, 131-141, Copyright 1997, with permission from Elsevier Science, as text in Chapter 3, as Figures 3.1-3.10, and as Tables 3.1 and 3.2;
portions reprinted from Image and Vision Computing, M Fleury, A F Clark, and A C Downton, Prototyping optical-flow algorithms on a parallel machine, in press at the time of writing, Copyright 2000, with permission from Elsevier Science, as text in Section 8.4, as Figures 8.19-8.28, and as Tables 8.8-8.12;
portions of Signal Processing: Image Communications, 7, A C Downton, Speed-up trend analysis for H.261 and model-based image coding algorithms using a parallel-pipeline model, 489-502, Copyright 1995, with permission from Elsevier Science, as text in Section 10.2, Figures 10.5-10.7, and Table 10.2.
Springer Verlag, for permission to reprint:

portions of H P Sava, M Fleury, A C Downton, and A F Clark, A case study in pipeline processor farming: Parallelising the H.263 encoder, in UK Parallel'96, 196-205, 1996, as text in Section 8.2, as Figures 8.13-8.15, and as Tables 8.3-8.5;
portions of M Fleury, A C Downton, and A F Clark, Pipelined parallelization of face recognition, Machine Vision Applications, in press at the time of writing, as text in Section 8.3, Figures 5.1 and 5.2, Figures 8.16-8.18, and Tables 8.6 and 8.7;
portions of M Fleury, A C Downton, and A F Clark, Karhunen-Loeve transform: An exercise in simple image-processing parallel pipelines, in Euro-Par'97, 815-819, 1997, as text in Section 9.1, Figures 9.4-9.5;
portions of M Fleury, A C Downton, and A F Clark, Parallel structure in an integrated speech-recognition network, in Euro-Par'99, 995-1004, 1999, as text in Section 10.1, Figures 10.1-10.4, and Table 10.1.
Academic Press, for permission to reprint:

portions of A Cuhadar, D G Sampson, and A C Downton, A scalable parallel approach to vector quantization, Real-Time Imaging, 2:241-247, 1995, as text in Section 9.3, Figures 9.11-9.19, and Table 9.2.
The Institute of Electrical and Electronics Engineers (IEEE), for permission to reprint:

portions of M Fleury, A C Downton, and A F Clark, Performance metrics for embedded parallel pipelines, IEEE Transactions on Parallel and Distributed Systems, in press at the time of writing, as text in Chapter 11, as Figures 2.2-2.4 and 11.1-11.10, and as Tables 11.1-11.5.
John Wiley & Sons Limited, for inclusion of:

portions of Constructing generic data-farm templates, M Fleury, A C Downton, and A F Clark, Concurrency: Practice and Experience, 11(9):1-20, 1999, © John Wiley & Sons Limited, reproduced with permission, as text in Chapter 7 and Figures 7.1-7.7.
The typescript of this book was typeset by the authors using LaTeX, MiKTeX and WinEdt.
A C D and M F
Contents

1 Introduction to PPF Systems
1.3 Amdahl's Law and Structured Parallel Design
2.2 Pipeline Types
2.2.1 Asynchronous PPF
2.2.2 Synchronous PPF
2.3 Data Farming and Demand-based Scheduling
2.4 Data-farm Performance Criteria
3.2 Parallelization of the Postcode Recognizer
3.2.1 Partitioning the postcode recognizer
3.2.2 Scaling the postcode recognizer
3.2.3 Performance achieved
3.3 Parallelization of the address verifier
3.3.1 Partitioning the address verifier
3.3.2 Scaling the address verifier
3.3.3 Address verification farms
3.3.4 Overall performance achieved
3.4 Meeting the Specification
7.3 Parallel Logic Implementation
7.4 Target Machine Implementation
7.4.1 Common implementation issues
7.5 'NOW' Implementation for Logic Debugging
7.6 Target Machine Implementations for Performance Tuning
7.7 Patterns and Templates
8.1.4 'Inter picture' quantization with motion estimation
8.1.5 Implementation of the parallel encoders
8.1.6 H.261 encoders without motion estimation
8.1.7 H.261 encoder with motion estimation
8.1.8 Edge data exchange
8.2 Case Study 2: H.263 Encoder/Decoder
8.2.1 Static analysis of H.263 algorithm
8.2.2 Results from parallelizing H.263
8.3 Case Study 3: 'Eigenfaces' - Face Detection
8.3.1 Background
10.2 Case Study 2: Model-based Coding
10.2.1 Parallelization of the model-based coder

Part IV Underlying Theory and Analysis
11.3 Gathering Performance Data
11.4 Performance Prediction Equations
11.5 Results
11.5.1 Prediction results
11.6 Simulation Results
11.7 Asynchronous Pipeline Estimate
11.8 Ordering Constraints
11.9 Task Scheduling
11.9.1 Uniform task size
11.9.2 Decreasing task size
11.9.3 Heuristic scheduling schemes
12.5 Establishing a Refresh Interval
12.6 Local Clock Adjustment
12.7 Implementation on the Paramid
Acronyms

AGP     Advanced Graphics Protocol
API     Application Programming Interface
APT     Analysis, Prediction, Template Toolkit
AR      Autoregressive
ASIC    Application Specific Integrated Circuits
ATR     Automatic Target Recognition
AWT     Abstract Window Toolkit
BSD     Berkeley Standard Distribution
BSP     Bulk Synchronous Parallel
CCITT   International Consultative Committee for Telephone and Telegraph
CDF     Cumulative Distribution Function
CDT     Categorical Data Type
CIF     Common Intermediate Format
COTS    Commercial Off-The-Shelf
CPU     Central Processing Unit
CSP     Communicating Sequential Processes
CSS     Central Synchronization Server
CWT     Continuous Wavelet Transform
DAG     Directed Acyclic Graph
DCOM    Distributed Component Object Model
DCT     Discrete Cosine Transform
DSP     Digital Signal Processor
DVD     Digital Versatile Disc
DWT     Discrete Wavelet Transform
FDDI    Fibre Distributed Data Interface
FFT     Fast Fourier Transform
FIFO    First-In-First-Out
FIR     Finite Impulse Response
FPGA    Field Programmable Gate Arrays
IBM     International Business Machines
IFFT    Inverse Fast Fourier Transform
IFR     Increasing Failure Rate
ISO     International Standards Organization
ITU     International Telecommunications Union
JIT     Just-in-Time
JPEG    Joint Photographic Experts Group
KLT     Karhunen-Loeve Transform
LAN     Local Area Network
LVCSR   Large Vocabulary Continuous-Speech Recognition
LWP     Light-Weight Process
MAC     Multiply Accumulate Operation
ME      Motion Estimation
MIMD    Multiple Instruction Multiple Data Streams
MIT     Massachusetts Institute of Technology
MMX     Multimedia Extension
MPEG    Motion Picture Experts Group
NUMA    Non-Uniform Memory Access
OCR     Optical Character Recognition
OF      Optical Flow
OOC     Object-oriented Coding
PC      Personal Computer
PCA     Principal Components Algorithm
PDF     Probability Distribution Function
PE      Processing Element
P-K     Pollaczek-Khintchine
POSIX   Portable Operating System-IX
PPF     Pipelined Processor Farms
PSNR    Peak Signal-to-Noise Ratio
PSTN    Public System Telephone Network
PVM     Parallel Virtual Machine
RISC    Reduced Instruction Set Computer
RMI     Remote Method Invocation
RPC     Remote Procedure Call
RTE     Run-time Executive
RTOS    Real-time Operating System
SAR     Synthetic Aperture Radar
SCSI    Small Computer System Interface
SIMD    Single Instruction Multiple Data Streams
SMP     Symmetric Multiprocessor
SNN     Semantic Neural Network
SPG     Series Parallel Graph
SSD     Sum-of-Squared-Differences
SSS     Safe Self-scheduling
STFT    Short-Time Fourier Transform
TM      Trademark
UTC     Universal Time Coordinated
w.r.t.  with respect to
WS      Wavelet Series
WWW     World Wide Web
Part I

Introduction and Basic Concepts
the design process. It appears that the potential offered by these additional design choices has led to an insistence by designers on obtaining maximum performance, with a consequent loss of generality. This is not surprising, because parallel solutions are typically investigated for the very reason that conventional sequential systems do not provide sufficient performance, but it ignores the benefits of generality which are accepted by sequential programmers. The sequential programming paradigm, or rather the abstract model of a computer on which it rests, was introduced by von Neumann [45] and has persisted ever since, despite the evident internal parallelism in most microprocessor designs (pipelined, vector, and superscalar [115]) and the obvious bottleneck if there is just one memory-access path from the central processing unit (CPU) for data and instructions alike.
1 Strictly, the term serial processing is more appropriate, as processing takes place on a serial machine or processor. The term sequential processing implies that the algorithms being processed are inherently sequential, whereas in fact they may contain parallel components. However, this book retains common usage and takes sequential processing to be synonymous with serial processing.
The model suits the way many programmers envisage the execution of their programs (a single step at a time), perhaps because errors are easier to find than when there is an interleaving of program order, as in parallel or concurrent programming paradigms.2
The Pipelined Processor Farms (PPF) design model, the subject of this book, can be applied in its simplest form to any Multiple Instruction Multiple Data streams (MIMD) [114] multiprocessor system.3 Single Instruction Multiple Data streams (SIMD) computer architecture, though current at the very-large scale integration (VLSI) chip-level, and to a lesser extent in multimedia-extension (MMX) microprocessor instructions for graphics support at the processor level [212], is largely defunct at the processor level, with a few honorable exceptions such as Cambridge Memory System's DAP and the MasPar series of machines [13].4 Of the two categories of MIMD machines, the primary concentration is upon distributed-memory machines, where the address space is partitioned logically and physically between processors. However, it is equally possible to logically partition shared-memory machines, where there is a global address space. The boundaries between distributed and shared-memory machines have dissolved in recent times [70], a point to be returned to in Chapter 13.
1.2 ORIGINS
The origins of the PPF design method arose in the late 1980s as a result of research carried out at the University of Essex to design and implement a real-time postcode/address recognition system for the British Post Office (see Chapter 3 for a description of the outcome of this process). Initial investigation of the image analysis and pattern recognition problems demonstrated that significant research and development was needed before any kind of working demonstrator could be produced, and that, of necessity, the first demonstrator would need to be a non-real-time software simulation running on a workstation. This provided the flexibility to enable easy experimental evaluation and algorithm updates using offline databases of address images,
2 Shared-memory machines can also relax read-write access across the processor set, ranging from strong to weak consistency, presenting a continuum of programming paradigms [259].

3 Categorisation of processors by the multiplicity of parallel data and instruction streams supported is a well-known extension of von Neumann's model [65].

4 Systolic arrays are also used for fine-grained signal processing [200], though largely again at the VLSI level. In systolic designs, data are pumped synchronously across an array of processing elements (PEs). At each step a different stage in processing takes place. Wavefront processors are an asynchronous version of the systolic architecture. Other forms of instruction-level parallelism are very-large instruction word (VLIW) DSPs (digital signal processors) and its variant, explicitly parallel instruction computing (EPIC) [319]. The idea of transferring SIMD arrays such as the DAP to VLSI has also been mooted; an experimental and novel SIMD VLSI array chip is described in [66].
and also a starting point for consideration of real-time implementation issues. In short, solving the problem at all was very difficult; generating a real-time solution (requiring a throughput of 10 envelope images/second, with a latency of no more than 8 seconds for processing each image) introduced an additional dimension of processing speed which was beyond the bounds of available workstations.

A literature survey of the field of parallel processing at that time showed that numerous papers had been published on parallelization of individual image processing, image coding and image analysis algorithms (see, e.g., [362]), many inspired by the success of the transputer [136]. Most of these papers were of limited generality however, since they reported bespoke parallelization of specific well-known algorithms such as 2-D filters, FFTs, DCTs, edge detectors, component labeling, Hough transforms, wavelets, segmentation algorithms, etc. Significantly, examination of many of these customized parallel algorithms revealed, in essence, the same solution; that of the single, demand-based data farm.

Practical image analysis and pattern recognition applications, however, typically contain a number of algorithms implemented together as a complete system. Like the postal address reading application, the CCITT H.261 encoder/decoder algorithm [49] is also a good illustration of this characteristic, since it includes algorithms for discrete cosine transformation (DCT), motion estimation and compensation, various filters, quantizers, variable length coding, and inverse versions of several of these algorithms. Very few papers addressed the issue of parallelizing complete systems, in which individual algorithm parallelization could be exploited as components. Therefore, a clue to an appropriate generic parallel architecture for embedded applications was to view the demand-based processor farm as a component within a higher-level system framework.
From our point of view, parallel processing was also simply a means to an end, rather than an end in itself. Our interest was in developing a general system design method for MIMD parallel processors, which could be applied after or during the initial iterative algorithm development phase. Too great a focus on performance at the expense of generality would inevitably have resulted in both implementations and design skills that rapidly became obsolete. We therefore aimed to support the early, architecture-independent stages of the design process, where parallelization of complete image processing applications is considered, by a process analogous to stepwise refinement in sequential program design [312, 335]. Among the advantages of the PPF design methodology which resulted are the following:

Upper bound (idealized) throughput scaling of the application is easily defined, and aspects of the application which limit scaling are identified.

Input/output latency is also defined and can be controlled.

Design effort is focused on each performance bottleneck of each pipeline stage in turn, by identifying the throughput, latency, and scalability.
Amdahl's law [15, 161 is the Ohm's law of parallel computing It predicts an
upper bound t o the performance of systems which contain both parallelization and inherently sequential components Amdahl's law states that the scaling performance of a parallel algorithm is limited by the number of inherently sequential operations in that algorithm Consider a problem where a fraction
f of the work must be performed sequentially The speed-up, S, possible from
a machine with N processors is:
I f f = 0.2 for example (i.e 20% of the algorithm is inherently sequential), then the maximum speedup however many processors are added is 5
As will be shown in later chapters, applying Amdahl's law to multi-algorithm embedded systems demonstrates that the scaling which can be achieved is largely defined, not by the number of processors used, but by any residual sequential elements within the complete application algorithm. Thus effective system parallelization requires a method of minimizing the impact of residual sequential code, as well as of parallelizing the bulk of the application algorithm. In the PPF design methodology, pipelining is used to overlap residual sequential code execution with other forms of parallelism.
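The bound is simple to evaluate numerically; the following minimal C sketch (an illustration only, not taken from any toolkit described in this book) computes it for a range of processor counts:

    #include <stdio.h>

    /* Amdahl's law: upper-bound speedup S for N processors when a
     * fraction f of the work is inherently sequential. */
    double amdahl_speedup(double f, int n)
    {
        return 1.0 / (f + (1.0 - f) / n);
    }

    int main(void)
    {
        int n_values[] = { 2, 8, 64, 1024 };
        /* f = 0.2: S approaches, but never exceeds, 1/f = 5. */
        for (int i = 0; i < 4; i++)
            printf("N = %4d  S = %.2f\n", n_values[i],
                   amdahl_speedup(0.2, n_values[i]));
        return 0;
    }

For f = 0.2 this prints speedups of 1.67, 3.33, 4.71 and 4.98: rapid early gains, then saturation at the sequential ceiling of 5.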
diameter is also restricted. The commercial off-the-shelf (COTS) processors used within such machines will outstrip the available interconnect bandwidth if combined in large configurations, since such processors were not designed with modularity in mind. To avoid this problem in PPF, a pipeline is partitioned into a number of stages, each one of which may be parallel. PPF is primarily aimed at continuous-flow systems in the field of signal processing, image processing, and multimedia in general.

A continuous-flow system is one in which data never cease to arrive, for example a radar processor which must always monitor air traffic. These systems frequently need to meet a variety of throughput, latency, and output-ordering specifications. It becomes necessary to be able to predict performance, and to provide a structure which permits performance scaling, by incremental addition of processors and/or transfer to higher performance hardware once the initial design is complete. The hard facts of achievable performance in a parallel system are further discussed in Section 2.4.
There are two basic or elementary types of pipeline components: asynchronous and synchronous, though many pipelined systems will contain some segments of each type. PPF caters for any type of pipeline, whether synchronous, asynchronous or mixed; their performance characteristics are discussed in detail in Section 2.2. Pipeline systems are a natural choice for some synchronous applications. For example, a systolic pipeline-partitioning methodology exists for signal-processing algorithms with a regular pattern [237]. Alternatively, [8] notice that there is an asynchronous pipeline structure to the mind's method of processing visual input which also maps onto computer hardware. If all information flow is in the forward direction [8], then the partitions of the pipeline mirror the peripheral, attentive, and cognitive stages of human vision [232]. The CMU Warp [18], the Cytocomputer [341], PETAL and VAP [56] are early examples of machines used in pipelined fashion for image processing.5 Input to the pipeline either takes the form of a succession of images grouped into a batch (medical slides, satellite images, video frames and the like) or raster-scan, in which a stream of pixels is input in the same order as a video camera scans a scene, that is in horizontal, zig-zag fashion. PPF generalizes the pipeline away from bespoke hardware and away to some extent from regular problems. Examples of applicable irregular, continuous-flow systems can be found in vision [50] (see Chapter 3), radar [97], speech-recognition processing [133], and data compression [52]. Chapters 8 and 9 give further detailed case studies where PPF has been consciously applied.
PPF is very much a systems approach to design; that is, it considers the entire system before the individual components. Another way of saying this is that PPF is a top-down as opposed to a bottom-up design methodology. For some years it has been noted [214] that many reported algorithm examples merely form a sub-system of a vision-processing system, while it is a complete system that forms a pipeline. Various systems approaches to pipeline implementation are then possible. With a problem-driven approach it may be difficult to assess the advantages and disadvantages of alternative architectures for any one stage of a problem. However, equally an architecture-driven design ties a system down to a restricted range of computer hardware. In PPF, the intention is to design a software structure that, when suitably parameterized, can map onto a variety of machines. Looking aside to a different field, Oracle has ported its relational database system to a number of apparently dissimilar parallel computers [337], including the Sequent Symmetry shared-memory machine and the nCube2 MIMD message-passing computer. Analogously to the database abstract machine, the software pipeline is a flexible structure for the PPF problem domain.

5 The common idea across these machines is to avoid the expense of a 2D systolic array by using a linear systolic array.
Having settled on a software pipeline, there are various forms of exploitable parallelism to be considered. The most obvious form of parallelism is temporal multiplexing, whereby several complete tasks are processed simultaneously, without decomposing individual tasks. However, simply increasing the degree of temporal multiplexing, though it can improve the mean throughput, does not change the latency experienced by an individual task. To reduce pipeline traversal latency, each task must be decomposed to allow the component parts to experience their latency in parallel. Geometric parallelism (decomposing by some partition of the data) or algorithmic parallelism (decomposition by function) are the two main possibilities available for irregularly structured code on medium-grained processors.6 After geometric decomposition, data must be multiplexed by a farmer process across the processor farm, which is why in PPF data parallelism is alternatively termed geometric multiplexing. When a processor farm utilizes geometric multiplexing, it is called a data farm, and certainly the term data farm is more common in the literature.7 This book does not include many examples of algorithmic parallelism, not by intent but because the practical opportunities of exploiting this form of parallelism are limited. An early analysis [277] in the field of single-algorithm image processing established both the difficulty of finding suitable algorithmic decompositions and the limited speed-up achievable by functional decomposition. However, algorithmic parallelism does have a role in certain applications, which is why it is not discounted in PPF. For example, pattern matching may employ a parallel search [202], a form of OR-parallelism, whereby alternative searches take place though only the results of successful searches are retained.8
6 Dataflow computers [340] have been proposed as a way of exploiting the parallelism inherent in irregularly structured code (i.e. code in which there are many decision points resulting in branching), but though there are research processors [79], no commercial dataflow computer has ever been produced.

7 The term data parallelism is an alternative to geometric parallelism, but this term has the difficulty that data parallelism is associated with parallel decomposition of regular code (i.e. code with few branch points) by a parallel compiler.

8 Divide-and-conquer search algorithms may be termed AND-parallelism, as the results of parallel searches may be combined through an AND-tree [294].
Bringing together the preceding discussion, it can be stated that:

1. A data set can be subdivided over multiple processors (data parallelism or geometric multiplexing).

2. The algorithm can be partitioned over multiple processors (algorithmic parallelism).

3. Multiple processors can each process one complete task in parallel (processor farming or temporal multiplexing).

4. The algorithm can be partitioned serially over multiple processors (pipelining), pipelining being an instance of algorithmic parallelism.

5. The four basic approaches outlined above can be combined as appropriate.
The field of low-level image processing [74] illustrates how these forms of parallelism can be applied within a processor farm:

Geometric multiplexing: An example of geometric multiplexing is where a frame of image data is decomposed onto a grid of processors. Typical low-level image-processing operations such as convolution and filtering can then be carried out independently on each sub-image, requiring reference only to the four nearest-neighbor processors for boundary information. To adapt such operations to a processor farm, the required boundary information for each processor can be included in the original data packet sent to the processor.

Algorithmic parallelism: In the case of algorithmic parallelism, different parts of an algorithm which are capable of concurrent execution can be farmed to different processors; for example, the two convolutions with horizontal and vertical masks could be executed on separate processors concurrently in the case of a Sobel edge detector [290, 75]. The advantage of a processor farm in this context is that no explicit synchronization of processors is required; however, the algorithm itself normally defines explicitly the possible degree of parallelism (i.e. incremental scaling is not possible).
Temporal multiplexing: Applying each of a sequence of images to a separate processor does not speed up the time to process an individual image, but enables the average system throughput to be scaled up in direct proportion to the number of processors used. The approach is limited by the allowable latency between the input and output of the system, which is not reduced by temporal parallelism.
Pipelining: Pure pipelining has the same effect as temporal multiplexing in speeding up overall application throughput without reducing the latency. In the example of Fig. 1.1a, a sequential algorithm taking 10 seconds per task, when split into an unbalanced four-stage pipeline with per-stage times of 4, 4, 4 and 3 seconds, improves its throughput to 0.25 tasks/second (limited by the slowest pipeline stage), a speedup of 2.5. Note however that the latency (delay between task input and task output) increases from 10 seconds for the sequential algorithm to 15 seconds (3 x 4 seconds + 3 seconds for the final stage) for the unbalanced pipeline shown in Fig. 1.1a.
The role of pipelining within the PPF design philosophy is to increase throughput and reduce latency by allowing necessarily independent components of an application (some of which may be inherently sequential) to be overlapped.

By combining the techniques described above, and mapping a PPF architecture onto the pipeline of stages which comprises any embedded application, both the throughput and the latency of the application can be scaled. Fig. 1.1b illustrates the effect of using temporal multiplexing alone to achieve throughput scaling: when the throughput of each pipeline stage is matched at 1 task/second, a speedup of 10 is achieved with the same latency as the original sequential algorithm. Of course, exactly the same throughput scaling (with unchanged latency) could be achieved using a single processor farm, with each processor executing a copy of the complete application. The reason for using a pipeline instead is to break down the overall application into its sub-components, so that data or algorithmic parallelism can be exploited to reduce latency as well as increase throughput.

Finally, Fig. 1.1c illustrates the exploitation of data or algorithmic parallelism in each pipeline stage instead of temporal multiplexing: in this case, the same speedup of 10 is achieved, but with a reduction of latency to 4 seconds.
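The arithmetic behind these comparisons is easily captured in a few lines of C. The sketch below is an illustration of ours, using the Fig. 1.1 stage times and the same idealized assumption that data or algorithmic parallelism divides a stage's service time exactly, with communication costs ignored:

    #include <stdio.h>

    #define STAGES 4

    int main(void)
    {
        double stage_time[STAGES] = { 4.0, 4.0, 4.0, 3.0 }; /* Fig. 1.1a */
        int workers[STAGES]       = { 4, 4, 4, 3 };         /* Fig. 1.1c */

        double slowest = 0.0, latency = 0.0;
        for (int i = 0; i < STAGES; i++) {
            /* data/algorithmic parallelism divides the stage time;
             * temporal multiplexing would raise throughput but leave
             * per-task latency unchanged */
            double t = stage_time[i] / workers[i];
            latency += t;
            if (t > slowest)
                slowest = t;
        }
        printf("throughput = %.2f tasks/s, latency = %.1f s\n",
               1.0 / slowest, latency);
        return 0;
    }

With one worker per stage this reproduces Fig. 1.1a (0.25 tasks/s, 15 s latency); with the allocation shown, every stage's service time falls to 1 second, reproducing Fig. 1.1c (1 task/s, 4 s latency, a speedup of 10).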
Appendix A.1 below illustrates how basic profiling data, extracted from execution of a sequential image coding algorithm, can be used to guide the PPF design process to achieve a scalable parallel implementation of the algorithm with analytically defined performance bounds.
Fig. 1.1 Pipeline configurations: (a) simple pipeline, per-stage latencies 4 s, 4 s, 4 s and 3 s, throughput = 0.25 jobs/s, latency = 15 s; (b) temporally multiplexed pipeline; (c) pipeline with per-stage data/algorithmic parallelism.

1.5 CONCLUSIONS

The primary requirement in parallelizing embedded applications is to meet a particular specification for throughput and latency. The Pipeline Processor Farm (PPF) design model maps conveniently onto the software structure of many continuous data flow embedded applications, provides incrementally scalable performance, and enables upper-bound scaling performance to be easily estimated from profiling data generated by the original sequential implementation. Using the PPF model, sequential sub-components of the complete application are identified, from which data or algorithmic parallelism can be easily extracted. Where neither of these forms of parallelism is exploitable (i.e. the residual sequential components identified in Amdahl's law), temporal multiplexing can often be used to match pipeline throughput without reducing latency. Each pipeline stage will then normally map directly onto the major functional blocks of the software implementation, written in any procedural language. Furthermore, the exact degree of parallelization of each block required to balance the pipeline can be determined directly from its sequential execution time.
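That final calculation is direct: given profiled sequential times for each block and a target per-task service time, the worker count per stage is the ceiling of their ratio. A small illustrative sketch in C (the stage times here are hypothetical, not profiled figures from any case study in this book):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* hypothetical sequential execution times per stage (seconds) */
        double stage_time[] = { 0.9, 7.8, 1.3 };
        double target = 0.9;  /* per-task service time to be matched */

        /* workers needed so each stage's service time <= target */
        for (int i = 0; i < 3; i++)
            printf("stage %d: %d worker(s)\n", i + 1,
                   (int)ceil(stage_time[i] / target));
        return 0;
    }

This allocates 1, 9 and 2 workers respectively, balancing the pipeline at roughly 0.9 seconds per task.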
Appendix

A.1 SIMPLE DESIGN EXAMPLE: THE H.261 DECODER
Image sequence coding algorithms are well known to be computationally intensive, due in part to the massive continuous input/output required to process up to 25 or 30 image frames per second, and in part to the computational complexity of the underlying algorithms. In fact, it was noted (in 1992) [380] that it was only just possible to implement the full H.261 encoder algorithm for quarter-CIF (176 x 144 pixels) images in real time on DSP chips such as the TMS320C30. In this case study, a non-real-time H.261 decoder algorithm, developed for standards work at BT Laboratories and written in C, was parallelized to speed up execution on an MIMD transputer-based Meiko Computing Surface. Results presented are based upon execution times measured when the H.261 algorithm was run on sequences of 352 x 288 pixel common intermediate format (CIF) images.
Fig. A.1 shows a simplified representation of the H.261 decoder architecture. The decoder consists of a 3-stage pipeline of processes, with feedback of the previous picture applied around the second stage. Feedback within a pipeline is a key constraint on parallelism, since it restricts the degree to which temporal multiplexing can be exploited: in the H.261 decoder, the reconstructed previous frame is used to construct the current frame from the decoded difference picture.
Table A.1 summarizes the most computationally intensive functions within the BT H.261 decoder, and is derived from statistics generated by the Sun profiling tool gprof [138] while running the decoder on 30 image frames of data on a Sparc2 processor. To simplify interpretation, processing times have been normalized for one frame of data. The 10 functions listed in the table constitute 99.2% of total execution time.

Fig. A.1 Simplified representation of the H.261 decoder execution timing.

Table A.1 Summary execution profile statistics for the H.261 decoder sequence. (Columns: function name; normalized execution time in seconds. The individual entries are not reproduced here.)

Program execution of the H.261 decoder can be broken down on a per-frame basis into a pipeline of three major components:

T1 frame initialization (functions 1 and 2 in Table A.1);
T2 frame decoder loop (functions 3-8 in Table A.1); and

T3 frame output (functions 9 and 10 in Table A.1).

The first and last of these components are executed once for each image frame, whereas the middle component contains considerable data parallelism and involves a loop executed 396 times (once for each 16 x 16 pixel macroblock making up a CIF picture). It is therefore clear that considerable scope exists for speeding up the middle stage of the pipeline by exploiting data parallelism. Temporal multiplexing cannot be utilized because each image frame is reconstructed by means of a difference picture added to the motion-compensated previous frame (although it would be possible to partially overlap the decoding of consecutive frames). Since pipeline stages T1 and T3 are inherently sequential, direct application of Amdahl's law to the data in Fig. A.1 shows that f = 0.22, giving a maximum speedup of only 4.55. An asymptotic approach to this speedup could be obtained by parallelizing the decoder using a single processor farm, with the data-parallel component T2 farmed onto worker processors, and the remaining code executed on the master processor.

The upper-bound predicted speedup for the PPF is presented graphically in Fig. A.2 and may be represented theoretically by the following piecewise approximation:

    S = (T1 + T2 + T3) / (T2/(n - 2)),   if T2/(n - 2) >= T3
    S = (T1 + T2 + T3) / T3,             otherwise

where the first and last stages of the PPF contain a single processor, the second processor farm stage contains n - 2 processors, and T1-T3 are the execution times of the three stages of the pipeline shown in Fig. A.1. As the throughput for a PPF is defined solely by the slowest pipeline stage, its speedup is given by the ratio of sequential application execution time to the execution time for this stage alone (this illustrates the advantage of the pipeline in overlapping execution of residual sequential components). Where (as in this case) the slowest stage is perfectly parallelizable (i.e. it contains no residual sequential elements and thus f = 0 in Amdahl's law), linear speedup is obtained up to the point where the scaled stage is no longer the slowest. The first equation defines this case, where the performance increases linearly as the number of workers in the processor farm increases (S is proportional to n); this continues until the execution time for the processor farm drops below that of the next slowest stage, T3 in this case. The second equation then defines the fixed scaling achieved for any further increase in processor numbers (S is fixed and independent of n).
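The same piecewise bound can be computed numerically. In the C sketch below the stage times are illustrative values only, normalized so that T1 + T2 + T3 = 1 and chosen to be consistent with f = 0.22 and linear scaling up to six workers; they are not the profiled H.261 figures:

    #include <stdio.h>

    /* Upper-bound speedup of a 3-stage PPF: one processor in the first
     * and last stages, n - 2 workers farming the middle stage. */
    double ppf_speedup(double t1, double t2, double t3, int n)
    {
        double farm = t2 / (n - 2);       /* farmed middle-stage time */
        double slowest = farm;
        if (t1 > slowest) slowest = t1;
        if (t3 > slowest) slowest = t3;
        return (t1 + t2 + t3) / slowest;
    }

    int main(void)
    {
        double t1 = 0.09, t2 = 0.78, t3 = 0.13; /* illustrative only */
        for (int n = 3; n <= 12; n++)
            printf("n = %2d  S = %.2f\n", n, ppf_speedup(t1, t2, t3, n));
        return 0;
    }

Speedup grows linearly until n = 8 (six workers), after which it is pinned at (T1 + T2 + T3)/T3, about 7.7 for these values, however many workers are added.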
It is assumed that the processor farm implementing the middle stage of the pipeline receives its work packets directly from the first stage and passes its results directly to the third stage, as in the topology of Fig. A.3, where an implementation with five worker processors is shown. The analysis is of course idealized, ignores communication overheads, and assumes static task characteristics. As can be seen from Fig. A.2, the performance is predicted to scale linearly up to six workers (8 processors total).

Fig. A.2 Idealized and actual speedup for the H.261 decoder.
Fig. A.3 PPF topology for a 3-stage pipeline with 5 workers in the second stage.
Actual scaling performance results are also presented in Fig. A.2, for two different practical cases. In both cases, the scaling performance is less than that predicted in the idealized graph, due to communication overheads being neglected, but the general shape of the graphs is in other respects as predicted. The maximum speedup obtained (5.59) exceeds the limit predicted by Amdahl's law, thus demonstrating the advantage which the PPF has compared with a single processor farm implementation. In practice, transputer communication links do not provide sufficient bandwidth for real-time communication of H.261 CIF picture data structures, and therefore communication overheads substantially limit the performance scaling which can be achieved in a transputer-based system. On the Analog Devices Sharc family of processors with six link ports, real-time parallel processing of image sequences is far more practicable. For example, the ADSP-21160 [14], running at 100 MHz, supports 'glueless' multiprocessing and floating point like the transputer, but now is superscalar, with a maximum of six issues per cycle.
In the first implementation, each image was simply subdivided into a number of horizontal strips defined by the number of processor-farm workers, in line with the idealized model of data parallelism presented earlier. As can be seen from Fig. A.4(a), this results in a series of black strips in the reconstructed image, where data adjacent to each worker's sub-image were not available for constructing the motion-compensated previous image. In the second implementation, additional rows of macroblocks at the boundaries of the sub-image processed by each farm worker were exchanged in a second communication phase between the master and worker processors in the processor farm, after the difference image had been decoded. This enables the full motion-compensated previous image to be reconstructed, as shown in Fig. A.4(b), but results in an additional communication overhead, which decreases scaling performance compared with the case where edge data are not exchanged.
Fig. A.4 Sample image output by the parallel H.261 decoder with 5 workers: (a) without edge data exchange and (b) with edge data exchange.
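The strip decomposition and its boundary rows can be illustrated with a short C sketch (hypothetical helper code of ours, not the BT decoder itself); it shows only the partitioning arithmetic, with the halo rows standing in for the data carried by the second, edge-exchange communication phase:

    #include <stdio.h>

    #define MB_ROWS 18   /* CIF: 288 lines / 16 = 18 macroblock rows */

    /* Macroblock rows held by one worker: its own strip plus 'halo'
     * extra rows each side for motion-compensated reconstruction. */
    static void strip_bounds(int worker, int n_workers, int halo,
                             int *first, int *last)
    {
        int base = MB_ROWS / n_workers, rem = MB_ROWS % n_workers;
        int lo = worker * base + (worker < rem ? worker : rem);
        int hi = lo + base + (worker < rem ? 1 : 0) - 1;
        *first = (lo - halo < 0) ? 0 : lo - halo;
        *last  = (hi + halo > MB_ROWS - 1) ? MB_ROWS - 1 : hi + halo;
    }

    int main(void)
    {
        int first, last;
        for (int w = 0; w < 5; w++) {   /* five workers, as in Fig. A.3 */
            strip_bounds(w, 5, 1, &first, &last);
            printf("worker %d: macroblock rows %d-%d\n", w, first, last);
        }
        return 0;
    }

With halo = 0, pixels just outside each strip are unavailable and the black boundary strips of Fig. A.4(a) appear; with halo = 1 the full motion-compensated reference is available, at the cost of communicating the extra rows.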
2

Basic Concepts
Consider automatic target recognition (ATR) of aircraft found by Synthetic Aperture Radar (SAR) ...
... design features of a PPF. There is: a single flow of processing control through ...
... a need to coordinate the flow of data across the hierarchy of ATR algorithms, ...
Having arrived at a design for (say) a DSP processor at the computation layer, why consider an FPGA alternative? Why, in fact, partition an application between a communication structure and computation layer? The key problem a parallel system designer must face is how to make a system scalable. This is not simply because larger (or smaller) problems can be tackled solely by adding hardware in an incremental fashion, without otherwise changing the design, important though a modular design remains. Equally important is that uniprocessor performance increases in proportion to the number of transistors on a microchip, which has been observed to double approximately every eighteen months (the well-known Moore's law1). Therefore, a design tied to a specialized parallel machine may well be rapidly overtaken in terms of price and performance by a uniprocessor implementation. The principal reason for the shift to COTS hardware is to exploit the economies of scale that arise within the uniprocessor market, which lead to exponential gains in performance. In other words, by exchanging the computation hardware within the design, which can also be modular, a design is made doubly scalable, and hopefully future-proof. As the life cycle of a typical commercial microprocessor is less than five years, while the life time of many embedded products is much longer (e.g. an avionics system has a lifetime greater than thirty years), system (or code) portability [23] is an important method of amortizing the investment in the original embedded software.
A PPF design is a pipeline of processor farms. The essence of a processor farm within PPF is one central farmer, manager, or controller process, and a set of worker or slave processes spread across the processor farm. Notice that there is no insistence in PPF on having a single worker process per processor, though in fact our farm template design (Chapter 7) does not exploit parallel slackness [329] by having more than one process to a processor. In a shared-memory MIMD machine, worker threads [189] replace worker processes, a thread being a single line of instruction control existing in a shared or global address space. The role of the farmer is not simply to coordinate the activity of the workers but additionally to pass partially-processed work onto the next stage of processing. By introducing modularity, each module being a farm, it becomes possible to cope with heterogeneous hardware, and separately to scale each farm as larger problems or versions of the original problem are tackled.
PPF is appropriate for all applications with continuous data input/output, a characteristic typical of soft, real-time, embedded systems.2 However, PPF is by no means a panacea for all such embedded systems, and in Chapter 10,

1 Moore's law is named after Gordon Moore, co-founder and Chairman of Intel, who discovered the law in 1965.

2 Soft, real-time systems, as opposed to hard, real-time systems, are those in which responsiveness to deadlines can be relaxed. Hard, real-time systems [216] usually involve the control of machinery, such as in fly-by-wire avionics and industrial manufacturing control, and are not the subject of this book.