Pipelined Processor Farms: Structured Design for Embedded Parallel Systems
This text is printed on acid-free paper.
Copyright © 2001 by John Wiley & Sons, Inc. All rights reserved.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, (212) 850-6011, fax (212) 850-6008, E-Mail: PERMREQ@WILEY.COM.

For ordering and customer service, call 1-800-CALL-WILEY.

Library of Congress Cataloging-in-Publication Data is available.

ISBN 0-471-22438-3

This title is also available in print as ISBN 0-471-38860-2.

Printed in the United States of America.
Foreword
Parallel systems are typically difficult to construct, to analyse, and to optimize. One way forward is to focus on stylized forms. This is the approach taken here, for Pipelined Processor Farms (PPF). The target domain is that of embedded systems with continuous flow of data, often with real-time constraints.

This volume brings together the results of ten years' study and development of the PPF approach and is the first comprehensive treatment beyond the original research papers. The overall methodology is illustrated throughout by a range of examples drawn from real applications. These show both the scope for practical application and the range of choices for parallelism, both in the pipelining and in the processor farms at each pipeline stage. Freedom to choose the numbers of processors for each stage is then a key factor for balancing the system and for optimizing performance characteristics such as system throughput and latency. Designs may also be optimized in other ways, e.g. for cost, or tuned for alternative choices of processor, including future ones, providing a high degree of future-proofing for PPF designs.
An important aspect is the ability to do "what if" analysis, assisted in part by a prototype toolkit, and founded on validation of predicted performance against real applications.

As the exposition proceeds, the reader will get an emerging understanding of designs being crafted quantitatively for desired performance characteristics. This in turn feeds into larger-scale issues and trade-offs between requirements, functionality, benefits, performance, and cost. The essence for me is captured by the phrase "engineering in the performance dimension".
CHRIS WADSWORTH
TECHNICAL CO-ORDINATOR
EPSRC PROGRAMME ON
PORTABLE SOFTWARE TOOLS FOR PARALLEL ARCHITECTURES
Preface
In the 1980s, the advent of the transputer led to widespread investigation of the potential of parallel computing in embedded applications. Application areas included signal processing, control, robotics, real-time systems, image processing, pattern analysis and computer vision. It quickly became apparent that although the transputer provided an effective parallel hardware component, and its associated language Occam provided useful low-level software tools, there was also a need for higher-level tools together with a systematic design methodology that addressed the additional design parameters introduced by parallelism.

Our work at that time was concerned with implementing real-time document processing systems which included significant computer vision problems requiring multiple processors to meet throughput and latency constraints. Reviews of similar work highlighted the fact that processor farms were often favored as an effective practical parallel implementation architecture, and that many applications embodied an inherent pipeline processing structure. After analyzing a number of our own systems and those reported by others, we concluded that a combination of the pipeline structure with a generalized processor farm implementation at each pipeline stage offered a flexible general-purpose architecture for soft real-time systems. We embarked upon a major project, PSTESPA (Portable Software Tools for Embedded Signal Processing Applications), to investigate the scope of the Pipeline Processor Farm (PPF) design model, both in terms of its application potential and the supporting software tools it required. Because the project focused mostly upon high-level
design issues, its outcome largely remains valid despite seismic changes within the parallel computing industry.

By the end of our PSTESPA project, notwithstanding its successful outcome, the goalposts of parallel systems had moved, and it was becoming apparent that many of the ambitious and idealistic goals of general-purpose parallel computing had been tempered by the pragmatic reality of market forces. Companies such as Inmos, Meiko, Parsys and Parsytec (producing transputer-based machines), and ICL, AMT, MasPar and Thinking Machines (producing SIMD machines), found that the market for parallel applications was too fragmented to support high-volume sales of large-scale parallel machines based upon specialized processing elements, and that application development was slow and difficult with limited supporting software tools. Shared-memory machines produced by major uniprocessor manufacturers such as IBM, DEC, Intel and Silicon Graphics, and distributed Networks of Workstations (NOWs), had however established a foothold in the market, because they are based around high-volume commercial off-the-shelf (COTS) processors, and achieved penetration in markets such as database and fileserving where parallelism could be supported within the operating system.

In our own application field of embedded systems, NOWs and shared-memory machines have a significant part to play in supporting the parallel logic development process, but implementation is now increasingly geared towards hardware-software co-design. Co-design tools may currently be based around heterogeneous computing elements ranging from conventional RISC and DSP processors at one end of the spectrum, through embedded processor cores such as ARM, to FPGAs and ASICs at the other. Historically, such tools have been developed bottom-up, and therefore currently betray a strong hardware design ethos, and a correspondingly weak high-level software design model. Our current research (also funded by EPSRC) is investigating how to extend the PPF design methodology to address this rapidly developing embedded applications market using a software component-based approach, which we believe can provide a valuable method of unifying current disparate low-level hardware-software co-design models. Such solutions will surely become essential as complex multimedia embedded applications become widespread in consumer, commercial and industrial markets over the next decade.
ANDY DOWNTON

Colchester, October 2000
Acknowledgments
Although this book has only two named authors, many others have contributed to its content, both by carrying out experimental work and by collaborating in writing the journal and conference papers from which the book is derived. Much of the early applications work was funded by BT Laboratories through the support of Mike Whybray.

Many people at BT contributed to this work through the provision of H.261 image coding software, and (later) other application codes for speech recognition and microphone beam forming. Other software applications, including those for model-based coding, H.263, and Eigenfaces, were also investigated in collaboration with BT. In addition to Mike Whybray, many others at BT Laboratories provided valuable support for work there, including Pat Mulroy, Mike Nilsson, Bill Welsh, Mark Shackleton, John Talintyre, Simon Ringland and Alwyn Lewis. BT also donated equipment, including a Meiko CS2 and Texas TMS320C40 DSP systems, to support our activities.
As a result of these early studies, funding was obtained from the EPSRC (the UK Engineering and Physical Sciences Research Council) to investigate the emergent PPF design methodology under a directed program on Portable Software Tools for Parallel Architectures (PSTPA). This project - PSTESPA (Parallel Software Tools for Embedded Signal Processing Applications) - enabled us not only to generalise the earlier work, but also to start investigating and prototyping software tools to support the PPF design process. Chris Wadsworth from Rutherford Appleton Laboratories was the technical coordinator of this program, and has our heartfelt thanks for the support and guidance he provided over a period of nearly four years. Adrian Clark, with extensive previous experience of parallel image processing libraries, acted as a consultant on the PSTESPA project, and Martin Fleury was appointed as our first research fellow, distinguishing himself so much that before the end of the project he had been appointed to the Department's academic staff. Several other research fellows also worked alongside Martin during the project: Herkole Sava, Nilufer Sarvan, Richard Durrant and Graeme Sweeney, and all contributed considerably to its successful outcome, as is evidenced by their co-authorship of many of the publications which were generated.
Publication of this book is possible not only because of the contributions of the many collaborators listed above, but also through the kind permission of the publishers of our journal papers, who have permitted us to revise our original publications to present a complete and coherent picture of our work here. We particularly wish to acknowledge the following sources of tables, figures and text extracts which are reproduced from previous publications:

The Institution of Electrical Engineers (IEE), for permission to reprint:

portions of A C Downton, R W S Tregidgo, and A Cuhadar, Top-down structured parallelization of embedded image processing applications, IEE Proceedings Part I (Vision, Image, and Signal Processing), 141(6):438-445, 1994, as text in Chapter 1, as Figures 1.1 and A.1-A.4, and as Table A.1;
portions of M Fleury, A C Downton, and A F Clark, Scheduling schemes for data farming, IEE Proceedings Part E (Computers and Digital Techniques), in press at the time of writing, as text in Chapter 6, as Figures 6.1-6.9, and as Tables 6.1 and 6.2;
portions of A C Downton, Generalised approach to parallelising image sequence coding algorithms, IEE Proceedings I (Vision, Image, and Signal Processing), 141(6):438-445, 1994, as text in Section 8.1, as Figures 8.6-8.12, and as Tables 8.1 and 8.2;
portions of H P Sava, M Fleury, A C Downton, and A F Clark, Parallel pipeline implementation of wavelet transforms, IEE Proceedings Part I (Vision, Image, and Signal Processing), 144(6):355-359, 1997, as text in Section 9.2, and as Figures 9.6-9.10;
portions of M Fleury, A C Downton, and A F Clark, Scheduling schemes for data farming, IEE Proceedings Part E (Computers and Digital Techniques), 146(5):227-234, 1999, as text in Section 11.9, as Figures 11.11-11.17, and as Table 11.6;
portions of M Fleury, H Sava, A C Downton, and A F Clark, Design of a clock synchronization sub-system for parallel embedded systems, IEE Proceedings Part E (Computers and Digital Techniques), 144(2):65-73, 1997, as text in Chapter 12, as Figures 12.1-12.4, and as Tables 12.1 and 12.2.
Elsevier Science, for inclusion of the following:

portions reprinted from Microprocessors and Microsystems, 21, A Cuhadar, A C Downton, and M Fleury, A structured parallel design for embedded vision systems: A case study, 131-141, Copyright 1997, with permission from Elsevier Science, as text in Chapter 3, as Figures 3.1-3.10, and as Tables 3.1 and 3.2;
portions reprinted from Image and Vision Computing, M Fleury, A F Clark, and A C Downton, Prototyping optical-flow algorithms on a parallel machine, in press at the time of writing, Copyright 2000, with permission from Elsevier Science, as text in Section 8.4, as Figures 8.19-8.28, and as Tables 8.8-8.12;
portions of Signal Processing: Image Communications, 7, A C Downton, Speed-up trend analysis for H.261 and model-based image coding algorithms using a parallel-pipeline model, 489-502, Copyright 1995, with permission from Elsevier Science, as text in Section 10.2, Figures 10.5-10.7, and Table 10.2.
Springer Verlag, for permission to reprint:

portions of H P Sava, M Fleury, A C Downton, and A F Clark, A case study in pipeline processor farming: Parallelising the H.263 encoder, in UK Parallel'96, 196-205, 1996, as text in Section 8.2, as Figures 8.13-8.15, and as Tables 8.3-8.5;
portions of M Fleury, A C Downton, and A F Clark, Pipelined parallelization of face recognition, Machine Vision Applications, in press at the time of writing, as text in Section 8.3, Figures 5.1 and 5.2, Figures 8.16-8.18, and Tables 8.6 and 8.7;
portions of M Fleury, A C Downton, and A F Clark, Karhunen-Loeve transform: An exercise in simple image-processing parallel pipelines, in Euro-Par'97, 815-819, 1997, as text in Section 9.1, Figures 9.4-9.5;
portions of M Fleury, A C Downton, and A F Clark, Parallel structure in an integrated speech-recognition network, in Euro-Par'99, 995-1004, 1999, as text in Section 10.1, Figures 10.1-10.4, and Table 10.1.
Academic Press, for permission to reprint:

portions of A Cuhadar, D G Sampson, and A C Downton, A scalable parallel approach to vector quantization, Real-Time Imaging, 2:241-247, 1995, as text in Section 9.3, Figures 9.11-9.19, and Table 9.2.
The Institute of Electrical and Electronics Engineers (IEEE), for permission to reprint:

portions of M Fleury, A C Downton, and A F Clark, Performance metrics for embedded parallel pipelines, IEEE Transactions on Parallel and Distributed Systems, in press at the time of writing, as text in Chapter 11, as Figures 2.2-2.4 and 11.1-11.10, and as Tables 11.1-11.5.
John Wiley & Sons Limited, for inclusion of:

portions of Constructing generic data-farm templates, M Fleury, A C Downton, and A F Clark, Concurrency: Practice and Experience, 11(9):1-20, 1999, © John Wiley & Sons Limited, reproduced with permission, as text in Chapter 7 and Figures 7.1-7.7.
The typescript of this book was typeset by the authors using LaTeX, MiKTeX and WinEdt.
A C D and M F
Contents

1 Introduction to PPF Systems
1.3 Amdahl's Law and Structured Parallel Design
2.2 Pipeline Types
2.2.1 Asynchronous PPF
2.2.2 Synchronous PPF
2.3 Data Farming and Demand-based Scheduling
2.4 Data-farm Performance Criteria
3.2 Parallelization of the Postcode Recognizer
3.2.1 Partitioning the postcode recognizer
3.2.2 Scaling the postcode recognizer
3.2.3 Performance achieved
3.3 Parallelization of the address verifier
3.3.1 Partitioning the address verifier
3.3.2 Scaling the address verifier
3.3.3 Address verification farms
3.3.4 Overall performance achieved
3.4 Meeting the Specification
7.3 Parallel Logic Implementation
7.4 Target Machine Implementation
7.4.1 Common implementation issues
7.5 'NOW' Implementation for Logic Debugging
7.6 Target Machine Implementations for Performance Tuning
7.7 Patterns and Templates
8.1.4 'Inter picture' quantization with motion estimation
8.1.5 Implementation of the parallel encoders
8.1.6 H.261 encoders without motion estimation
8.1.7 H.261 encoder with motion estimation
8.1.8 Edge data exchange
8.2 Case Study 2: H.263 Encoder/Decoder
8.2.1 Static analysis of H.263 algorithm
8.2.2 Results from parallelizing H.263
8.3 Case Study 3: 'Eigenfaces' - Face Detection
8.3.1 Background
10.2 Case Study 2: Model-based Coding
10.2.1 Parallelization of the model-based coder

Part IV Underlying Theory and Analysis
11.3 Gathering Performance Data
11.4 Performance Prediction Equations
11.5 Results
11.5.1 Prediction results
11.6 Simulation Results
11.7 Asynchronous Pipeline Estimate
11.8 Ordering Constraints
11.9 Task Scheduling
11.9.1 Uniform task size
11.9.2 Decreasing task size
11.9.3 Heuristic scheduling schemes
12.5 Establishing a Refresh Interval
12.6 Local Clock Adjustment
12.7 Implementation on the Paramid
Acronyms

AGP     Advanced Graphics Protocol
API     Application Programming Interface
APT     Analysis, Prediction, Template Toolkit
AR      Autoregressive
ASIC    Application Specific Integrated Circuits
ATR     Automatic Target Recognition
AWT     Abstract Window Toolkit
BSD     Berkeley Standard Distribution
BSP     Bulk Synchronous Parallel
CCITT   International Consultative Committee for Telephone and Telegraph
CDF     Cumulative Distribution Function
CDT     Categorical Data Type
CIF     Common Intermediate Format
COTS    Commercial Off-The-Shelf
CPU     Central Processing Unit
CSP     Communicating Sequential Processes
CSS     Central Synchronization Server
CWT     Continuous Wavelet Transform
DAG     Directed Acyclic Graph
DCOM    Distributed Component Object Model
DCT     Discrete Cosine Transform
DSP     Digital Signal Processor
DVD     Digital Versatile Disc
DWT     Discrete Wavelet Transform
FDDI    Fibre Distributed Data Interface
FFT     Fast Fourier Transform
FIFO    First-In-First-Out
FIR     Finite Impulse Response
FPGA    Field Programmable Gate Arrays
IBM     International Business Machines
IFFT    Inverse Fast Fourier Transform
IFR     Increasing Failure Rate
ISO     International Standards Organization
ITU     International Telecommunications Union
JIT     Just-in-Time
JPEG    Joint Photographic Experts Group
KLT     Karhunen-Loeve Transform
LAN     Local Area Network
LVCSR   Large Vocabulary Continuous-Speech Recognition
LWP     Light-Weight Process
MAC     Multiply Accumulate Operation
ME      Motion Estimation
MIMD    Multiple Instruction Multiple Data Streams
MIT     Massachusetts Institute of Technology
MMX     Multimedia Extension
MPEG    Motion Picture Experts Group
NUMA    Non-Uniform Memory Access
OCR     Optical Character Recognition
OF      Optical Flow
OOC     Object-oriented Coding
PC      Personal Computer
PCA     Principal Components Algorithm
PDF     Probability Distribution Function
PE      Processing Element
P-K     Pollaczek-Khintchine
POSIX   Portable Operating System-IX
PPF     Pipelined Processor Farms
PSNR    Peak Signal-to-Noise Ratio
PSTN    Public System Telephone Network
PVM     Parallel Virtual Machine
RISC    Reduced Instruction Set Computer
RMI     Remote Method Invocation
RPC     Remote Procedure Call
RTE     Run-time Executive
RTOS    Real-time Operating System
SAR     Synthetic Aperture Radar
SCSI    Small Computer System Interface
SIMD    Single Instruction Multiple Data Streams
SMP     Symmetric Multiprocessor
SNN     Semantic Neural Network
SPG     Series Parallel Graph
SSD     Sum-of-Squared-Differences
SSS     Safe Self-scheduling
STFT    Short-Time Fourier Transform
TM      Trademark
UTC     Universal Time Coordinated
w.r.t.  with respect to
WS      Wavelet Series
WWW     World Wide Web
Part I

Introduction and Basic Concepts
the design process. It appears that the potential offered by these additional design choices has led to an insistence by designers on obtaining maximum performance, with a consequent loss of generality. This is not surprising, because parallel solutions are typically investigated for the very reason that conventional sequential systems do not provide sufficient performance, but it ignores the benefits of generality which are accepted by sequential programmers. The sequential programming paradigm, or rather the abstract model of a computer on which it rests, was introduced by von Neumann [45] and has persisted ever since, despite the evident internal parallelism in most microprocessor designs (pipelined, vector, and superscalar [115]) and the obvious bottleneck if there is just one memory-access path from the central processing unit (CPU) for data and instructions alike.
1 Strictly, the term serial processing is more appropriate, as processing takes place on a serial machine or processor. The term sequential processing implies that the algorithms being processed are inherently sequential, whereas in fact they may contain parallel components. However, this book retains common usage and takes sequential processing to be synonymous with serial processing.
The model suits the way many programmers envisage the execution of their programs (a single step at a time), perhaps because errors are easier to find than when there is an interleaving of program order, as in parallel or concurrent programming paradigms.2
The Pipelined Processor Farms (PPF) design model, the subject of this book, can be applied in its simplest form to any Multiple Instruction Multiple Data streams (MIMD) [114] multiprocessor system.3 Single Instruction Multiple Data streams (SIMD) computer architecture, though current at the very-large scale integration (VLSI) chip-level, and to a lesser extent in multimedia-extension (MMX) microprocessor instructions for graphics support at the processor level [212], is largely defunct at the processor level, with a few honorable exceptions such as Cambridge Memory System's DAP and the MasPar series of machines [13].4 Of the two categories of MIMD machines, the primary concentration is upon distributed-memory machines, where the address space is partitioned logically and physically between processors. However, it is equally possible to logically partition shared-memory machines, where there is a global address space. The boundaries between distributed and shared-memory machines have dissolved in recent times [70], a point to be returned to in Chapter 13.
1.2 ORIGINS
The origins of the PPF design method arose in the late 1980s as a result of research carried out at the University of Essex to design and implement a real-time postcode/address recognition system for the British Post Office (see Chapter 3 for a description of the outcome of this process). Initial investigation of the image analysis and pattern recognition problems demonstrated that significant research and development was needed before any kind of working demonstrator could be produced, and that, of necessity, the first demonstrator would need to be a non-real-time software simulation running on a workstation. This provided the flexibility to enable easy experimental evaluation and algorithm updates using offline databases of address images,
2 Shared-memory machines can also relax read-write access across the processor set, ranging from strong to weak consistency, presenting a continuum of programming paradigms [259].

3 Categorisation of processors by the multiplicity of parallel data and instruction streams supported is a well-known extension of von Neumann's model [65].

4 Systolic arrays are also used for fine-grained signal processing [200], though largely again at the VLSI level. In systolic designs, data are pumped synchronously across an array of processing elements (PEs). At each step a different stage in processing takes place. Wavefront processors are an asynchronous version of the systolic architecture. Other forms of instruction-level parallelism are very-large instruction word (VLIW) DSPs (digital signal processors) and its variant, explicitly parallel instruction computing (EPIC) [319]. The idea of transferring SIMD arrays such as the DAP to VLSI has also been mooted; an experimental and novel SIMD VLSI array chip is described in [66].
and also a starting point for consideration of real-time implementation issues. In short, solving the problem at all was very difficult; generating a real-time solution (requiring a throughput of 10 envelope images/second, with a latency of no more than 8 seconds for processing each image) introduced an additional dimension of processing speed which was beyond the bounds of available workstations.

A literature survey of the field of parallel processing at that time showed that numerous papers had been published on parallelization of individual image processing, image coding and image analysis algorithms (see, e.g., [362]), many inspired by the success of the transputer [136]. Most of these papers were of limited generality however, since they reported bespoke parallelization of specific well-known algorithms such as 2-D filters, FFTs, DCTs, edge detectors, component labeling, Hough transforms, wavelets, segmentation algorithms, etc. Significantly, examination of many of these customized parallel algorithms revealed, in essence, the same solution; that of the single, demand-based data farm.

Practical image analysis and pattern recognition applications, however, typically contain a number of algorithms implemented together as a complete system. Like the postal address reading application, the CCITT H.261 encoder/decoder algorithm [49] is also a good illustration of this characteristic, since it includes algorithms for discrete cosine transformation (DCT), motion estimation and compensation, various filters, quantizers, variable length coding, and inverse versions of several of these algorithms. Very few papers addressed the issue of parallelizing complete systems, in which individual algorithm parallelization could be exploited as components. Therefore, a clue to an appropriate generic parallel architecture for embedded applications was to view the demand-based processor farm as a component within a higher-level system framework.
From our point of view, parallel processing was also simply a means to an end, rather than an end in itself. Our interest was in developing a general system design method for MIMD parallel processors, which could be applied after or during the initial iterative algorithm development phase. Too great a focus on performance at the expense of generality would inevitably have resulted in both implementations and design skills that rapidly became obsolete. We therefore aimed to support the early, architecture-independent stages of the design process, where parallelization of complete image processing applications is considered, by a process analogous to stepwise refinement in sequential program design [312, 335]. Among the advantages of the PPF design methodology which resulted are the following:

Upper bound (idealized) throughput scaling of the application is easily defined, and aspects of the application which limit scaling are identified.

Input/output latency is also defined and can be controlled.

Design effort is focused on each performance bottleneck of each pipeline stage in turn, by identifying the throughput, latency, and scalability.
Amdahl's law [15, 161 is the Ohm's law of parallel computing It predicts an
upper bound t o the performance of systems which contain both parallelization and inherently sequential components Amdahl's law states that the scaling performance of a parallel algorithm is limited by the number of inherently sequential operations in that algorithm Consider a problem where a fraction
f of the work must be performed sequentially The speed-up, S, possible from
a machine with N processors is:
I f f = 0.2 for example (i.e 20% of the algorithm is inherently sequential), then the maximum speedup however many processors are added is 5
As will be shown in later chapters, applying Amdahl's law to multi-algorithm embedded systems demonstrates that the scaling which can be achieved is largely defined, not by the number of processors used, but by any residual sequential elements within the complete application algorithm. Thus effective system parallelization requires a method of minimizing the impact of residual sequential code, as well as of parallelizing the bulk of the application algorithm. In the PPF design methodology, pipelining is used to overlap residual sequential code execution with other forms of parallelism.
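The bound is simple to evaluate numerically; the following minimal C sketch (an illustration only, not taken from any toolkit described in this book) computes it for a range of processor counts:

    #include <stdio.h>

    /* Amdahl's law: upper-bound speedup S for N processors when a
     * fraction f of the work is inherently sequential. */
    double amdahl_speedup(double f, int n)
    {
        return 1.0 / (f + (1.0 - f) / n);
    }

    int main(void)
    {
        int n_values[] = { 2, 8, 64, 1024 };
        /* f = 0.2: S approaches, but never exceeds, 1/f = 5. */
        for (int i = 0; i < 4; i++)
            printf("N = %4d  S = %.2f\n", n_values[i],
                   amdahl_speedup(0.2, n_values[i]));
        return 0;
    }

For f = 0.2 this prints speedups of 1.67, 3.33, 4.71 and 4.98: rapid early gains, then saturation at the sequential ceiling of 5.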
diameter is also restricted. The commercial off-the-shelf (COTS) processors used within such machines will outstrip the available interconnect bandwidth if combined in large configurations, since such processors were not designed with modularity in mind. To avoid this problem in PPF, a pipeline is partitioned into a number of stages, each one of which may be parallel. PPF is primarily aimed at continuous-flow systems in the field of signal processing, image processing, and multimedia in general.

A continuous-flow system is one in which data never cease to arrive, for example a radar processor which must always monitor air traffic. These systems frequently need to meet a variety of throughput, latency, and output-ordering specifications. It becomes necessary to be able to predict performance, and to provide a structure which permits performance scaling, by incremental addition of processors and/or transfer to higher performance hardware once the initial design is complete. The hard facts of achievable performance in a parallel system are further discussed in Section 2.4.
There are two basic or elementary types of pipeline components: asynchronous and synchronous, though many pipelined systems will contain some segments of each type. PPF caters for any type of pipeline, whether synchronous, asynchronous or mixed; their performance characteristics are discussed in detail in Section 2.2. Pipeline systems are a natural choice for some synchronous applications. For example, a systolic pipeline-partitioning methodology exists for signal-processing algorithms with a regular pattern [237]. Alternatively, [8] notice that there is an asynchronous pipeline structure to the mind's method of processing visual input which also maps onto computer hardware. If all information flow is in the forward direction [8], then the partitions of the pipeline mirror the peripheral, attentive, and cognitive stages of human vision [232]. The CMU Warp [18], the Cytocomputer [341], PETAL and VAP [56] are early examples of machines used in pipelined fashion for image processing.5 Input to the pipeline either takes the form of a succession of images grouped into a batch (medical slides, satellite images, video frames and the like) or raster-scan, in which a stream of pixels is input in the same order as a video camera scans a scene, that is in horizontal, zig-zag fashion. PPF generalizes the pipeline away from bespoke hardware and away to some extent from regular problems. Examples of applicable irregular, continuous-flow systems can be found in vision [50] (see Chapter 3), radar [97], speech-recognition processing [133], and data compression [52]. Chapters 8 and 9 give further detailed case studies where PPF has been consciously applied.
PPF is very much a systems approach to design; that is, it considers the entire system before the individual components. Another way of saying this is that PPF is a top-down as opposed to a bottom-up design methodology. For some years it has been noted [214] that many reported algorithm examples merely form a sub-system of a vision-processing system, while it is a complete system that forms a pipeline. Various systems approaches to pipeline implementation are then possible. With a problem-driven approach it may be difficult to assess the advantages and disadvantages of alternative architectures for any one stage of a problem. However, equally an architecture-driven design ties a system down to a restricted range of computer hardware. In PPF, the intention is to design a software structure that, when suitably parameterized, can map onto a variety of machines. Looking aside to a different field, Oracle has ported its relational database system to a number of apparently dissimilar parallel computers [337], including the Sequent Symmetry shared-memory machine and the nCube2 MIMD message-passing computer. Analogously to the database abstract machine, the software pipeline is a flexible structure for the PPF problem domain.

5 The common idea across these machines is to avoid the expense of a 2D systolic array by using a linear systolic array.
Having settled on a software pipeline, there are various forms of exploitable parallelism to be considered. The most obvious form of parallelism is temporal multiplexing, whereby several complete tasks are processed simultaneously, without decomposing individual tasks. However, simply increasing the degree of temporal multiplexing, though it can improve the mean throughput, does not change the latency experienced by an individual task. To reduce pipeline traversal latency, each task must be decomposed to allow the component parts to experience their latency in parallel. Geometric parallelism (decomposing by some partition of the data) or algorithmic parallelism (decomposition by function) are the two main possibilities available for irregularly structured code on medium-grained processors.6 After geometric decomposition, data must be multiplexed by a farmer process across the processor farm, which is why in PPF data parallelism is alternatively termed geometric multiplexing. When a processor farm utilizes geometric multiplexing, it is called a data farm, and certainly the term data farm is more common in the literature.7 This book does not include many examples of algorithmic parallelism, not by intent but because the practical opportunities of exploiting this form of parallelism are limited. An early analysis [277] in the field of single-algorithm image processing established both the difficulty of finding suitable algorithmic decompositions and the limited speed-up achievable by functional decomposition. However, algorithmic parallelism does have a role in certain applications, which is why it is not discounted in PPF. For example, pattern matching may employ a parallel search [202], a form of OR-parallelism, whereby alternative searches take place though only the results of successful searches are retained.8
6 Dataflow computers [340] have been proposed as a way of exploiting the parallelism inherent in irregularly structured code (i.e. code in which there are many decision points resulting in branching), but though there are research processors [79], no commercial dataflow computer has ever been produced.

7 The term data parallelism is an alternative to geometric parallelism, but this term has the difficulty that data parallelism is associated with parallel decomposition of regular code (i.e. code with few branch points) by a parallel compiler.

8 Divide-and-conquer search algorithms may be termed AND-parallelism, as the results of parallel searches may be combined through an AND-tree [294].
Bringing together the preceding discussion, it can be stated that:

1. A data set can be subdivided over multiple processors (data parallelism or geometric multiplexing).

2. The algorithm can be partitioned over multiple processors (algorithmic parallelism).

3. Multiple processors can each process one complete task in parallel (processor farming or temporal multiplexing).

4. The algorithm can be partitioned serially over multiple processors (pipelining), pipelining being an instance of algorithmic parallelism.

5. The four basic approaches outlined above can be combined as appropriate.
The field of low-level image processing [74] illustrates how these forms of parallelism can be applied within a processor farm:

Geometric multiplexing: An example of geometric multiplexing is where a frame of image data is decomposed onto a grid of processors. Typical low-level image-processing operations such as convolution and filtering can then be carried out independently on each sub-image, requiring reference only to the four nearest-neighbor processors for boundary information. To adapt such operations to a processor farm, the required boundary information for each processor can be included in the original data packet sent to the processor.

Algorithmic parallelism: In the case of algorithmic parallelism, different parts of an algorithm which are capable of concurrent execution can be farmed to different processors; for example, the two convolutions with horizontal and vertical masks could be executed on separate processors concurrently in the case of a Sobel edge detector [290, 75]. The advantage of a processor farm in this context is that no explicit synchronization of processors is required; however, the algorithm itself normally defines explicitly the possible degree of parallelism (i.e. incremental scaling is not possible).
Temporal multiplexing: Applying each of a sequence of images to a separate processor does not speed up the time to process an individual image, but enables the average system throughput to be scaled up in direct proportion to the number of processors used. The approach is limited by the allowable latency between the input and output of the system, which is not reduced by temporal parallelism.
Pipelining: Pure pipelining has the same effect as temporal multiplexing in speeding up overall application throughput without reducing the latency. In the example of Fig. 1.1a, a sequential algorithm taking 10 seconds per task, when split into an unbalanced four-stage pipeline with per-stage times of 4, 4, 4 and 3 seconds, improves its throughput to 0.25 tasks/second (limited by the slowest pipeline stage), a speedup of 2.5. Note however that the latency (delay between task input and task output) increases from 10 seconds for the sequential algorithm to 15 seconds (3 x 4 seconds + 3 seconds for the final stage) for the unbalanced pipeline shown in Fig. 1.1a.
The role of pipelining within the PPF design philosophy is to increase throughput and reduce latency by allowing necessarily independent components of an application (some of which may be inherently sequential) to be overlapped.

By combining the techniques described above, and mapping a PPF architecture onto the pipeline of stages which comprises any embedded application, both the throughput and the latency of the application can be scaled. Fig. 1.1b illustrates the effect of using temporal multiplexing alone to achieve throughput scaling: when the throughput of each pipeline stage is matched at 1 task/second, a speedup of 10 is achieved with the same latency as the original sequential algorithm. Of course, exactly the same throughput scaling (with unchanged latency) could be achieved using a single processor farm, with each processor executing a copy of the complete application. The reason for using a pipeline instead is to break down the overall application into its sub-components, so that data or algorithmic parallelism can be exploited to reduce latency as well as increase throughput.

Finally, Fig. 1.1c illustrates the exploitation of data or algorithmic parallelism in each pipeline stage instead of temporal multiplexing: in this case, the same speedup of 10 is achieved, but with a reduction of latency to 4 seconds.
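The arithmetic behind these comparisons is easily captured in a few lines of C. The sketch below is an illustration of ours, using the Fig. 1.1 stage times and the same idealized assumption that data or algorithmic parallelism divides a stage's service time exactly, with communication costs ignored:

    #include <stdio.h>

    #define STAGES 4

    int main(void)
    {
        double stage_time[STAGES] = { 4.0, 4.0, 4.0, 3.0 }; /* Fig. 1.1a */
        int workers[STAGES]       = { 4, 4, 4, 3 };         /* Fig. 1.1c */

        double slowest = 0.0, latency = 0.0;
        for (int i = 0; i < STAGES; i++) {
            /* data/algorithmic parallelism divides the stage time;
             * temporal multiplexing would raise throughput but leave
             * per-task latency unchanged */
            double t = stage_time[i] / workers[i];
            latency += t;
            if (t > slowest)
                slowest = t;
        }
        printf("throughput = %.2f tasks/s, latency = %.1f s\n",
               1.0 / slowest, latency);
        return 0;
    }

With one worker per stage this reproduces Fig. 1.1a (0.25 tasks/s, 15 s latency); with the allocation shown, every stage's service time falls to 1 second, reproducing Fig. 1.1c (1 task/s, 4 s latency, a speedup of 10).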
Appendix A.1 below illustrates how basic profiling data, extracted from execution of a sequential image coding algorithm, can be used to guide the PPF design process to achieve a scalable parallel implementation of the algorithm with analytically defined performance bounds.
Fig. 1.1 Pipeline configurations: (a) simple pipeline, per-stage latencies 4 s, 4 s, 4 s and 3 s, throughput = 0.25 jobs/s, latency = 15 s; (b) temporally multiplexed pipeline; (c) pipeline with per-stage data/algorithmic parallelism.

1.5 CONCLUSIONS

The primary requirement in parallelizing embedded applications is to meet a particular specification for throughput and latency. The Pipeline Processor Farm (PPF) design model maps conveniently onto the software structure of many continuous data flow embedded applications, provides incrementally scalable performance, and enables upper-bound scaling performance to be easily estimated from profiling data generated by the original sequential implementation. Using the PPF model, sequential sub-components of the complete application are identified, from which data or algorithmic parallelism can be easily extracted. Where neither of these forms of parallelism is exploitable (i.e. the residual sequential components identified in Amdahl's law), temporal multiplexing can often be used to match pipeline throughput without reducing latency. Each pipeline stage will then normally map directly onto the major functional blocks of the software implementation, written in any procedural language. Furthermore, the exact degree of parallelization of each block required to balance the pipeline can be determined directly from its sequential execution time.
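That final calculation is direct: given profiled sequential times for each block and a target per-task service time, the worker count per stage is the ceiling of their ratio. A small illustrative sketch in C (the stage times here are hypothetical, not profiled figures from any case study in this book):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* hypothetical sequential execution times per stage (seconds) */
        double stage_time[] = { 0.9, 7.8, 1.3 };
        double target = 0.9;  /* per-task service time to be matched */

        /* workers needed so each stage's service time <= target */
        for (int i = 0; i < 3; i++)
            printf("stage %d: %d worker(s)\n", i + 1,
                   (int)ceil(stage_time[i] / target));
        return 0;
    }

This allocates 1, 9 and 2 workers respectively, balancing the pipeline at roughly 0.9 seconds per task.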
Appendix

A.1 SIMPLE DESIGN EXAMPLE: THE H.261 DECODER
Image sequence coding algorithms are well known to be computationally intensive, due in part to the massive continuous input/output required to process up to 25 or 30 image frames per second, and in part to the computational complexity of the underlying algorithms. In fact, it was noted (in 1992) [380] that it was only just possible to implement the full H.261 encoder algorithm for quarter-CIF (176 x 144 pixels) images in real time on DSP chips such as the TMS320C30. In this case study, a non-real-time H.261 decoder algorithm, developed for standards work at BT Laboratories and written in C, was parallelized to speed up execution on an MIMD transputer-based Meiko Computing Surface. Results presented are based upon execution times measured when the H.261 algorithm was run on sequences of 352 x 288 pixel common intermediate format (CIF) images.
Fig. A.1 shows a simplified representation of the H.261 decoder architecture. The decoder consists of a 3-stage pipeline of processes, with feedback of the previous picture applied around the second stage. Feedback within a pipeline is a key constraint on parallelism, since it restricts the degree to which temporal multiplexing can be exploited: in the H.261 decoder, the reconstructed previous frame is used to construct the current frame from the decoded difference picture.
Table A.1 summarizes the most computationally intensive functions within the BT H.261 decoder, and is derived from statistics generated by the Sun profiling tool gprof [138] while running the decoder on 30 image frames of data on a Sparc2 processor. To simplify interpretation, processing times have been normalized for one frame of data. The 10 functions listed in the table constitute 99.2% of total execution time.

Fig. A.1 Simplified representation of the H.261 decoder execution timing.

Table A.1 Summary execution profile statistics for the H.261 decoder sequence. (Columns: function name; normalized execution time in seconds. The individual entries are not reproduced here.)

Program execution of the H.261 decoder can be broken down on a per-frame basis into a pipeline of three major components:

T1 frame initialization (functions 1 and 2 in Table A.1);
T2 frame decoder loop (functions 3-8 in Table A.1); and

T3 frame output (functions 9 and 10 in Table A.1).

The first and last of these components are executed once for each image frame, whereas the middle component contains considerable data parallelism and involves a loop executed 396 times (once for each 16 x 16 pixel macroblock making up a CIF picture). It is therefore clear that considerable scope exists for speeding up the middle stage of the pipeline by exploiting data parallelism. Temporal multiplexing cannot be utilized because each image frame is reconstructed by means of a difference picture added to the motion-compensated previous frame (although it would be possible to partially overlap the decoding of consecutive frames). Since pipeline stages T1 and T3 are inherently sequential, direct application of Amdahl's law to the data in Fig. A.1 shows that f = 0.22, giving a maximum speedup of only 4.55. An asymptotic approach to this speedup could be obtained by parallelizing the decoder using a single processor farm, with the data-parallel component T2 farmed onto worker processors, and the remaining code executed on the master processor.

The upper-bound predicted speedup for the PPF is presented graphically in Fig. A.2 and may be represented theoretically by the following piecewise approximation:

    S = (T1 + T2 + T3) / (T2/(n - 2)),   if T2/(n - 2) >= T3
    S = (T1 + T2 + T3) / T3,             otherwise

where the first and last stages of the PPF contain a single processor, the second processor farm stage contains n - 2 processors, and T1-T3 are the execution times of the three stages of the pipeline shown in Fig. A.1. As the throughput for a PPF is defined solely by the slowest pipeline stage, its speedup is given by the ratio of sequential application execution time to the execution time for this stage alone (this illustrates the advantage of the pipeline in overlapping execution of residual sequential components). Where (as in this case) the slowest stage is perfectly parallelizable (i.e. it contains no residual sequential elements and thus f = 0 in Amdahl's law), linear speedup is obtained up to the point where the scaled stage is no longer the slowest. The first equation defines this case, where the performance increases linearly as the number of workers in the processor farm increases (S is proportional to n); this continues until the execution time for the processor farm drops below that of the next slowest stage, T3 in this case. The second equation then defines the fixed scaling achieved for any further increase in processor numbers (S is fixed and independent of n).
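The same piecewise bound can be computed numerically. In the C sketch below the stage times are illustrative values only, normalized so that T1 + T2 + T3 = 1 and chosen to be consistent with f = 0.22 and linear scaling up to six workers; they are not the profiled H.261 figures:

    #include <stdio.h>

    /* Upper-bound speedup of a 3-stage PPF: one processor in the first
     * and last stages, n - 2 workers farming the middle stage. */
    double ppf_speedup(double t1, double t2, double t3, int n)
    {
        double farm = t2 / (n - 2);       /* farmed middle-stage time */
        double slowest = farm;
        if (t1 > slowest) slowest = t1;
        if (t3 > slowest) slowest = t3;
        return (t1 + t2 + t3) / slowest;
    }

    int main(void)
    {
        double t1 = 0.09, t2 = 0.78, t3 = 0.13; /* illustrative only */
        for (int n = 3; n <= 12; n++)
            printf("n = %2d  S = %.2f\n", n, ppf_speedup(t1, t2, t3, n));
        return 0;
    }

Speedup grows linearly until n = 8 (six workers), after which it is pinned at (T1 + T2 + T3)/T3, about 7.7 for these values, however many workers are added.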
It is assumed that the processor farm implementing the middle stage of the pipeline receives its work packets directly from the first stage and passes its results directly to the third stage, as in the topology of Fig. A.3, where an implementation with five worker processors is shown. The analysis is of course idealized, ignores communication overheads, and assumes static task characteristics. As can be seen from Fig. A.2, the performance is predicted to scale linearly up to six workers (8 processors total).

Fig. A.2 Idealized and actual speedup for the H.261 decoder.
Fig. A.3 PPF topology for a 3-stage pipeline with 5 workers in the second stage.
Actual scaling performance results are also presented in Fig. A.2, for two different practical cases. In both cases, the scaling performance is less than that predicted in the idealized graph, due to communication overheads being neglected, but the general shape of the graphs is in other respects as predicted. The maximum speedup obtained (5.59) exceeds the limit predicted by Amdahl's law, thus demonstrating the advantage which the PPF has compared with a single processor farm implementation. In practice, transputer communication links do not provide sufficient bandwidth for real-time communication of H.261 CIF picture data structures, and therefore communication overheads substantially limit the performance scaling which can be achieved in a transputer-based system. On the Analog Devices Sharc family of processors with six link ports, real-time parallel processing of image sequences is far more practicable. For example, the ADSP-21160 [14], running at 100 MHz, supports 'glueless' multiprocessing and floating point like the transputer, but now is superscalar, with a maximum of six issues per cycle.
In the first implementation, each image was simply subdivided into a number of horizontal strips defined by the number of processor-farm workers, in line with the idealized model of data parallelism presented earlier. As can be seen from Fig. A.4(a), this results in a series of black strips in the reconstructed image, where data adjacent to each worker's sub-image were not available for constructing the motion-compensated previous image. In the second implementation, additional rows of macroblocks at the boundaries of the sub-image processed by each farm worker were exchanged in a second communication phase between the master and worker processors in the processor farm, after the difference image had been decoded. This enables the full motion-compensated previous image to be reconstructed, as shown in Fig. A.4(b), but results in an additional communication overhead, which decreases scaling performance compared with the case where edge data are not exchanged.
Fig. A.4 Sample image output by the parallel H.261 decoder with 5 workers: (a) without edge data exchange and (b) with edge data exchange.
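The strip decomposition and its boundary rows can be illustrated with a short C sketch (hypothetical helper code of ours, not the BT decoder itself); it shows only the partitioning arithmetic, with the halo rows standing in for the data carried by the second, edge-exchange communication phase:

    #include <stdio.h>

    #define MB_ROWS 18   /* CIF: 288 lines / 16 = 18 macroblock rows */

    /* Macroblock rows held by one worker: its own strip plus 'halo'
     * extra rows each side for motion-compensated reconstruction. */
    static void strip_bounds(int worker, int n_workers, int halo,
                             int *first, int *last)
    {
        int base = MB_ROWS / n_workers, rem = MB_ROWS % n_workers;
        int lo = worker * base + (worker < rem ? worker : rem);
        int hi = lo + base + (worker < rem ? 1 : 0) - 1;
        *first = (lo - halo < 0) ? 0 : lo - halo;
        *last  = (hi + halo > MB_ROWS - 1) ? MB_ROWS - 1 : hi + halo;
    }

    int main(void)
    {
        int first, last;
        for (int w = 0; w < 5; w++) {   /* five workers, as in Fig. A.3 */
            strip_bounds(w, 5, 1, &first, &last);
            printf("worker %d: macroblock rows %d-%d\n", w, first, last);
        }
        return 0;
    }

With halo = 0, pixels just outside each strip are unavailable and the black boundary strips of Fig. A.4(a) appear; with halo = 1 the full motion-compensated reference is available, at the cost of communicating the extra rows.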
2

Basic Concepts
Consider automatic target recognition (ATR) of aircraft found by Synthetic Aperture Radar (SAR) ...
... design features of a PPF. There is: a single flow of processing control through ...
... a need to coordinate the flow of data across the hierarchy of ATR algorithms, ...
Having arrived at a design for (say) a DSP processor at the computation layer, why consider an FPGA alternative? Why, in fact, partition an application between a communication structure and computation layer? The key problem a parallel system designer must face is how to make a system scalable. This is not simply because larger (or smaller) problems can be tackled solely by adding hardware in an incremental fashion, without otherwise changing the design, important though a modular design remains. Equally important is that uniprocessor performance increases in proportion to the number of transistors on a microchip, which has been observed to double approximately every eighteen months (the well-known Moore's law1). Therefore, a design tied to a specialized parallel machine may well be rapidly overtaken in terms of price and performance by a uniprocessor implementation. The principal reason for the shift to COTS hardware is to exploit the economies of scale that arise within the uniprocessor market, which lead to exponential gains in performance. In other words, by exchanging the computation hardware within the design, which can also be modular, a design is made doubly scalable, and hopefully future-proof. As the life cycle of a typical commercial microprocessor is less than five years, while the life time of many embedded products is much longer (e.g. an avionics system has a lifetime greater than thirty years), system (or code) portability [23] is an important method of amortizing the investment in the original embedded software.
A PPF design is a pipeline of processor farms. The essence of a processor farm within PPF is one central farmer, manager, or controller process, and a set of worker or slave processes spread across the processor farm. Notice that there is no insistence in PPF on having a single worker process per processor, though in fact our farm template design (Chapter 7) does not exploit parallel slackness [329] by having more than one process to a processor. In a shared-memory MIMD machine, worker threads [189] replace worker processes, a thread being a single line of instruction control existing in a shared or global address space. The role of the farmer is not simply to coordinate the activity of the workers but additionally to pass partially-processed work onto the next stage of processing. By introducing modularity, each module being a farm, it becomes possible to cope with heterogeneous hardware, and separately to scale each farm as larger problems or versions of the original problem are tackled.
PPF is appropriate for all applications with continuous data input/output, a characteristic typical of soft, real-time, embedded systems.2 However, PPF is by no means a panacea for all such embedded systems, and in Chapter 10,

1 Moore's law is named after Gordon Moore, co-founder and Chairman of Intel, who discovered the law in 1965.

2 Soft, real-time systems, as opposed to hard, real-time systems, are those in which responsiveness to deadlines can be relaxed. Hard, real-time systems [216] usually involve the control of machinery, such as in fly-by-wire avionics and industrial manufacturing control, and are not the subject of this book.