object request broker (HORBA) when the support of small-grain parallelism is needed.
Our most recent developments in MultiFlex are mostly focused on the support of the streaming programming model, as well as its interaction with the client–server model. SMP subsystems are still of interest, and they are becoming increasingly well supported commercially [14,21]. Moreover, our focus is on data-intensive applications in multimedia and communications. For these applications, our focus has been primarily on streaming and client–server programming models, for which explicit communication-centric approaches seem most appropriate.
This chapter will introduce the MultiFlex framework, specialized in supporting the streaming and client–server programming models. However, we will focus primarily on our recent streaming programming model and mapping tools.
7.2.1 Iterative Mapping Flow
MultiFlex supports an iterative process, using initial mapping results to guide the stepwise refinement and optimization of the application-to-platform mapping. Different assignment and scheduling strategies can be employed in this process.
An overview of the MultiFlex toolset, which supports the client–server and streaming programming models, is given in Figure 7.2. The design methodology requires three inputs:
FIGURE 7.2
MultiFlex toolset overview.
• The application specification—the application can be specified as a set of communicating blocks; it can be programmed using the streaming model or client–server programming model semantics.
• Application-specific information (e.g., quality-of-service requirements, measured or estimated execution characteristics of the application, data I/O characteristics, etc.).
• The abstract platform specification—this information includes the main characteristics of the target platform which will execute the application.
An intermediate representation (IR) is used to express the high-level application in a language-neutral form. It is translated automatically from one or more user-level capture environments. The internal structure of the application capture is highly inspired by the Fractal component model [23]. Although we have focused mostly on the IR-to-platform mapping stages, we have experimented with graphical capture from a commercial toolset [7], and a textual capture language similar to StreamIt [3] has also been experimented with.
In the MultiFlex approach, the IR is mapped, transformed, and scheduled; finally, the application is transformed into targeted code that can run on the platform. There is a flexibility/performance trade-off between what can be calculated and compiled statically, and what can be evaluated at runtime. As shown in Figure 7.2, our approach is currently implemented using a combination of both, allowing a certain degree of adaptive behavior, while making use of more powerful offline static tools when possible. Finally, the MultiFlex visualization and performance analysis tools help to validate the final results or to provide information for the improvement of the results through further iterations.
7.2.2 Streaming Programming Model
As introduced above, the streaming programming model [1] has been designed for use with data-dominated applications. In this computing model, an application is organized into streams and computational kernels to expose its inherent locality and concurrency. Streams represent the flow of data, while kernels are computational tasks that manipulate and transform the data. Many data-oriented applications can easily be seen as sequences of transformations applied on a data stream. Examples of languages based on the streaming computing model are ESTEREL [4], Lucid [5], StreamIt [3], and Brook [2]. Frameworks for stream computing visualization are also available (e.g., Ptolemy [6] and Simulink [7]).
In essence, our streaming programming model is well suited to a distributed-memory, parallel architecture (although mapping is possible on shared-memory platforms), and favors an implementation using software libraries invoked from the traditional sequential C language, rather than proposing language extensions or a completely new execution model.
The entry to the mapping tools uses an XML-based IR that describes the application as a topology with semantic tags on tasks. During the mapping process, the semantic information is used to generate the schedulers and all the glue necessary to execute the tasks according to their firing conditions.
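As an illustration, a small fragment of such a topology description might look like the following; the element and attribute names are hypothetical, since the actual MultiFlex schema is not shown here:

```xml
<!-- Hypothetical IR fragment: element and attribute names are illustrative,
     not the actual MultiFlex schema. -->
<application name="example">
  <block id="B1" semantics="simple-dataflow"/>
  <block id="B2" semantics="sync-client"/>
  <block id="B3" semantics="server"/>
  <channel from="B1.out" to="B2.in" tokenType="int16" maxTokens="64"/>
  <channel from="B2.out" to="B3.in" tokenType="int16" maxTokens="32"/>
</application>
```

In such a representation, the semantic tag on each block is what the mapping tools would consult to select firing conditions and generate the per-processor schedulers.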
In summary, the objectives of the streaming design flow are:
• To refine the application mapping in an iterative process, rather than having a one-way top-down code generation
• To support multiple streaming execution models and firing conditions
• To support both restricted synchronous data-flow and more dynamic data-flow blocks
• To be controlled by the user to achieve the mechanical transformations, rather than making decisions for the user
We first present the mapping flow in Section 7.3, and at the end of the section, we will give more details on the streaming programming model.
7.3 MultiFlex Streaming Mapping Flow
The MultiFlex technology includes support for a range of streaming programming model variants. Streaming applications can be used alone or in interoperation with client–server applications. The MultiFlex streaming tool flow is illustrated in Figure 7.3. The different stages of this flow will be described in the next sections.

The application mapping begins with the assignment of the application blocks to the platform resources. The IR transformations consist mainly of splitting and/or clustering the application blocks; they are performed for optimization purposes (e.g., memory optimization). The transformations also imply the insertion of communication mechanisms (e.g., FIFOs and local buffers).
The scheduling defines the sharing of a processor between several blocks of the application. Most of the IR mapping, transforming, and scheduling is realized statically (at compilation time), rather than dynamically (at runtime).

The methodology targets large-scale multicore platforms including a uniform layered communication network based on STMicroelectronics' network-on-chip (NoC) backbone infrastructure [18] and a small number
of H/W-based communication IPs for efficient data transfer (e.g., stream-oriented DMAs or message-passing accelerators [9]). Although we consider our methodology to be compatible with the integration of application-specific hardware accelerators using high-level hardware synthesis, we are not targeting such platforms currently.

FIGURE 7.3
MultiFlex streaming tool flow.

The mapping flow is organized into three abstraction levels:
• The partitioning level—at this level, the application blocks are grouped in partitions; each partition will be executed on a PE of the target architecture. PEs can be instruction-set programmable processors, reconfigurable hardware, or standard hardware.
• The communication level—at this level, the scheduling and the communication mechanisms used on each processor between the different blocks forming a partition are detailed.
• The target architecture level—at this level, the final code executed on the targeted platforms is generated.
Table 7.2 summarizes the different abstractions, models, and tools provided by MultiFlex in order to map complex data-oriented applications onto multiprocessor platforms.

TABLE 7.2
Abstraction, Models, and Tools in MultiFlex

Level                      Model                                                  Tool
Application level          Set of communicating blocks                            Textual or graphical front-end
Partition level            Set of communicating blocks and directives             MpAssign
                           to assign blocks to processors
Communication level        Set of communicating blocks and required               MpCompose
                           communication components
Target architecture level  Final code loaded and executed on the target platform  Component-based compilation back-end
7.3.2 Application Functional Capture
The application is functionally captured as a set of communicating blocks. A basic (or primitive) block consists of a behavior that implements a known interface. The implementation part of the block uses streaming application programming interface (API) calls to get input and output data buffers to communicate with other tasks. Blocks are connected through communication channels (in short, channels) via their interfaces. The basic blocks can be grouped in hierarchical blocks, or composites.
The main types of basic blocks supported in the MultiFlex approach are:

• Simple data-flow block: This type of block consumes and produces tokens on all inputs and outputs, respectively, when executed. It is launched when there is data available at all inputs, and there is sufficient free space in downstream components for all outputs to write the results.
• Synchronous client–server block: This block needs to perform one or many remote procedure calls before being able to push data on the output interface. It must therefore be scheduled differently than the simple data-flow block.
• Server block: This block can be executed once all the arguments of the call are available. Often this type of block can be used to model a H/W coprocessor.
• Delay memory: This type of block can be used to store a given number of data tokens (an explicit state).
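For the simple data-flow case, the firing rule above amounts to a check over all ports. The sketch below is a minimal illustration in C; the per-port bookkeeping structures are assumptions for illustration, not MultiFlex's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-port bookkeeping (illustrative, not the real structures). */
typedef struct { int available; int rate; } in_port_t;   /* tokens present vs. needed   */
typedef struct { int free_space; int rate; } out_port_t; /* space left vs. max produced */

/* A simple data-flow block fires only when every input holds a full token
 * set and every downstream buffer can absorb the maximum output. */
static bool can_fire(const in_port_t *in, int n_in,
                     const out_port_t *out, int n_out) {
    for (int i = 0; i < n_in; i++)
        if (in[i].available < in[i].rate) return false;
    for (int i = 0; i < n_out; i++)
        if (out[i].free_space < out[i].rate) return false;
    return true;
}
```

Checking downstream space up front is what makes the subsequent execution nonblocking: once fired, the block is guaranteed to run to completion.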
Figure 7.4 gives the graphical representation of a streaming application capture which interacts with a client–server application. Here, we focus mostly on streaming applications.
FIGURE 7.4
Application functional capture.
From the point of view of the application programmer, the first step is to split the application into processing blocks with buffer-based I/O ports. User code corresponding to the block behavior is written using the C language. Using component structures, each block has its private state, and implements a constructor (init), a work section (process), and a destructor (end). To obtain access to I/O port data buffers, the blocks have to use a predefined API. A run-to-completion execution model is proposed as a compromise between programming and mapping flexibility. The user can extend the local schedulers to allow the local control of the components, based on application-specific control interfaces. The dataflow graph may contain blocks that use client–server semantics, with application-specific interfaces, to perform remote object calls that can be dispatched to a pool of servers.
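A block of this shape could be sketched as follows in C; the stream_* calls and the port structure stand in for the (unspecified) predefined API, so all names here are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>

#define PORT_CAP 64

/* Toy port: a flat token buffer plus a count (illustrative only). */
typedef struct {
    int16_t buf[PORT_CAP];
    int     count;                 /* tokens available / produced */
} port_t;

/* Stand-ins for the predefined streaming API (not the real MultiFlex calls). */
static int16_t *stream_get_in(port_t *p, int *n) { *n = p->count; return p->buf; }
static int16_t *stream_get_out(port_t *p, int max) { (void)max; return p->buf; }
static void stream_put_out(port_t *p, int n) { p->count = n; }

/* Private state of the block: each component instance owns its own. */
typedef struct {
    int16_t gain;
    port_t *in, *out;
} scale_t;

/* Constructor (init). */
static void scale_init(scale_t *s, port_t *in, port_t *out) {
    s->gain = 2; s->in = in; s->out = out;
}

/* Work section (process): run-to-completion, fired only when input data and
 * output space are both available, so the body never blocks. */
static void scale_process(scale_t *s) {
    int n;
    int16_t *in  = stream_get_in(s->in, &n);
    int16_t *out = stream_get_out(s->out, n);
    for (int i = 0; i < n; i++)
        out[i] = (int16_t)(in[i] * s->gain);
    stream_put_out(s->out, n);
}

/* Destructor (end). */
static void scale_end(scale_t *s) { (void)s; }
```

Because the API hands out buffer pointers, the block reads and writes tokens in place, which matches the copy-avoiding, pointer-based API described later in Section 7.4.4.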
2. Communication volume: the size of data exchanged on this channel.
3. User assignment directives. Three types of directives are supported by the tool:
   a. Assign a block to a specific processor.
   b. Assign two blocks to the same processor (can be any processor).
   c. Assign two blocks to any two different processors.
7.3.4 The High-Level Platform Specification

The high-level platform specification is an abstraction of the processing, communication, and storage resources of the target platform. In the current implementation, the information stored is as follows:

• Number and type of PEs
• Program and data memory size constraints (for each programmable PE)
• Information on the NoC topology. Our target platform uses the STNoC, which is based on the "Spidergon" topology [18]. We include the latency measures for single and multihop communication.
• Constraints on communication engines: number of physical links available for communication with the NoC
7.3.5 Intermediate Format
MultiFlex relies on intermediate representations (IRs) to capture the application, the constraints, and the high-level platform descriptions. The topology of the application—the block declarations and their connectivity—is expressed using an XML-based intermediate format. It is also used to store task annotations, such as the block execution semantics. Other block annotations are used for the application profiling and block assignments. Edges are annotated with the communication volume information.

The IR is designed to support the refinement of the application as it is iteratively mapped to the platform. This implies supporting the multiple abstraction levels involved in the assignment and mapping process described in the next sections.

7.3.6 Model Assumptions and Distinctive Features
In this section, we provide more details about the streaming model. This background information will help in explaining the mapping tools in the next section.
The task specification includes the data type for each I/O port as well as the maximum amount of data consumed or produced on these ports. This information is an important characteristic of the application capture because it is at the foundation of our streaming model: each task has a known computation grain size. This means we know the amount of data required to fire the process function of the task for a single iteration without starving on input data, and we know the maximum amount of output data that can be produced each time. This is a requirement for the nonblocking, or run-to-completion, execution of the task, which simplifies the scheduling and communication infrastructure and reduces the system overhead. Finally, we can quantify the computation requirements of each task for a single iteration.

The run-to-completion execution model allows dissociating the scheduling of the tasks from the actual processing function, providing clear scheduling points. Application developers focus on implementing and optimizing the task functions (using the C language), and on expressing the functionality in a way that is natural for the application, without trying to balance the task loads in the first place. This means each task can work on a different data packet size and have a different computation load. The assignment and scheduling of the tasks can be done in a separate phase (usually performed later), allowing the exploration of the mapping parameters, such as the task assignment and the FIFO and buffer sizes, to be conducted without changing the functionality of the tasks: a basic principle to allow correct-by-construction automated refinement.
The run-to-completion execution model is a compromise, requiring more constrained programming but leading to higher flexibility in terms of mapping. However, in certain cases, we have no choice but to support multiple concurrent execution contexts. We use cooperative threading to schedule special tasks that use a mix of streaming and client–server constructs. Such tasks are able to invoke remote services via client–server (DSOC) calls, including synchronous methods (with return values) that cause the caller task to block, waiting for an answer.
In addition, we are evaluating the pros and cons of supporting tasks with unrestricted I/O and very fine-grain communication. To be able to eventually run several tasks of this nature on the same processor, we may need a software kernel or make use of hardware threading if the underlying platform provides it.
To be able to choose the correct scheduler to deploy on each PE, we have introduced semantic tags, which describe the high-level behavior type of each task. This information is stored in the IR. We have defined a small set of task types, previously listed in Section 7.3.2. This allows a mix of execution models and firing conditions, thus providing a rich programming environment. Having clear semantic tags is a way to ensure the mapping tools can optimize the scheduling and communications on each processor, rather than systematically supporting all features and being designed for the worst case.

The nonblocking execution is only one characteristic of streaming compared to our DSOC client–server message-passing programming model. As opposed to DSOC, our streaming programming model does not provide data marshaling (although, in principle, this could be integrated in the case of heterogeneous streaming subsystems).
When compared to asynchronous concurrent components, another distinction of the streaming model is the data-driven scheduling. In event-based programming, asynchronous calls (of unknown size) can be generated during the execution of a single reaction, and those must be queued. The quantity of events may result in complex triggering protocols to be defined and implemented by the application programmer. This remains a well-known drawback of event-based systems. With the data-flow approach, the clear data-triggered execution semantics and the specification of I/O data ports resolve the scheduling, memory management, and memory ownership problems inherent to asynchronous remote method invocations.

Finally, another characteristic of our implementation of the streaming programming model, which is also shared with our SMP and DSOC models, is the fact that application code is reused "as is," i.e., no source code transformations are performed. We see two beneficial consequences of this common approach. In terms of debugging, it is an asset, since the programmer can use a standard C source-level debugger to verify the unmodified code of the task core functions. The other main advantage is related to profiling. Once again, it is relatively easy for an application engineer to understand and optimize the task functions with a profiling report, because the source code is untouched.
7.4 MultiFlex Streaming Mapping Tools
7.4.1 Task Assignment Tool
The main objective of the MpAssign tool (see Figure 7.5) is to assign application blocks to processors while optimizing two objectives:

1. Balance the task load on all processors.
2. Minimize the inter-processor communication load.
FIGURE 7.5
MpAssign tool.
The inter-processor communication cost is given by the data volume exchanged between two processors, related to each task.

The tool receives as inputs the application capture, the application constraints, and the high-level platform specification. The output of the tool is a set of assignment directives specifying which blocks are mapped on each processor, the average load of each processor, and the cost for each inter-processor communication. The lower portion of Figure 7.5 gives a visual representation of the MpAssign output. The tool provides the visual display of the resulting block assignments to processors.
The algorithm implemented in the MpAssign tool is inspired by Marculescu's research [10] and is based on graph traversal approaches, where ready tasks with minimal cost variance are assigned iteratively. The two main graph traversal approaches implemented in MpAssign are:

• The list-based approach, using mainly the breadth-first principle—a task is ready if all its predecessors are assigned.
• The path-based approach, using mainly the depth-first principle—a task is ready if one predecessor is assigned and it is on the critical path.
The total cost of assigning a task to a processor is a weighted combination of the cost factors:

C_p,task = w1 · Cproc + w2 · Ccomm + w3 · Csucc    (7.1)

This assumes state space exploration for a predefined look-ahead depth. w_i represents the weight associated with each cost factor (Cproc, Ccomm, and Csucc) and indicates the significance of the factor in the total cost C_p,task compared with the other factors. The factors are weighted by the designer to set their relative importance.
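To make the cost-driven selection concrete, the toy greedy loop below weighs a load-balancing term against a communication term when picking a PE for a ready task, in the spirit of Equation 7.1; the look-ahead term Csucc is omitted for brevity, and all data structures here are simplifying assumptions, not MpAssign's implementation.

```c
#include <assert.h>

#define MAX_PE   8
#define MAX_TASK 16

/* Toy model: task t has compute cost comp[t] and exchanges vol[t][u] words
 * with task u; assign[u] is the PE of task u, or -1 if not yet assigned. */
static int pick_pe(int t, int n_pe, int n_task,
                   const int comp[], int vol[][MAX_TASK],
                   const int assign[], int pe_load[],
                   int w1, int w2) {
    int best = 0;
    long best_cost = -1;
    for (int pe = 0; pe < n_pe; pe++) {
        long c_proc = pe_load[pe] + comp[t];   /* load-balancing term       */
        long c_comm = 0;                       /* inter-PE traffic term     */
        for (int u = 0; u < n_task; u++)
            if (assign[u] >= 0 && assign[u] != pe)
                c_comm += vol[t][u];
        long cost = (long)w1 * c_proc + (long)w2 * c_comm;
        if (best_cost < 0 || cost < best_cost) { best_cost = cost; best = pe; }
    }
    pe_load[best] += comp[t];                  /* commit the assignment     */
    return best;
}
```

Raising w2 pulls heavily communicating tasks onto the same PE; raising w1 spreads compute load across PEs, mirroring the trade-off the designer controls through the weights.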
7.4.2 Task Refinement and Communication Generation Tools
The main objective of the MpCompose tool (see Figure 7.6) is to generate one application graph per PE, each graph containing the desired computation blocks from the application, one local scheduler, and the required communication components. To perform this functionality, MpCompose requires the following three inputs:
FIGURE 7.6
MpCompose tool.
• The application capture
• The platform description
• The set of directives, optionally generated by the MpAssign tool

The MpCompose tool relies on a library of abstract communication services that provide different communication mechanisms that can be inserted in the application graph. Three types of services are currently supported by MpCompose:
1. Local bindings, consisting mainly of a FIFO implemented with memory buffers and enabling the intra-processor communication (e.g., block B1 is connected to block B2 via local buffer LB1/2).
2. Global binding FIFOs, which enable the inter-processor communication (e.g., block B1 on PE1 communicates with block B3 on PE2 via external buffers GB1/3).
3. A scheduler on each PE, which is configurable in terms of number and types of blocks and which enables the sharing of a processor between several application blocks.
A set of libraries is used to abstract part of the platform and provide communication and synchronization mechanisms (point-to-point communication, semaphores, access to shared memory, access to I/O, etc.). The various FIFO components have a default depth, but these are configuration values that can be changed during the mapping. Since we support custom data types for I/O port tokens, each element of a FIFO has a certain size that matches the data type and maximum size specified in the intermediate format.
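A binding component of this kind behaves like a bounded circular FIFO with a configurable depth and element size. The following generic single-reader/single-writer sketch illustrates the general shape; it is an assumption for illustration, not the actual MultiFlex component:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Generic bounded FIFO: depth and element size are configuration values,
 * mirroring how the binding components are sized from the IR. */
typedef struct {
    unsigned char *data;
    size_t elem_size, depth;
    size_t head, tail, count;
} fifo_t;

static int fifo_init(fifo_t *f, size_t elem_size, size_t depth) {
    f->data = malloc(elem_size * depth);
    if (!f->data) return -1;
    f->elem_size = elem_size; f->depth = depth;
    f->head = f->tail = f->count = 0;
    return 0;
}

/* Non-blocking push: a full FIFO is reported, never waited on. */
static int fifo_push(fifo_t *f, const void *elem) {
    if (f->count == f->depth) return -1;
    memcpy(f->data + f->tail * f->elem_size, elem, f->elem_size);
    f->tail = (f->tail + 1) % f->depth;
    f->count++;
    return 0;
}

/* Non-blocking pop: an empty FIFO is reported, never waited on. */
static int fifo_pop(fifo_t *f, void *elem) {
    if (f->count == 0) return -1;
    memcpy(elem, f->data + f->head * f->elem_size, f->elem_size);
    f->head = (f->head + 1) % f->depth;
    f->count--;
    return 0;
}

static void fifo_free(fifo_t *f) { free(f->data); }
```

The non-blocking push/pop style matches the run-to-completion model: the scheduler, not the FIFO, decides what to do when a buffer is full or empty.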
There is no global central controller: a local scheduler is created on each processor. This component is the main controller and has access to the control interface of all the components it is responsible for scheduling. The proper control interface for each filter task is automatically added, based on the type of filter specified in the application IR, and connected to the scheduler. The implementations of the schedulers are partly generated; for example, the list of filter tasks (a static list) and some setup code for the hardware communication accelerators are automatically created. The core scheduling function can be pulled from a library or customized by the application programmer.

The output of MpCompose is a set of component descriptions, one for each processor. From the point of view of the top-level component definitions, these components are not connected together; however, communicating processors use the platform-specific features to actually implement the buffer-based communication at runtime. The set of independent component definitions allows a monoprocessor component-based infrastructure to be used for compilation.
7.4.3 Component Back-End Compilation
Starting from the set of processor graphs, the component back-end generates the targeted code that can run on the platform. MultiFlex tools currently target the Fractal component model, and more specifically its C implementation [19]. Even though this toolset supports features such as a binding controller and a life-cycle manager to allow dynamic insertion/removal of components in the graph at runtime, we are not currently using any of the dynamic features of components, such as runtime elaboration, introspection, etc., mainly for code size reasons. Nevertheless, we expect multimedia application requirements to push toward this direction. Until then, we mainly use the component model as a back-end to represent the software architecture to be built on each processor. MpCompose generates one architecture (.fractal) file describing the components and their topology for each CPU. The Fractal tools will generate the required C glue code to bind components and to create block instance structures, and will compile all the code into an executable for the specified processor by invoking the target cross-compiler. This build process is invoked for each PE, thus producing a binary for each processor.
7.4.4 Runtime Support Components
The main services provided by the MultiFlex components at runtime are scheduling and communication. The scheduler in fact controls both the communication components and the application tasks.
The scheduler interleaves communication and processing at the block level. For each input port, the scheduler scans whether there is available data in the local memory. If not, it checks if the input FIFO is empty. If not, the scheduler orders the input FIFO to perform the transfer into local memory. This is typically done by coprocessors such as DMAs or specialized hardware communication engines. While the transfer occurs, the scheduler can manage other tasks. In the same manner, it can look for previously produced output data ready to be transmitted from local memory to another processor, using an output FIFO. Tasks with more dynamic (data-dependent) behaviors may produce less data than their allowed maximum, including no data at all. If a task is ready to execute, the scheduler simply calls its process function in the same context. The user tasks make use of an API that is based on pointers; thus we avoid data copies between the tasks and the local queues managed by the scheduler.

In a nutshell, the run-to-completion model allows the scheduler to run ready tasks and manage the input and output data consumed or produced by the tasks, while allowing data transfers to take place in parallel, thus overlapping communication and processing without the need for threading. The tasks can have different computation and communication costs: the mapping tools will help to balance the overall task load between processors, with the objective to keep the streaming fabric busy and the latency minimized.
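The interleaving described above can be sketched as one non-blocking scheduler pass. The task bookkeeping below is a deliberately simplified model: transfers complete instantly here, whereas on the platform a DMA would perform them in parallel with computation.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative task descriptor; the real scheduler structures are not
 * given in the text, so these fields are assumptions. */
typedef struct {
    bool in_local;     /* input data already in local memory?  */
    bool in_fifo;      /* data waiting in the input FIFO?      */
    bool out_pending;  /* produced data waiting to be shipped? */
    int  fired;        /* completed process() calls            */
} task_t;

/* One pass: start input transfers, drain outputs, then fire every task
 * whose inputs are local. Run-to-completion means each fired task returns
 * before the next one is considered, so no threading is needed. */
static void scheduler_pass(task_t *tasks, int n) {
    for (int i = 0; i < n; i++) {
        task_t *t = &tasks[i];
        if (!t->in_local && t->in_fifo) {  /* order FIFO -> local memory  */
            t->in_fifo = false;
            t->in_local = true;
        }
        if (t->out_pending)                /* ship local -> output FIFO   */
            t->out_pending = false;
        if (t->in_local) {                 /* ready: call process()       */
            t->in_local = false;
            t->fired++;
            t->out_pending = true;
        }
    }
}
```

Even in this toy form, the pass shows the key property: a task that is waiting for a transfer does not stall the loop, so communication and processing overlap across tasks.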
7.5 Experimental Results
7.5.1 3G Application Mapping Experiments
In this section, we present mapping results using the MpAssign tool on an application graph having the characteristics of a 3G WCDMA/FDD base-station application from [13].

The block diagram of this application is presented in Figure 7.7 and contains two main chains: transmitter (tx) and receiver (rx). The blocks
FIGURE 7.7
3G application block diagram. The communication volumes are a = 260, b = 3136, c = 1280, and d = 768.
are annotated with numbers that represent the estimated processing load, while each edge has an estimated communication volume given in the figure caption. These numbers are extracted from [13], where the computation cost corresponds to a latency (in microseconds) for a PE to execute one iteration of the corresponding functional block, while the edge cost corresponds to the volume of data (in 16-bit words) transferred at each iteration between the connected functional blocks.

A manual and static mapping of this application is presented in [13], using a 2D mesh of 46 PEs, where each PE executes only one of the functional blocks, some of which are duplicated to expose more potential parallel processing. We use this example in this chapter mainly for illustrative purposes, to show that MpAssign can be used to automatically explore different mappings where, optionally, multiple functional blocks can be mapped on the same PE to balance the processing load. To expose more potential parallel processing, we create a set of functionally equivalent application graphs of the above reference application in which we duplicate the transmitter and receiver processing chains several times. In our experiments, four versions have been explored:
• v1: 1 transmitter and 1 receiver (original reference application)
• v2: 2 transmitters and 2 receivers
• v3: 3 transmitters and 3 receivers
• v4: 4 transmitters and 4 receivers
Version v1 will be mapped on a 16-processor architecture (v1/16). Version v2 will be mapped on a 16-processor architecture (v2/16) and a 32-processor architecture (v2/32). Version v3 will be mapped on a 32-processor architecture (v3/32) and a 48-processor architecture (v3/48). Version v4 will be mapped on a 48-processor architecture (v4/48). This results in six different mapping configurations (v1/16, v2/16, v2/32, v3/32, v3/48, v4/48) to explore.
For the experiments, we suppose that each PE can execute any of the functional blocks, and that the NoC connecting all the PEs is the STMicroelectronics Spidergon [18].

As described in Section 7.4.1, "Task Assignment Tool," our mapping heuristic allows exploring different solutions in order to find a good compromise between communication and PE load balancing. These different solutions can be obtained by varying the parameters w1, w2, and w3 (see Equation 7.1). A high value of w1 promotes solutions with good load balancing, while a high value of w2 promotes solutions with minimal communications. The parameter w3, which favors the selection based on an optimistic look-ahead search, will be fixed at 100. For our experiments, three combinations of w1 and w2 will be studied:
• c1 (w1 = 1000, w2 = 10): This weight combination tends to maximize the load balancing.
• c2 (w1 = 100, w2 = 100): This weight combination tends to balance load and communications.
• c3 (w1 = 10, w2 = 1000): This weight combination tends to minimize the communications.

Each of the six configurations described above will be tested with these three weight parameter combinations, which results in a total of 18 experiments. For each experiment, we will extract the following statistics:
• Load variance (LV), given by Equation 7.2, where for each mapping solution x, load(PE_i) is the sum of the task costs assigned to PE_i, avgload is the average load defined by the sum of all task costs divided by the number of PEs, and p is the number of PEs:

LV(x) = (1/p) · Σ_{i=0}^{p−1} (load(PE_i) − avgload)²    (7.2)

• Maximal load (ML), defined as max(load(PE_i)), where 0 ≤ i ≤ p − 1.
• Total communication (TC), given by the sum of each edge cost times the NoC distance of the route related to that edge.
• Maximal communication (MC), the maximum communication cost found between any two PEs.
The LV statistic gives an approximation of the quality of the load balancing. The ML statistic is related to the lower bound of the application performance after mapping, since the application throughput depends on the slowest PE processing. The MC statistic gives a lower bound on the application performance as well, but this time with respect to the worst case of communication contention (instead of with respect to processing, in the case of ML). Finally, the TC indicator gives an approximation of the quality of the communication mapping.
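As a worked illustration of the four indicators, the toy functions below compute them for given per-PE loads and hop-weighted edge costs. LV is computed here as the population variance of the loads, which is one plausible reading of Equation 7.2; the input numbers in the usage below are invented.

```c
#include <assert.h>

/* LV: variance of the per-PE loads around the average load. */
static double load_variance(const double load[], int p) {
    double avg = 0.0, lv = 0.0;
    for (int i = 0; i < p; i++) avg += load[i];
    avg /= p;
    for (int i = 0; i < p; i++)
        lv += (load[i] - avg) * (load[i] - avg);
    return lv / p;
}

/* ML: load of the most heavily loaded PE. */
static double max_load(const double load[], int p) {
    double ml = load[0];
    for (int i = 1; i < p; i++)
        if (load[i] > ml) ml = load[i];
    return ml;
}

/* TC and MC over per-edge costs (edge volume times NoC hop distance):
 * TC sums them, MC keeps the worst single cost. */
static void comm_stats(const double cost[], int edges,
                       double *tc, double *mc) {
    *tc = 0.0; *mc = 0.0;
    for (int i = 0; i < edges; i++) {
        *tc += cost[i];
        if (cost[i] > *mc) *mc = cost[i];
    }
}
```

With loads {100, 300} and edge costs {260, 3136, 768}, these give LV = 10000, ML = 300, TC = 4164, and MC = 3136, making concrete how an imbalanced mapping and one hot link dominate the indicators.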
Figure 7.8 shows the resulting LV statistic for the different application configurations and mapping weight combinations. The best results are given by the mapping weight combination c1. This is predictable because c1 promotes solutions with good load balancing, which means a low LV value.

Figure 7.9 presents the resulting ML statistics for the different application configurations and mapping weight combinations. Following the same logic
as with Figure 7.8, the best results here are given by the mapping weight combination c1.

Figure 7.10 presents the resulting TC statistic for the different application configurations and mapping weight combinations. This time, the best results are given by the mapping weight combination c3. This is predictable because c3 promotes solutions with low communication costs.

Figure 7.11 presents the resulting MC statistic for the different application configurations and mapping weight combinations. Contrary to Figure 7.10,