Model-Based Design for Embedded Systems
FIGURE 9.6
MPSoC programming steps (stage labels: partitioning and mapping; mapping communication on HW resources; SW adaptation to specific HW communication implementation)
The result of each of these four phases represents a step in the software and communication refinement process. The refinement is an incremental process. At each stage, additional software components and architecture details are integrated with the previously generated and validated components. This results in a gradual transformation of a high-level representation, with abstract components and high-level programming models, into concrete low-level executable software code. The transformation has to be validated at each design step. The validation can be performed by formal analysis, simulation, or a combination of simulation and formal analysis [23]. In the following, we use simulation-based validation to ensure that the system behavior respects the initial specification.
During the partitioning and mapping of the application on the target architecture, the relationship between application and architecture is defined. This refers to the number of application tasks that can be executed in parallel, the granularity of these tasks (coarse grain or fine grain), and the association between tasks and the processors that will execute them.
The result of this step is the decomposition of the application into tasks and the association between tasks and processors. The resulting model is the system architecture model. The system architecture model represents a functional description of the application specification, combined with the partitioning and mapping information. Aspects related to the architecture model (e.g., processing units available in the target hardware platform) are combined into the application model (i.e., multiple tasks executed on the processing units). Thus, the system architecture model expresses parallelism in the target application by capturing the mapping of the functions into tasks and of the tasks into subsystems. It also makes explicit the communication units that abstract the intra-subsystem communication protocols (the communication between the tasks inside a subsystem) and the inter-subsystem communication protocols (the communication between different subsystems).
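The distinction between intra-subsystem and inter-subsystem communication follows directly from the task-to-subsystem mapping. The fragment below is a minimal sketch of that idea and not the representation used by the authors' tools; the task names, subsystem names, and the `classify` helper are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mapping of tasks onto subsystems (names are illustrative only).
MAPPING = {"T1": "SS_A", "T2": "SS_A", "T3": "SS_B"}


@dataclass(frozen=True)
class Channel:
    """A communication unit connecting a producer task to a consumer task."""
    producer: str
    consumer: str


def classify(channel, mapping):
    """Return 'intra' if both tasks share a subsystem, otherwise 'inter'."""
    same = mapping[channel.producer] == mapping[channel.consumer]
    return "intra" if same else "inter"


CHANNELS = [Channel("T1", "T2"), Channel("T2", "T3")]
KINDS = {(c.producer, c.consumer): classify(c, MAPPING) for c in CHANNELS}
```

Changing only `MAPPING` changes which channels are classified as inter-subsystem, which is why the mapping decision has to be fixed before the communication is refined.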
The second step implements the mapping of communication onto the hardware platform resources. In this phase, the different links used for the communication between the different tasks are mapped onto the hardware resources available in the architecture to implement the specified protocol. For example, a FIFO communication unit can be mapped to a hardware queue, a shared memory, or some kind of bus-based device. The task code is adapted to the communication mechanism through the use of adequate HdS communication primitives. The resulting model is named the virtual architecture model.
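To illustrate how the same task code can run on top of different communication resources, the following sketch binds each logical channel to a resource object and routes all traffic through two primitives. It is an illustration under assumptions: the signatures of `send_data` and `recv_data`, the channel names, and both resource classes are hypothetical and are not taken from the HdS API.

```python
from collections import deque


class HardwareQueue:
    """Stand-in for a hardware FIFO: words are read in arrival order."""

    def __init__(self):
        self._words = deque()

    def write(self, word):
        self._words.append(word)

    def read(self):
        return self._words.popleft()


class SharedMemoryBuffer:
    """Stand-in for a shared-memory region managed as a circular buffer."""

    def __init__(self, size):
        self._mem = [0] * size
        self._head = 0   # next slot to read
        self._tail = 0   # next slot to write
        self._count = 0  # number of words currently stored

    def write(self, word):
        if self._count == len(self._mem):
            raise BufferError("buffer full")
        self._mem[self._tail] = word
        self._tail = (self._tail + 1) % len(self._mem)
        self._count += 1

    def read(self):
        if self._count == 0:
            raise BufferError("buffer empty")
        word = self._mem[self._head]
        self._head = (self._head + 1) % len(self._mem)
        self._count -= 1
        return word


# Communication mapping: each logical channel is bound to one resource.
CHANNEL_MAP = {"ch_a": HardwareQueue(), "ch_b": SharedMemoryBuffer(size=4)}


def send_data(channel, words):
    """Write a list of words to whatever resource the channel is mapped on."""
    for word in words:
        CHANNEL_MAP[channel].write(word)


def recv_data(channel, count):
    """Read `count` words from the resource the channel is mapped on."""
    return [CHANNEL_MAP[channel].read() for _ in range(count)]
```

Remapping a channel only changes the entry in `CHANNEL_MAP`; the calls made by the task code stay the same.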
The next step of the proposed flow consists of software adaptation to the specific hardware communication implementation. At this stage, aspects of the communication protocol are detailed; for example, the synchronization mechanism between the different processors running in parallel becomes explicit. The software code has to be adapted to the synchronization method, such as events or semaphores. This can be done by using the services of the OS and communication components of the software stack. The resulting model is the Transaction Accurate Architecture model.
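As an illustration of synchronization becoming explicit, the sketch below implements a one-slot channel guarded by two semaphores. It uses Python threads as a stand-in for processors running in parallel; the class name, method names, and `run_pipeline` helper are hypothetical and do not reproduce the OS or communication library described in this chapter.

```python
import threading


class SyncChannel:
    """One-slot channel guarded by two semaphores (empty/full handshake)."""

    def __init__(self):
        self._slot = None
        self._empty = threading.Semaphore(1)  # the slot is free
        self._full = threading.Semaphore(0)   # the slot holds data

    def send_data(self, value):
        self._empty.acquire()   # wait until the consumer has emptied the slot
        self._slot = value
        self._full.release()    # signal that data is available

    def recv_data(self):
        self._full.acquire()    # wait until the producer has filled the slot
        value = self._slot
        self._empty.release()   # signal that the slot is free again
        return value


def run_pipeline(values):
    """Send `values` from a producer thread to a consumer thread, in order."""
    chan = SyncChannel()
    received = []

    def producer():
        for v in values:
            chan.send_data(v)

    def consumer():
        for _ in values:
            received.append(chan.recv_data())

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

An event-based scheme would replace the two semaphores with notifications, but the task code calling `send_data` and `recv_data` would not need to change.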
The final step corresponds to the specific adaptation of the software to the target processors. Processor-dependent software code (HAL) is integrated into the software stack to allow low-level access to the hardware resources and the final memory mapping. The resulting model is called the Virtual Prototype model.
These different steps of the global flow correspond to the generation and validation of different software components at different abstraction levels.
9.6 Experiments with H.264 Encoder Application
In this section, we apply the proposed programming environment to a complex MPSoC architecture. The target application is the H.264 encoder, also called AVC (advanced video coding). First, the specification of the target architecture and application is given; then the programming steps at the system architecture, virtual architecture, transaction accurate architecture, and virtual prototype levels are described in turn.
9.6.1 Application and Architecture Specification
The H.264 encoder application is a video processing multimedia application that supports coding and decoding of 4:2:0 YUV video formats [24]. The main functions of the H.264 encoder are illustrated in Figure 9.7. The input frame is processed in units of macroblocks, each consisting of 16 × 16 pixels. To encode a macroblock, there are three main steps: (1) prediction, with the main blocks motion estimation (ME), motion compensation (MC), and frame filtering; (2) transformation with quantization (T, Q, and Reorder); and (3) entropy encoding (CABAC in this case). The H.264 standard supports seven sets of capabilities, referred to as profiles, each targeting a specific class of applications. In this section, the main profile is used as an application case study.
FIGURE 9.7
H.264 encoder
FIGURE 9.8
Diopsis R2DT with Hermes NoC
The target MPSoC architecture is named the Diopsis R2DT (RISC + 2 DSP) tile [25]. As shown in Figure 9.8, it contains three SW-SS: one ARM9 RISC processor subsystem and two ATMEL magicV VLIW DSP processing subsystems.
The hardware nodes represent the global external memory (DXM) and the POT (peripherals on tile) subsystem. The POT subsystem contains the peripherals of the ARM9 processor and the I/O peripherals of the tile. All three processors may access the local memories and registers of the other processors and also the distributed external memory (DXM). The different subsystems are interconnected using the Hermes network on chip (NoC), which supports two types of topologies: Mesh and Torus [26].
9.6.2 Programming at the System Architecture Level
Programming at the system architecture level consists of functional modeling of the application, partitioning the application into tasks, and mapping them onto the processing subsystems.
The H.264 application functions are therefore mapped onto the available SW-SS, as shown in Figure 9.9. The DSP1-SS is responsible for encoding a frame of the video sequence. The DSP2-SS compresses the encoded frame. The ARM9-SS creates the final bitstream and computes the bit-rate controller. The application executes in a pipelined fashion and requires three application data transfers between the processors: COMM1 between DSP1 and DSP2, COMM2 between DSP2 and ARM9, and COMM3 between ARM9 and DSP1.
The resulting system architecture is modeled using the Simulink environment. To validate the H.264 encoder algorithm, the system architecture model is simulated using a discrete-time simulation engine. The input test video is a 10-frame video sequence in QCIF YUV 4:2:0 format. The simulation requires approximately 30 s on a PC running at 1.73 GHz with 1 GB of RAM.
FIGURE 9.9
System architecture model of H.264
The H.264 simulation allowed validating the functionality and also measuring early execution requirements. The total number of iterations necessary to encode the 10-frame video sequence was equal to the number of frames, because all the application functions implemented in Simulink operate at the frame level. The communication between the DSP1 and DSP2 processors uses a communication unit that requires a buffer of 288,585 words to transmit the encoded frame from DSP1 to DSP2 for compression. The DSP2 and ARM9 processors communicate through a communication unit that requires a buffer of 19,998 words. The last communication unit, between the ARM9 and DSP1 processors, requires a buffer of one word to store the quanta value required by the encoder. The total number of words exchanged between the different subsystems during the encoding of the 10-frame video sequence, using the main profile configuration of the encoder algorithm, was approximately 3085 kWords.
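The reported total can be cross-checked from the three buffer sizes. The short calculation below assumes that each buffer is transferred exactly once per frame, which is consistent with the frame-level operation described above but is not stated explicitly; the dictionary keys are illustrative labels for the three communication units.

```python
# Buffer sizes in words, as reported for the three communication units.
BUFFER_WORDS = {
    "DSP1->DSP2": 288_585,
    "DSP2->ARM9": 19_998,
    "ARM9->DSP1": 1,
}
FRAMES = 10

# Assumption: every buffer is transferred once per encoded frame.
words_per_frame = sum(BUFFER_WORDS.values())
total_words = words_per_frame * FRAMES
total_kwords = total_words / 1000
```

Under this assumption the total is 3,085,840 words, in line with the approximate figure of 3085 kWords.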
9.6.3 Programming at the Virtual Architecture Level
Programming at the virtual architecture level consists of generating the C code for each task from the system architecture model. The generated task code for the H.264 encoder application uses the send_data()/recv_data() APIs for the communication primitives and is optimized in terms of data memory requirements.
Table 9.4 shows the task code and data size of the software at the virtual architecture level. The first two columns give the code size and the data size, respectively, of the functions that are independent of the design and optimization methods and that are part of an independent library. The third and fourth columns show the code and data size obtained with memory optimization techniques.
TABLE 9.4
Task Code Generation for H.264 Encoder
Columns: Library Code, Library Data, Multitasking Code, Multitasking Data
FIGURE 9.10
Global view of Diopsis R2DT running H.264
The hardware at the virtual architecture level consists of a SystemC hardware platform made of abstract processor subsystems and interconnect components. Figure 9.10 illustrates a conceptual view of the virtual architecture for the Diopsis R2DT with Hermes NoC.
The virtual architecture can be simulated not only to validate the task code, but also to gather early performance measurements that profile the interconnect load, for instance the number of words exchanged between the tasks through the network component or the total number of packets initiated for transfer by the various subsystems.
Figure 9.11 shows the total number of words passed through the NoC for different communication mapping schemes. When all the communication buffers are mapped on the DXM memory, as shown in Figure 9.10, the NoC transfers 6,171,680 words during the encoding of the 10 frames. In another case, comm1 is mapped on DXM, comm2 on REG2, and comm3 on DMEM1; this case required 5,971,690 words to be transferred through the NoC. A third case maps comm1 on DMEM1, comm2 on DMEM2, and comm3 on SRAM, and generates 3,085,840 words handled by the NoC.
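The three reported totals are consistent with a simple traffic model, sketched below. The model is our interpretation and is not given in the text: a word placed in DXM is assumed to cross the NoC twice (once when written, once when read), whereas a word placed in a memory local to the producer or the consumer is assumed to cross it once. The `CROSSINGS` table and the `noc_words` helper are hypothetical.

```python
# Buffer sizes in words for the three communication units.
BUFFER_WORDS = {"comm1": 288_585, "comm2": 19_998, "comm3": 1}
FRAMES = 10

# Assumed number of NoC crossings per word (interpretation, not from the text):
# DXM is remote to both endpoints; a local memory is remote to only one.
CROSSINGS = {"DXM": 2, "LOCAL": 1}


def noc_words(mapping):
    """Total words crossing the NoC for a buffer-to-memory mapping."""
    return sum(BUFFER_WORDS[name] * FRAMES * CROSSINGS[kind]
               for name, kind in mapping.items())


ALL_DXM = {"comm1": "DXM", "comm2": "DXM", "comm3": "DXM"}
# comm1 on DXM, comm2 on REG2, comm3 on DMEM1
MIXED = {"comm1": "DXM", "comm2": "LOCAL", "comm3": "LOCAL"}
# comm1 on DMEM1, comm2 on DMEM2, comm3 on SRAM
ALL_LOCAL = {"comm1": "LOCAL", "comm2": "LOCAL", "comm3": "LOCAL"}

totals = {
    "all_dxm": noc_words(ALL_DXM),
    "mixed": noc_words(MIXED),
    "all_local": noc_words(ALL_LOCAL),
}
```

Under these assumptions the model yields 6,171,680, 5,971,690, and 3,085,840 words, matching the three measured cases. The mapping of comm1 dominates the traffic because its buffer is by far the largest.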
FIGURE 9.11
Words transferred through the NoC (legend: Read/Write, Total Sent)
In all the communication mapping schemes, the simulation time required to encode the 10 image frames in QCIF YUV 4:2:0 format was approximately 40 s on a PC running Linux at 1.73 GHz.
9.6.4 Programming at the Transaction Accurate Architecture Level
Programming at the transaction accurate architecture level means building each software stack running on the processors. This consists of combining the task code with the OS and communication libraries. Thus, the H.264 task code previously designed is combined with a tiny OS, necessary for interrupt management and task initialization, and with the implementation of the send_data()/recv_data() communication primitives. Each processor executes a single task on top of the OS.
The transaction accurate architecture of the Diopsis R2DT tile with Hermes NoC is illustrated in Figure 9.12. The hardware platform is composed of the three processor subsystems (ARM9-SS, DSP1-SS, and DSP2-SS), one global MEM-SS, and the peripherals on tile subsystem (POT-SS), with the local architecture of every subsystem detailed. The different subsystems are interconnected through an explicit Hermes NoC.
The simulation of the transaction accurate architecture allows validating the integration of the task code with the OS and communication libraries, and it also provides better performance estimates, such as communication performance.
At this level, in order to analyze the overall system performance, we experimented with several communication architectures by changing the interconnection component and/or the communication mapping scheme. The NoC allows various mapping schemes of the IPs over the NoC, with different impacts on performance. In this work, two different mappings of the IP cores over the Mesh and Torus NoC are tested: Scheme A and Scheme B. Figure 9.13 summarizes these schemes by presenting the correspondence between each network interface and its IP core (e.g., the MEM-SS corresponds to the network interface whose x and y coordinates are 1).
FIGURE 9.12
Transaction accurate architecture of the Diopsis R2DT tile with Hermes NoC
FIGURE 9.13
IP cores mapping schemes A and B over the NoC
Table 9.6 presents the results of the transaction accurate simulations for various interconnection components (AMBA bus, NoC), with different topologies for the NoC (Torus, Mesh), different IP core mappings over the NoC, and diverse communication buffer mapping schemes. The estimated performance indicators are the estimated execution cycles of the H.264 encoder, the simulation time using the different interconnect components on a PC running at 1.73 GHz with 1 GB of RAM, and the total routing requests for the NoC. These results were evaluated for the two IP mapping schemes shown in Figure 9.13 (A and B) and for three communication buffer mapping schemes: DXM+DXM+DXM, DMEM1+DMEM2+SRAM, and DMEM1+SRAM+DXM. The AMBA bus had the best performance, as it required the fewest clock cycles during execution for all the communication mapping schemes. The Mesh NoC attained the worst performance when all the communication buffers were mapped onto the DXM, and performance similar to the Torus when the local memories were used.
This is explained by the small number of subsystems interconnected through the NoC. In fact, NoCs are very efficient in architectures with more than 10 interconnected IP cores, while their performance can be comparable to that of the AMBA bus in less complex architectures. Between the NoCs, the Torus has better path diversity than the Mesh. Thus, the Torus reduces network congestion and decreases the routing requests. Also, Scheme A of the IP core mapping provided better results than Scheme B. In fact, the ideal IP core mapping scheme would have the communicating IPs separated by only one hop (number of intermediate routers) over the network, to reduce latency.
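The effect of topology and IP placement on distance can be sketched with a simple link-count metric. The example below is illustrative only: the 3 × 3 grid size, the two placements (named X and Y to avoid confusion with Schemes A and B), and all function names are hypothetical, and the metric counts the links traversed between routers rather than reproducing the routing algorithm of the Hermes NoC.

```python
def mesh_hops(a, b):
    """Links traversed between routers a and b on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def torus_hops(a, b, width, height):
    """Links traversed on a 2D torus, where rows and columns wrap around."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, width - dx) + min(dy, height - dy)


def total_hops(placement, pairs, distance):
    """Sum the distance over every pair of communicating IP cores."""
    return sum(distance(placement[src], placement[dst]) for src, dst in pairs)


# The three processor-to-processor transfers of the encoder pipeline.
PAIRS = [("DSP1", "DSP2"), ("DSP2", "ARM9"), ("ARM9", "DSP1")]

# Hypothetical (x, y) router coordinates on a 3 x 3 grid.
PLACEMENT_X = {"DSP1": (0, 0), "DSP2": (1, 0), "ARM9": (0, 1)}  # clustered
PLACEMENT_Y = {"DSP1": (0, 0), "DSP2": (2, 0), "ARM9": (0, 2)}  # spread out


def torus_3x3(a, b):
    """Torus distance specialized to the assumed 3 x 3 grid."""
    return torus_hops(a, b, 3, 3)


mesh_x = total_hops(PLACEMENT_X, PAIRS, mesh_hops)
mesh_y = total_hops(PLACEMENT_Y, PAIRS, mesh_hops)
torus_y = total_hops(PLACEMENT_Y, PAIRS, torus_3x3)
```

In this toy setting, the spread-out placement doubles the total distance on the mesh, while the wrap-around links of the torus recover the distance of the clustered placement.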
9.6.5 Programming at the Virtual Prototype Level
Programming at the virtual prototype level consists of integrating the HAL layer into the software stack for each particular processor subsystem and to
Figure (virtual prototype level): ARM9-SS, DSP1-SS, DSP2-SS, and MEM-SS connected through the Hermes NoC; each processor runs its software stack (task, HdS API, Comm, OS, HAL API, HAL) on an ISS.
References
1. H. Meyr, Application specific processors (ASIP): On design and implementation efficiency, Proceedings of SASIMI 06, Nagoya, Japan, 2006.
6. J. Turley, Survey says: Software tools more important than chips, [...] embedded.com/columns/surveys/160700620?_requestid=177492
7. MPICH (MPI implementation), http://www-unix.mcs.anl.gov/mpi/mpich/index.htm
8. W. Wolf, High-Performance Embedded Computing: Architectures, [...] San Francisco, CA, 2006.
9. D. Culler, J.P. Singh, A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, [...] CA, August 1998, ISBN 1558603433.
10. P. Paulin, C. Pilkington, M. Langevin, E. Bensoudane, D. Lyonnard, O. Benny, B. Lavigueur, D. Lo, G. Beltrame, V. Gagne, G. Nicolescu, Parallel programming models for a multi-processor SoC platform applied to networking and multimedia, IEEE Transactions on VLSI Journal, 14(7), [...]
12. A. Jerraya, W. Wolf, Hardware-software interface codesign for embedded systems, Computer, 38(2), 63–69, February 2005.
13. D. Skillicorn, D. Talia, Models and languages for parallel computation, [...]
14. A. Jerraya, A. Bouchhima, F. Petrot, Programming models and HW-SW interfaces abstraction for multi-processor SoC, Proceedings of DAC 2006, San Francisco, CA, 2006, pp. 280–285.
15. Simulink, The MathWorks Inc., http://www.mathworks.com
16. F. Ghenassia, Transaction-Level Modeling with SystemC: TLM Concepts and [...]
[...] centric approach, Special Session, Proceedings of CODES+ISSS 2004, Stockholm, Sweden, September 2004.
19. D.R. Butenhof, Programming with POSIX Threads, Addison Wesley, Boston, MA, May 1997.
20. E. Cheong, J. Liebman, J. Liu, F. Zhao, TinyGALS: A programming model for event-driven embedded systems, Proceedings of 2003 ACM Symposium [...]
21. J.A. Rowson, Hardware/software cosimulation, Proceedings of DAC 1994, San Diego, CA, June 6–10, 1994, pp. 439–440.
[...] co-verification in C/C++, Proceedings of ASP-DAC 2000, Yokohama, Japan, 2000, pp. 405–408.
23. S. Kunzli, F. Poletti, L. Benini, L. Thiele, Combining simulation and formal methods for system-level performance analysis, Proceedings of DATE [...]
24. J.-W. Chen, C.-Y. Kao, Y.-L. Lin, Introduction to H.264, Proceedings of [...]
25. P.S. Paolucci, A.A. Jerraya, R. Leupers, L. Thiele, P. Vicini, SHAPES: A tiled scalable software hardware architecture platform for embedded systems, [...]
26. F. Moraes et al., HERMES: An infrastructure for low area overhead packet-switching networks-on-chip integration, VLSI Journal, 38(1), 2004, 69–93.
Platform-Based Design and Frameworks:
Felice Balarin, Massimiliano D’Angelo, Abhijit Davare, Douglas
Densmore, Trevor Meyerowitz, Roberto Passerone, Alessandro Pinto, Alberto Sangiovanni-Vincentelli, Alena Simalatsar, Yosinori Watanabe, Guang Yang, and Qi Zhu
CONTENTS
10.1 Introduction
10.2 Platform-Based Design
10.2.1 Design Challenge
10.2.2 Principles of Platform-Based Design
10.2.2.1 PBD Flow
10.2.2.2 “Fractal” Nature of PBD: Successive Refinements
10.2.2.3 Design Parameters for PBD
10.3 METROPOLIS Design Environment
10.3.1 Overview
10.3.2 METROPOLIS Meta-Model
10.3.2.1 Function Modeling
10.3.2.2 Architecture Modeling
10.3.2.3 Mapping
10.3.2.4 Recursive Paradigm of Platforms
10.3.3 METROPOLIS Tools
10.3.3.1 Simulation
10.3.3.2 Formal Property Verification
10.3.3.3 Simulation Monitor
10.3.3.4 Quasi-Static Scheduling
10.4 METRO II Design Environment
10.4.1 Overview
10.4.2 METRO II Design Elements
10.4.2.1 Components
10.4.2.2 Ports
10.4.2.3 Constraint Solvers
10.4.2.4 Annotators and Schedulers
10.4.2.5 Mappers
10.4.2.6 Adaptors
10.4.3 METRO II Semantics
10.4.3.1 Three-Phase Execution
10.4.3.2 Semantics of Required/Provided Ports
10.4.3.3 Semantics of Mapping
10.5 Related Work
10.5.1 Origin of METRO II: From Polis to METROPOLIS
10.5.2 Industrial Approaches
10.5.3 Academic Approaches
10.6 Case Studies
10.6.1 UMTS
10.6.1.1 Functional Modeling
10.6.1.2 Architectural Modeling
10.6.1.3 Mapped System
10.6.1.4 Results
10.6.1.5 Conclusions
10.6.2 Intelligent Buildings: Indoor Air Quality
10.7 Conclusions
Acknowledgments
References
10.1 Introduction
System-level design (SLD) means many different things to many different people. In our view, SLD is about the design of a whole that consists of several components, where specifications are given in terms of functionality along with:
• Constraints on the properties the design has to satisfy
• Constraints on the components that are available for implementation
• Objective functions that express the desirable features of the design when completed
This definition is general since it relates to many application domains, from semiconductors to systems such as cars, airplanes, buildings, telecommunications, and biological systems. To deal with system-level problems, our view is that the issue to address is not developing new tools, albeit they are essential to advance the state of the art in design; rather, it is the understanding of the principles of system design, the necessary change to design methodologies, and the dynamics of the supply chain. Developing this understanding is necessary to define a sound approach to the needs of the system and component industry as they try to serve their customers better, and to develop their products faster and with higher quality. This chapter is about principles and how a unified methodology, together with a supporting software framework, as challenging as it may seem, can be developed to bring the embedded electronics industry to a new level of efficiency.
To demonstrate this view, we will first present the design challenges for future systems and a manifesto espousing the benefits of a unified methodology. We will then summarize a methodology, platform-based design (PBD), that has been developed over the past decade and that we believe can fulfill