Model-Based Design for Embedded Systems - P9


the necessary information needed for each translation step. Based on the task-dependency information that tells how to connect the tasks, the translator determines the number of intertask communication channels. Based on the period and deadline information of tasks, the run-time system is synthesized. With the memory map information of each processor, the translator defines the shared variables in the shared region.

To support a new target architecture in the proposed workflow, we have to add translation rules of the generic API to the translator, make a specific OpenMP translator for data-parallel tasks, and apply the generation rule of task-scheduling codes tailored for the target OS. Each step of the CIC translator will be explained in this section.

8.5.1 Generic API Translation

Since the CIC task code uses generic APIs for target-independent specification, the translation of generic APIs to target-dependent APIs is needed. If the target processor has an OS installed, generic APIs are translated into OS APIs; otherwise, they are translated into communication APIs that are defined by directly accessing the hardware devices. We implement the OS API library and the communication API library, both optimized for each target architecture.

For most generic APIs, API translation is achieved by simple redefinition of the API function. Figure 8.6a shows an example where the translator replaces the MQ_RECEIVE API with a "read_port" function for a target processor with pthread support:

    MQ_RECEIVE(port_id, buf, size);                              /* generic API */
    int read_port(int channel_id, unsigned char *buf, int len) { /* target API */

The read_port function is defined using pthread APIs and the memcpy C library function. However, some APIs need additional treatment: for example, the READ API needs different function prototypes depending on the target architecture, as illustrated in Figure 8.6b. Maeng et al. [14] presented a rule-based translation technique that is general enough to translate any API if the translation rule is defined in a pattern-list file.
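The chapter does not reproduce the body of read_port. The following is a minimal sketch of how such a function might be defined with pthread APIs and memcpy, as the text describes; the channel_t layout, the write_port counterpart, and the fixed channel table are assumptions for illustration, not the actual CIC runtime code.

```c
#include <pthread.h>
#include <string.h>

/* Hypothetical channel descriptor; the real CIC runtime layout is not
 * shown in the chapter. */
typedef struct {
    unsigned char data[256];
    int len;                    /* bytes currently available */
    pthread_mutex_t lock;
    pthread_cond_t  ready;
} channel_t;

static channel_t channels[4];

void init_channel(int id) {
    channels[id].len = 0;
    pthread_mutex_init(&channels[id].lock, NULL);
    pthread_cond_init(&channels[id].ready, NULL);
}

/* Sketch of the target-specific read_port(): block until enough data has
 * arrived on the channel, then copy it out with memcpy. */
int read_port(int channel_id, unsigned char *buf, int len) {
    channel_t *ch = &channels[channel_id];
    pthread_mutex_lock(&ch->lock);
    while (ch->len < len)                   /* wait for a full message */
        pthread_cond_wait(&ch->ready, &ch->lock);
    memcpy(buf, ch->data, len);
    ch->len -= len;
    pthread_mutex_unlock(&ch->lock);
    return len;
}

/* Matching producer side, assumed here for completeness. */
int write_port(int channel_id, const unsigned char *buf, int len) {
    channel_t *ch = &channels[channel_id];
    pthread_mutex_lock(&ch->lock);
    memcpy(ch->data, buf, len);
    ch->len = len;
    pthread_cond_signal(&ch->ready);
    pthread_mutex_unlock(&ch->lock);
    return len;
}
```

A rule-based translator in the style of [14] would then rewrite each MQ_RECEIVE call site into a read_port call according to the pattern-list file.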

8.5.2 HW-Interfacing Code Generation

If there is a code segment contained within a HW pragma section and its translation rule exists in an architecture information file, the CIC translator replaces the code segment with the HW-interfacing code, considering the parameters of the HW accelerator and the buffer variables that are defined in the architecture section of the CIC. The translation rule of HW-interfacing code for a specific HW is separately specified as a HW-interface library code. Note that some HW accelerators work together with other HW IPs. For example, a HW accelerator may notify the processor of its completion through an interrupt; in this case an interrupt controller is needed. The CIC translator generates a combination of the HW accelerator and interrupt controller, as shown in the next section.

8.5.3 OpenMP Translator

If an OpenMP compiler is available for the target, then task codes with OpenMP directives can be used easily. Otherwise, we somehow need to translate the task code with OpenMP directives to a parallel code. Note that we do not need a general OpenMP translator since we use OpenMP directives only to specify the data parallel CIC task. But we have to make a separate OpenMP translator for each target architecture in order to achieve optimal performance.

For a distributed memory architecture, we developed an OpenMP translator that translates an OpenMP task code to the MPI codes using a minimal subset of the MPI library for the following reasons: (1) MPI is a standard that is easily ported to various software platforms. (2) Porting the MPI library is much easier than modifying the OpenMP translator itself for the new target architecture. Figure 8.7 shows the structure of the translated MPI program.

As shown in the figure, the translated code has the master–worker structure: The master processor executes the entire code while worker processors execute the parallel region only. When the master processor meets the parallel region, it broadcasts the shared data to the worker processors. Then, all processors concurrently execute the parallel region. The master processor synchronizes all the processors at the end of the parallel loop and collects the results from the worker processors. For performance optimization, we have to minimize the amount of interprocessor communication.

FIGURE 8.7
The workflow of translated MPI codes: the master processor first works alone, broadcasts the shared data to the worker processors, all processors then work in the parallel region, the workers send their shared data back, and the master receives and updates the results. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

8.5.4 Scheduling Code Generation

The last step of the proposed CIC translator is to generate the task-scheduling code for each processor core. There will be many tasks mapped to each processor, with different real-time constraints and dependency information.

We remind the reader that a task code is defined by three functions: "{task name}_init(), {task name}_go(), and {task name}_wrapup()." The generated scheduling code initializes the mapped tasks by calling "{task name}_init()" and wraps them up after the scheduling loop finishes its execution, by calling "{task name}_wrapup()."

The main body of the scheduling code differs depending on whether there is an OS available for the target processor. If there is an OS that is POSIX-compliant, we generate a thread-based scheduling code, as shown in Figure 8.8a. A POSIX thread is created for each task (lines 17 and 18) with an assigned priority level if available. The thread, as shown in lines 3 to 5, executes the main body of the task, "{task name}_go()," and schedules the thread itself based on its timing constraints by calling the "sleep()" method. If the OS is not POSIX-compliant, the CIC translator should be extended to generate the OS-specific scheduling code.

If there is no available OS for the target processor, the translator should synthesize the run-time scheduler that schedules the mapped tasks. The CIC translator generates a data structure of each task, containing the three main functions of tasks ("init(), go(), and wrapup()"). With this data structure, a

 1  void *thread_task_0_func(void *argv) {
    ...
 7  task taskInfo[] = { {task1_init, task1_go, task1_wrapup, 100, 0}
 8                    , {task2_init, task2_go, task2_wrapup, 200, 0}};
    ...
18  init();      /* {task_name}_init() functions are called */
19  scheduler(); /* scheduler code */
20  wrapup();    /* {task_name}_wrapup() functions are called */
21  return 0;
22  }

(b)

FIGURE 8.8

Pseudocode of generated scheduling code: (a) if OS is available, and (b) if OS is not available. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)


real-time scheduler is synthesized by the CIC translator. Figure 8.8b shows the pseudocode of a generated scheduling code. The generated scheduling code may be changed by replacing the function "void scheduler()" or "int get_next_task()" to support another scheduling algorithm.
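The task data structure and scheduler loop of Figure 8.8b are only sketched in the chapter. A minimal self-contained version might look as follows; the field order, the termination condition, and the default get_next_task() policy (smallest time count first, as in Figure 8.14a) are assumptions, and a real generated scheduler would also carry deadline and mapping information.

```c
/* Per-task record generated by the CIC translator: the three task
 * functions plus period and a time count used by get_next_task(). */
typedef struct {
    void (*init)(void);
    void (*go)(void);
    void (*wrapup)(void);
    int period;       /* release period in abstract time units */
    int time_count;   /* next release time */
} task;

static int runs[2];   /* invocation counters, for illustration only */
static void t0_init(void) {}
static void t0_go(void)   { runs[0]++; }
static void t0_wrapup(void) {}
static void t1_init(void) {}
static void t1_go(void)   { runs[1]++; }
static void t1_wrapup(void) {}

static task taskInfo[] = { {t0_init, t0_go, t0_wrapup, 100, 0},
                           {t1_init, t1_go, t1_wrapup, 200, 0} };
enum { NTASKS = 2 };

/* Default policy: pick the task with the smallest time count, run it,
 * then advance its time count by its period. */
static int get_next_task(void) {
    int sel = 0;
    for (int i = 1; i < NTASKS; i++)
        if (taskInfo[i].time_count < taskInfo[sel].time_count) sel = i;
    taskInfo[sel].time_count += taskInfo[sel].period;
    return sel;
}

void run_scheduler(int iterations) {
    for (int i = 0; i < NTASKS; i++) taskInfo[i].init();
    for (int n = 0; n < iterations; n++)
        taskInfo[get_next_task()].go();    /* the scheduler() body */
    for (int i = 0; i < NTASKS; i++) taskInfo[i].wrapup();
}
```

With periods 100 and 200, the first task is selected twice as often as the second, which is exactly the "execution frequency keeps the period ratio" behavior described for the default scheduler.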

8.6.1 Design Space Exploration

We specified the functional parallelism of the H.263 decoder with six tasks, as shown in Figure 8.3, where each task is assigned an index. For data parallelism, the data parallel region of the motion compensation task is specified with an OpenMP directive. In this experiment, we explored the design space of parallelizing the algorithm, considering both functional and data parallelisms simultaneously. As is evident in Figure 8.3, tasks 1 to 3 can be executed in parallel; thus, they are mapped to multiple processors with three configurations as shown in Table 8.1. For example, task 1 is mapped to processor 1, and the other tasks are mapped to processor 0 for the second configuration.

FIGURE 8.9
The target architecture for preliminary experiments: two Arm926ej-s processors, each with local memory, an interrupt controller, and HW IPs (HW1, HW2), plus HW3 attached to a shared memory. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

TABLE 8.1
Task Mapping to Processors

The Configuration of Task Mapping
Task 3, Task 4, Task 5
Task 0, Task 2, Task 3, Task 4, Task 5
Task 0, Task 3, Task 4, Task 5

Source: Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.

TABLE 8.2

Execution Cycles for Nine Configurations

The Configuration of Task Mapping The Number of Processors

Table 8.2 shows the performance result for these nine configurations. For functional parallelism, the best performance can be obtained by using two processors, as reported in the first row (the "No OpenMP" case). The H.263 decoder algorithm uses a 4:1:1 format frame, so the computation of Y macroblock decoding is about four times larger than those of the U and V macroblocks. Therefore macroblock decoding of U and V can be merged in one processor during macroblock decoding of Y in another processor. There is no performance gain obtained by exploiting data parallelism. This is because the computation workload of motion compensation is not large enough to outweigh the communication overhead incurred by parallel execution.
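The trade-off in the last two sentences can be made concrete with a toy cost model (all numbers here are illustrative assumptions, not measurements from the chapter): splitting a region over n processors pays off only when the saved computation time exceeds the added communication time.

```c
/* Toy model: a region of `work` cycles split over n processors costs
 * roughly work/n + comm cycles, where comm is the communication
 * overhead added by parallel execution. Returns nonzero if splitting
 * is faster than running sequentially. */
int parallel_pays_off(int work, int n, int comm) {
    return work / n + comm < work;
}
```

For a large workload (say 4000 cycles over 4 processors with 500 cycles of communication) the split wins; for a small one like motion compensation in this experiment (say 400 cycles with the same overhead) it loses, matching the observed result.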

8.6.2 HW-Interfacing Code Generation

Next, we accelerated the code segment of IDCT in the macroblock decoding tasks (task 1 to task 3) with a HW accelerator, as shown in Figure 8.10a. We use the RealView SoC designer to model the entire system including the

#pragma hardware IDCT (output.data, input.data) {
    /* code segments for IDCT */
}

FIGURE 8.10
The HW pragma section and the architecture information used. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

HW accelerator. Two kinds of inverse discrete cosine transformation (IDCT) accelerator are used. One uses an interrupt signal for completion notification, and the other uses polling to detect the completion. The latter is specified in the architecture section as illustrated in Figure 8.10b, where the library name of the HW-interfacing code is set to IDCT_slave and its base address to 0x2F000000.

Figure 8.11a shows the assigned address map of the IDCT accelerator and Figure 8.11b shows the generated HW-interfacing code. This code is substituted for the code segment contained within a HW pragma section. In Figure 8.11b, bold letters are changeable according to the parameters specified in a task code and in the architecture information file; they specify the base address for the HW interface data structure and the input and output port names of the associated CIC task.

Note that the interfacing code uses polling at line 6 of Figure 8.11b. If we use the accelerator with interrupt, an interrupt controller is additionally attached to the target platform, as shown in Figure 8.10c, with information on the code library name, IRQ_CONTROLLER, and its base address 0xA801000. The new IDCT accelerator has the same address map as the previous one, except for


Address (Offset)   I/O Type   Comment
64 ~ 191           Write      Input data
192 ~ 319          Read       Output data
(a)

1  int i;
2  volatile unsigned int *idct_base = (volatile unsigned int *)0x2F000000;
3  while (idct_base[0] == 1);  // try to obtain hardware resource
4  for (i = 0; i < 32; i++) idct_base[i+16] = ((unsigned int *)(input.data))[i];
5  idct_base[1] = 1;           // send start signal to IDCT accelerator
6  while (idct_base[2] == 0);  // wait for completion of IDCT operation
7  for (i = 0; i < 32; i++) ((unsigned int *)(output.data))[i] = idct_base[i+48];
8  idct_base[3] = 1;           // clear and unlock hardware
(b)

FIGURE 8.11
(a) The address map of IDCT, and (b) its generated interfacing code. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

the complete flag. The address of the complete flag (address 8 in Figure 8.11a) is assigned to "interrupt clear."

Figure 8.12a shows the generated interfacing code for the IDCT with interrupt. Note that the interfacing code does not access the HW to check the completion of IDCT, but checks the variable "complete." In the generated code of the interrupt handler, this variable is set to 1 (Figure 8.12b). The initialization code for the interrupt controller ("initDevices()") is also generated and called in the "{task_name}_init()" function.

8.6.3 Scheduling Code Generation

We generated the task-scheduling code of the H.263 decoder while changing the working conditions, OS support, and scheduling policy. At first, we used the eCos real-time OS for arm926ej-s in the RealView SoC designer, and generated the scheduling code, the pseudocode of which is shown in Figure 8.13. In function cyg_user_start() of eCos, each task is created as a thread. The CIC translator generates the parameters needed for thread creation such as stack variable information and stack size (fifth and sixth parameters of cyg_thread_create()). Moreover, we placed "{task_name}_go" in a while loop inside the created thread (lines 10 to 14 of Figure 8.13). Function {task_name}_init() is called in init_task().

Note that TE_main() is also created as a thread. TE_main() checks whether execution of all tasks is finished, and calls "{task_name}_wrapup()" in wrapup_task() before finishing the entire program.


 1  int complete;
 2  ...
 3  volatile unsigned int *idct_base = (volatile unsigned int *)0x2F000000;
 4  while (idct_base[0] == 1);  // try to obtain hardware resource
 5  complete = 0;
 6  for (i = 0; i < 32; i++) idct_base[i+16] = ((unsigned int *)(input.data))[i];
 7  idct_base[1] = 1;           // send start signal to IDCT accelerator
 8  while (complete == 0);      // wait for completion of IDCT operation
 9  for (i = 0; i < 32; i++) ((unsigned int *)(output.data))[i] = idct_base[i+48];
10  idct_base[3] = 1;           // clear and unlock hardware
(a)

1  extern int complete;
2  __irq void IRQ_Handler() {
3      IRQ_CLEAR();       // interrupt clear of interrupt controller
4      idct_base[2] = 1;  // interrupt clear of IDCT
5      complete = 1;      // signal completion to the waiting loop
6  }
(b)

FIGURE 8.12
(a) Interfacing code for the IDCT with interrupt, and (b) the interrupt handler code. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

For a processor without OS support, the current CIC translator supports two kinds of scheduling code: default and rate-monotonic scheduling (RMS). The default scheduler just keeps the execution frequency of tasks considering the period ratio of tasks. Figure 8.14a and b show the pseudocode of the function get_next_task(), which is called in the function scheduler() of Figure 8.8b, for the default and RMS schedulers, respectively.

8.6.4 Productivity Analysis

For the productivity analysis, we recorded the elapsed time to manually modify the software (including debugging time) when we change the target architecture and task mapping. Such manual modification was performed by an expert programmer who is a PhD student.

For a fair comparison of automatic code generation and manual-coding overhead, we made the following assumptions. First, the application task codes are prepared and functionally verified. We chose an H.263 decoder as the application code that consists of six tasks, as illustrated in Figure 8.3. Second, the simulation environment is completely prepared for the initial configuration, as shown in Figure 8.15a. We chose the RealView SoC designer as the target simulator, prepared two different kinds of HW IPs


1  void cyg_user_start(void) {
2      cyg_thread_create(taskInfo[0]->priority, TE_task_0,
3          (cyg_addrword_t)0, "TE_task_0", (void *)&TaskStk[0],
4          TASK_STK_SIZE-1, &handler[0], &thread[0]);

FIGURE 8.13
Pseudocode of an automatically generated scheduler for eCos. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

1  int get_next_task() {
2      a. find executable tasks
3      b. find the tasks that have the smallest value of time count
4      c. select the task that has not been executed for the longest time
5      d. add period to the time count of the selected task
6      e. return selected task id
7  }
(a)

1  int get_next_task() {
2      a. find executable tasks
3      b. select the task that has the smallest period
4      c. update task information
5      d. return selected task id
6  }
(b)

FIGURE 8.14
Pseudocode of "get_next_task()" without OS support: (a) default, and (b) RMS scheduler. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)
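For concreteness, the RMS selection of Figure 8.14b might look like the following sketch. The rms_task record and its executable flag are assumptions for illustration; a real generated scheduler would also track release times and update task state after selection.

```c
/* Hypothetical task record: only the fields the RMS policy needs. */
typedef struct {
    int period;      /* smaller period = higher RMS priority */
    int executable;  /* nonzero if released and not yet finished */
} rms_task;

/* Figure 8.14b: among executable tasks, pick the one with the
 * smallest period; return -1 when no task is ready. */
int rms_next_task(const rms_task *t, int n) {
    int sel = -1;
    for (int i = 0; i < n; i++) {
        if (!t[i].executable) continue;
        if (sel < 0 || t[i].period < t[sel].period) sel = i;
    }
    return sel;
}
```

Because the policy is a pure function over the task table, swapping the default scheduler for RMS only requires replacing get_next_task(), exactly as the text describes for "void scheduler()" and "int get_next_task()".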

for the IDCT function block. Third, the software environment for the target system is prepared, which includes the run-time scheduler and the target-dependent API library.

FIGURE 8.15
Target architectures for the productivity analysis: configurations of Arm926ej-s processors, each with local memory, connected through shared memory. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

of our initial configuration (Figure 8.15a). The simulation porting overhead is directly proportional to the target-dependent code size. In addition, the overhead increases as the total code size increases, since we need to identify the target-dependent codes throughout the entire application code.

Next, we changed the target architecture to those shown in Figure 8.15b and c by using two kinds of IDCT HW IPs. The interface code between the processor and the IDCT HW should be inserted. It took about 2–3 h to write and debug the interfacing code with the IDCT HW IP, without and with the interrupt controller, respectively. The sizes of the interface without and with the interrupt controller were 14 and 48 lines of code, respectively. Note that the overhead will increase if the HW IP has a more complex interfacing protocol.

Last, we modified the task mapping by adding one more processor, as shown in Figure 8.15d. For this analysis, we needed to make an additional data structure of software tasks to link with the run-time scheduler on each processor. It took about 2 h to make the data structure of all tasks and attach


TABLE 8.3
Time Overhead for Manual Software Modification

Modifying HW interface code to use interrupt controller (Figure 8.15a

By contrast, in the proposed framework, design space exploration is simply performed by modifying the architecture information file only, not the task code. Modifying the architecture information file is much easier than modifying the task code directly, and needs only a few minutes. Then the CIC translator generates the target code automatically in a minute. Of course, it requires a significant amount of time to establish the translation environment for a new target. But once the environment is set up for each candidate processing element, we believe that the proposed framework improves design productivity dramatically for design space exploration of various architecture and task-mapping candidates.

8.7 Conclusion

In this chapter, we presented a retargetable parallel programming framework for MPSoC, based on a new parallel programming model called the CIC. The CIC specifies the design constraints and task codes separately. Furthermore, the functional parallelism and data parallelism of application tasks are specified independently of the target architecture and design constraints.

Then, the CIC translator translates the CIC into the final parallel code, considering the target architecture and design constraints, to make the CIC retargetable. Temporal parallelism is exploited by inserting pipeline buffers between CIC tasks, and where to put the pipeline buffers is determined at the mapping stage. We have developed a mapping algorithm that considers temporal parallelism as well as functional and data parallelism [17].

Preliminary experiments with an H.263 decoder example prove the viability of the proposed parallel programming framework: It increases the design productivity of MPSoC software significantly. There are many issues to be researched further in the future, which include the optimal mapping of CIC tasks to a given target architecture, exploration of the optimal target architecture, and optimizing the CIC translator for specific target architectures. In addition, we have to extend the CIC to improve the expression capability of the model.

References

1. Message Passing Interface Forum, MPI: A message-passing interface standard, International Journal of Supercomputer Applications and High Performance Computing, 8(3/4), 1994, 159–416.
2. OpenMP Architecture Review Board, OpenMP C and C++ application program interface, http://www.openmp.org, Version 1.0, 1998.
3. M. Sato, S. Satoh, K. Kusano, and Y. Tanaka, Design of OpenMP compiler for an SMP cluster, in EWOMP'99, Lund, Sweden, 1999.
4. F. Liu and V. Chaudhary, A practical OpenMP compiler for system on chips, in WOMPAT 2003, Toronto, Canada, June 26–27, 2003, pp. 54–68.
5. Y. Hotta, M. Sato, Y. Nakajima, and Y. Ojima, OpenMP implementation and performance on embedded Renesas M32R chip multiprocessor, in EWOMP, Stockholm, Sweden, October, 2004.
6. W. Jeun and S. Ha, Effective OpenMP implementation and translation for multiprocessor system-on-chip without using OS, in 12th Asia and South Pacific Design Automation Conference (ASP-DAC'2007), Yokohama, Japan, 2007, pp. 44–49.
7. R. Eigenmann, J. Hoeflinger, and D. Padua, On the automatic parallelization of the perfect benchmarks(R), IEEE Transactions on Parallel and Distributed Systems, 9(1), 1998, 5–23.
8. G. Martin, Overview of the MPSoC design challenge, in 43rd Design Automation Conference, San Francisco, CA, July, 2006, pp. 274–279.
9. P. G. Paulin, C. Pilkington, M. Langevin, E. Bensoudane, and G. Nicolescu, Parallel programming models for a multi-processor SoC platform applied to high-speed traffic management, in CODES+ISSS 2004.
11. A. Jerraya, A. Bouchhima, and F. Petrot, Programming models and HW-SW interfaces abstraction for multi-processor SoC, in 43rd Design Automation Conference, San Francisco, CA, July 24–28, 2006, pp. 280–285.
12. K. Balasubramanian, A. Gokhale, G. Karsai, J. Sztipanovits, and S. Neema, Developing applications using model-driven design environments, IEEE Computer, 39(2), 2006, 33–40.
13. K. Kim, J. Lee, H. Park, and S. Ha, Automatic H.264 encoder synthesis for the Cell processor from a target independent specification, in 6th IEEE Workshop on Embedded Systems for Real-time Multimedia (ESTIMedia'2008), Atlanta, GA, 2008.
14. J. Maeng, J. Kim, and M. Ryu, An RTOS API translator for model-driven embedded software development, in 12th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA'06), Sydney, Australia, August 16–18, 2006, pp. 363–367.
15. S. Ha, C. Lee, Y. Yi, S. Kwon, and Y. Joo, PeaCE: A hardware-software codesign environment for multimedia embedded systems, ACM Transactions on Design Automation of Electronic Systems (TODAES), 12(3), Article 24, August 2007.
16. products_socd.shtml
17. H. Yang and S. Ha, Pipelined data parallel task mapping/scheduling technique for MPSoC, in DATE 2009, Nice, France, April 2009.
18. S. Kwon, Y. Kim, W. Jeun, S. Ha, and Y. Paek, A retargetable parallel-programming framework for MPSoC, ACM Transactions on Design Automation of Electronic Systems (TODAES), 13(3), Article 39, July 2008.
