Model-Based Design for Embedded Systems - P3


a translation of the binary code into SystemC code generates fast code compared to an interpreting ISS, as no decoding of instructions is needed and the generated SystemC code can be easily used within a SystemC simulation environment. However, this approach has some major disadvantages. One main drawback is that the same problems that have to be solved in static compilation (binary translation) have to be solved here (e.g., addresses of calculated branch targets have to be determined). Another disadvantage is that the automatically generated code is not very easily read by humans.

2.4.1 Back-Annotation of WCET/BCET Values

In this section, we will describe our approach in more detail. Figure 2.6 shows an overview of the approach.

First, the C source code is translated, using an ordinary C (cross-)compiler, into the binary code for the embedded processor (source processor). After that, our back-annotation tool reads the object file and a description of the used source processor. This description contains both a description of the architecture and a description of the instruction set of the processor.

Figure 2.4 shows an example for the description of the architecture. It contains information about the resources of the processor (Figure 2.4a). This information is used for the modeling of the pipeline. Furthermore, it contains a description of the properties of the instruction (Figure 2.4b) and data caches (Figure 2.4c). In addition, such a description can contain information about the branch prediction of the processor.

FIGURE 2.6 Back-annotation of WCET/BCET values. (From Schnerr, J. et al., High-performance timing simulation of embedded software, in: Proceedings of the 45th Design Automation Conference (DAC), Anaheim, CA, pp. 290-295, June 2008. Copyright: ACM. Used with permission.)


FIGURE 2.4 Example for a description of the architecture.

Figure 2.5 shows an example for the description of the instruction set. This description contains information about the structure of the bit image of the instruction code (Figure 2.5c). It also contains information to determine the timing behavior of instructions and the timing behavior of instructions that are executed in context with other instructions (Figure 2.5d). Furthermore, for debugging and documentation purposes, more information about the instruction can be given (Figure 2.5a and b).

Using this description, the object code is decoded and translated into an intermediate representation consisting of a list of objects. Each of these objects represents one intermediate instruction.

In the next step, the basic blocks of the program are determined using this intermediate representation. As a result, a list of basic blocks is built.

After that, the execution time is statically calculated for each basic block with respect to the provided pipeline model of the source processor. This calculation step is described in more detail in Section 2.4.3.

Subsequently, the back-annotation correspondences between the C source code and the binary code are identified. Then, the back-annotation process takes place. This is done by automated code instrumentation for cycle generation and dynamic cycle correction. The structure and functionality of this code are described in Section 2.4.2.
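To make the idea of such instrumentation concrete, the following plain-C++ sketch shows the shape a back-annotated function can take. The counter name `cycle_count`, the per-block cycle values, and the simplified `consume` stub are illustrative assumptions, not the tool's actual generated code:

```cpp
#include <cassert>

// Illustration of back-annotated cycle generation: each basic block of the
// original C code gets its statically determined cycle count added to a
// counter; the consume call (modeled here as a stub) would generate the
// accumulated cycles in the simulation environment.
static long cycle_count = 0;

void consume() { /* in the real model: advance SystemC simulation time */ }

int annotated_max(int a, int b) {
    cycle_count += 4;          // assumed static cycles of the entry block
    if (a > b) {
        cycle_count += 2;      // assumed cycles of the then-block
        consume();
        return a;
    }
    cycle_count += 3;          // assumed cycles of the else-block
    consume();
    return b;
}
```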

Not every impact of the processor architecture on the number of cycles can be predicted statically. Therefore, if dynamic, data-dependent effects (e.g., branch prediction and caches) have to be taken into account, additional code needs to be added. Further details concerning this code are described in Section 2.4.5.

FIGURE 2.5 Example for a description of an instruction. (The shown fragment describes an instruction that left-shifts the contents of data register Da by the amount specified by n, where n can be 0, 1, 2, or 3, adds that value to the contents of address register Ab, and puts the result in address register Ac.)

During back-annotation, the C program is transformed into a cycle-accurate SystemC program that can be compiled to be executed on the processor of the simulation host (target processor).

One advantage of this approach is a fast execution of the annotated code, as the C source code does not need major changes for back-annotation. Moreover, the generated SystemC code can be easily used within a SystemC simulation environment. The difficulty in using this approach is to find the corresponding parts of the binary code in the C source code if the compiler optimizes or changes the structure of the binary code too much. If this happens, recompilation techniques [4] have to be used to find the correspondences.

2.4.2 Annotation of SystemC Code

On the left-hand side of Figure 2.3, there is the necessary annotation of a piece of the C code that corresponds to a basic block. The right-hand side of Figure 2.3 shows the C code corresponding to the cache analysis blocks of the basic block.


In modern processor architectures, the impact of the processor architecture on the number of executed cycles cannot be completely predicted statically. Especially the branch prediction and the caches of a processor have a significant impact on the number of used cycles. Therefore, the statically determined number of cycles has to be corrected dynamically. The partitioning of the basic block for the calculation of additional cycles of instruction cache misses, as shown in Figure 2.3, is explained in Section 2.4.5.


If there is a conditional branch at the end of a basic block, branch prediction has to be considered and possible correction cycles have to be added. This is described in more detail in Section 2.4.5.

As shown in Figure 2.3, the back-annotation tool adds a call to the consume function that performs cycle generation at the end of each basic block's code. If necessary, this call generates the number of cycles this basic block would need on the source processor. How this consume function works is described in Section 2.4.7.

In order to guarantee both an execution of the code that is as fast as possible and the highest possible accuracy, it is possible to choose different accuracy levels of the generated code that parameterize the annotation tool. The first and fastest one is a purely static prediction. The second one additionally includes the modeling of the branch prediction. And the third one also takes the dynamic inclusion of instruction caches into account. The cycle calculation in these different levels will be discussed in more detail in the following sections.

2.4.3 Static Cycle Calculation of a Basic Block

In modern architectures, pipeline effects, superscalarity, and caches have an important impact on the execution time. Because of this, a calculation of the execution time of a basic block by summing the execution or latency times of the single instructions of this block is very inaccurate.

Therefore, the incorporation of a pipeline model per basic block becomes necessary [21]. This model helps statically predict pipeline effects and the effects of superscalarity. For the generation of this model, information about the instruction set and the pipelines of the used processor is needed. This information is contained in the processor description that is used by the annotation tool. With regard to this, the tool uses a model of the pipeline to determine which instructions of the basic block will be executed in parallel on a superscalar processor and which combinations of instructions in the basic block will cause pipeline stalls. Details of this will be described in the next section.

With the information gained by basic block modeling, a prediction is carried out. This prediction determines the number of cycles the basic block would have needed on the source processor.

Section 2.4.5 will show how this kind of prediction is improved during runtime, and how a cache model is included.

2.4.4 Modeling of Pipeline for a Basic Block

As previously mentioned, the processor description contains information about the resources the processor has and about the resources a certain instruction uses. This information about the resources is used to build a resource usage model that specifies microarchitecture details of the used processor.

For this model, it is assumed that all units in the processor, such as functional units, pipeline stages, registers, and ports, form a set of resources. These resources can be allocated or released by every instruction that is executed. This means that the resource usage model is based on the assumption that every time an instruction is executed, it allocates a set of resources and carries out an action. When the execution proceeds, the allocated resources and the carried-out actions change.

If two instructions wait for the same resource, then this is resolved by allocating the resource to the instruction that entered the pipeline earlier. This model is powerful enough to describe pipelines, superscalar execution, and other microarchitectures.

2.4.4.1 Modeling with the Help of Reservation Tables

The timing information of every program construct can be described with a reservation table. Originally, reservation tables were proposed to describe and analyze the activities in a pipeline [32]. Traditionally, reservation tables were used to detect conflicts for the scheduling of instructions [25]. In a reservation table, the vertical dimension represents the pipeline stages and the horizontal dimension represents the time. Figure 2.7 shows an example of a basic block and the corresponding reservation table. In the figure, every entry in the reservation table shows that the corresponding pipeline stage is used in the particular time slot. The entry consists of the number of the instruction that uses the resource. The timing interdependencies between the instructions of a basic block are analyzed using the composition of their reservation tables.

In the reservation table, not only conflicts that occur because of the different pipeline stages, but also data dependencies between the instructions can be considered.

FIGURE 2.7 Example of a basic block and the corresponding reservation table (pipeline stages FI, DI, EX, and WB).
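The table layout described above (rows are pipeline stages, columns are time slots, entries are instruction numbers) can be sketched for an idealized four-stage FI-DI-EX-WB pipeline. The stall-free pipeline behavior is an assumption of this illustration, not the chapter's exact Figure 2.7:

```cpp
#include <vector>

// Build the reservation table of a straight-line sequence of n_instr
// instructions in an ideal 4-stage pipeline (FI, DI, EX, WB): instruction
// i+1 occupies stage s in time slot i+s. An entry of 0 means "free".
std::vector<std::vector<int>> build_reservation_table(int n_instr) {
    const int stages = 4;                     // FI, DI, EX, WB
    const int width = n_instr + stages - 1;   // last instruction leaves WB here
    std::vector<std::vector<int>> table(stages, std::vector<int>(width, 0));
    for (int i = 0; i < n_instr; ++i)
        for (int s = 0; s < stages; ++s)
            table[s][i + s] = i + 1;          // instruction number in the slot
    return table;
}
```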


2.4.4.1.1 Structural Hazards

In the following, a modeling of the instructions in a pipeline using reservation tables will be described [12,32]. To determine at which time after the start of an instruction the execution of a new instruction can start without causing a collision, these reservation tables have to be analyzed. One possibility to determine whether two instructions can be started at a distance of K time units is to overlap the reservation table with itself using an offset of K time units.

If a used resource is overlapped by another, then there will be a collision in this segment and K is a forbidden latency. Otherwise, no collision will occur and K is an allowed latency.
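This overlap test can be sketched directly. The boolean table representation is an assumption of this sketch (the chapter's tables carry instruction numbers), but the collision rule is the one just described:

```cpp
#include <vector>

// Rows = pipeline stages, columns = time slots; true = slot occupied.
using ReservationTable = std::vector<std::vector<bool>>;

// K is a forbidden latency if starting an identical instruction K cycles
// later would claim a stage/time slot the first instruction still occupies,
// i.e., if the table overlapped with itself at offset K has a collision.
bool is_forbidden_latency(const ReservationTable& t, int k) {
    for (std::size_t s = 0; s < t.size(); ++s) {
        const int cols = static_cast<int>(t[s].size());
        for (int c = 0; c + k < cols; ++c)
            if (t[s][c] && t[s][c + k])   // original and shifted copy collide
                return true;
    }
    return false;
}
```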

2.4.4.1.2 Data Hazards

The time delay caused by data hazards is modeled in the same way as the delay caused by structural hazards. As the result of the pipelining of an instruction sequence should be the same as the result of sequentially executed instructions, register accesses should be in the same order as they are in the program. This restriction is comparable with the usage of pipeline stages in the order they are in the program, and, therefore, it can be modeled by an extension of the reservation table.

2.4.4.1.3 Control Hazards

Some processors (like the MIPS R3000 [12]) use delayed branches to avoid the waiting cycle that would otherwise occur because of the control hazard. This can be modeled by adding a delay slot to the basic block with the branch instruction. Such a modeling is possible because the instruction in the delay slot is executed regardless of the result of the branch instruction.

2.4.4.2 Calculation of Pipeline Overlapping

In order to be able to model the impact of architectural components such as pipelines, the state of these components has to be known when the basic block is entered. If the state is known, then it is possible to find out the gain that results from the use of this component.

If it is known that, in the control-flow graph of the program, node e_i is the predecessor of node e_j, and the pipeline state after the execution of node e_i is also known, then the information about this state can be used to calculate the execution time of node e_j. This means the gain resulting from the fact that node e_i is executed before node e_j can be calculated.

The gain will be calculated for every pair of succeeding basic blocks using the pipeline overlapping. This pipeline overlapping is determined using reservation tables [29]. Appending a reservation table of a basic block to a reservation table of another basic block works the same way as appending an instruction to this reservation table. Therefore, it is sufficient to consider only the first and the last columns. The maximum number of columns that have to be considered does not have to be larger than the maximum number of cycles for which a single instruction can stay in the pipeline [21].
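A sketch of this pairwise gain computation follows, under the simplifying assumptions that both tables list the same pipeline stages in the same order and that the gain is the largest collision-free overlap of the two tables:

```cpp
#include <algorithm>
#include <vector>

using ReservationTable = std::vector<std::vector<bool>>;

// Largest number of cycles by which block b's reservation table can slide
// left over the tail of block a's table without any stage/time slot being
// claimed twice. The combined execution then takes (width_a + width_b - gain)
// cycles instead of width_a + width_b.
int overlap_gain(const ReservationTable& a, const ReservationTable& b) {
    const int wa = a.empty() ? 0 : static_cast<int>(a[0].size());
    const int wb = b.empty() ? 0 : static_cast<int>(b[0].size());
    int best = 0;
    for (int g = 1; g <= std::min(wa, wb); ++g) {
        bool collision = false;
        for (std::size_t s = 0; s < a.size() && !collision; ++s)
            for (int c = 0; c < g; ++c)                 // overlapping columns
                if (a[s][wa - g + c] && b[s][c]) { collision = true; break; }
        if (!collision) best = g;
    }
    return best;
}
```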

2.4.5 Dynamic Correction of Cycle Prediction

As previously described, the actual cycle count a processor needs for executing a sequence of instructions cannot be predicted correctly in all cases. This is the case if, for example, a conditional branch at the end of a basic block produces a pipeline flush, or if additional delays occur because of cache misses in instruction caches. The combination of static analysis and dynamic execution provides a well-suited solution for this problem, since statically unpredictable effects of branch and cache behaviors can be determined during execution. This is done by inserting appropriate function calls into the translated basic blocks. These calls interact with the architectural model in order to determine the additional number of cycles caused by mispredicted branches and cache behaviors. At the end of each basic block, the generation of previously calculated cycles (static cycles plus correction cycles) can occur (Figure 2.3).

2.4.5.1 Branch Prediction

Conditional branches have different cycle times depending on four different cases resulting from the combination of predicted and mispredicted branches, as well as taken and non-taken branches. A correctly predicted branch needs fewer cycles for execution than a mispredicted one. Furthermore, additional cycles can be needed if a correctly predicted branch is taken, as the branch target has to be calculated and loaded into the program counter. This problem is solved by implementing a model of the branch prediction and by comparing the predicted branch behavior with the executed branch behavior. If dynamic branch prediction is used, a model of the underlying state machine is implemented and its results are compared with the executed branch behavior. The cycle count of each possible case is calculated and added to the cumulative cycle count before the next basic block is entered.

2.4.5.2 Instruction Cache

Figure 2.3 shows that for the simulation of the instruction cache, every basic block of the translated program has to be divided into several cache analysis blocks. This division is carried out until the tag changes or the basic block ends. After that, a function call to the cache handling model is added. This code uses a cache model to find out possible cache hits or misses.

The cache simulation will be explained in more detail in the next few paragraphs. This explanation will start with a description of the cache model.

2.4.5.3 Cache Model

FIGURE 2.8 [The figure shows the structure of the simulated cache: for each cache set, the lru information, the cache tag, the valid bit (v), and the machine instructions (asm_inst1 ... asm_instn) of the corresponding cache analysis blocks.]

In the cache model, the valid bit, the cache tag, and the least recently used (lru) information (containing the replacement strategy) for each cache set are saved during runtime. The number of cache tags and the corresponding number of valid bits that are needed depend on the associativity of the cache (e.g., for a two-way set associative cache, two sets of tags and valid bits are needed).

2.4.5.4 Cache Analysis Blocks

In the middle of Figure 2.8, the C source code that corresponds to a basic block is divided into several smaller blocks, the so-called cache analysis blocks. These blocks are needed for the consideration of the effects of instruction caches. Each one of these blocks contains the part of a basic block that fits into a single cache line.

As every machine language instruction in such a cache analysis block has the same tag and the same cache index, the addresses of the instructions can be used to determine how a basic block has to be divided into cache analysis blocks. This is because each address consists of the tag information and the cache index.

The cache index information (iStart to iEnd in Figure 2.3) is used to determine at which cache position the instruction with this address is cached. The tag information is used to determine which address was cached, as there can be multiple addresses with the same cache index. Therefore, a changed cache tag can be easily determined during the traversal of the binary code with respect to the cache parameters. The block offset information is not needed for the cache simulation, as no real caching of data takes place.
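The address decomposition and the resulting division into cache analysis blocks can be sketched as follows. The cache geometry (32-byte lines, 128 sets) and the function names are assumptions for illustration, not the chapter's parameters:

```cpp
#include <cstdint>
#include <vector>

// Assumed cache geometry: 32-byte lines (5 offset bits), 128 sets (7 index bits).
constexpr std::uint32_t kOffsetBits = 5;
constexpr std::uint32_t kIndexBits  = 7;

std::uint32_t cache_index(std::uint32_t addr) {
    return (addr >> kOffsetBits) & ((1u << kIndexBits) - 1);
}
std::uint32_t cache_tag(std::uint32_t addr) {
    return addr >> (kOffsetBits + kIndexBits);
}

// Split the instruction addresses of one basic block into cache analysis
// blocks: a new block starts whenever the cache line (and hence the tag or
// index) of the instruction address changes.
std::vector<std::vector<std::uint32_t>>
split_cache_analysis_blocks(const std::vector<std::uint32_t>& addrs) {
    std::vector<std::vector<std::uint32_t>> blocks;
    for (std::uint32_t a : addrs) {
        if (blocks.empty() ||
            (blocks.back().back() >> kOffsetBits) != (a >> kOffsetBits))
            blocks.push_back({});
        blocks.back().push_back(a);
    }
    return blocks;
}
```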

After the tag has changed or at the end of a basic block, a function call that handles the simulated cache and the calculation of the additional cycles of cache misses is added to this block. More details about this function are described in the next section.

    use lru information to determine tag to overwrite
    write new tag
    set valid bit of written tag
    renew lru information
    return additional cycles needed for cache miss

LISTING 2.1 Function for cache cycle correction.

2.4.5.5 Cycle Calculation Code

As previously mentioned, each cache analysis block is characterized by a combination of tag and cache-set index information. At the end of each basic block, a call to a function is included. During runtime, this function should determine whether the different cache analysis blocks that the basic block consists of are in the simulated cache or not. This way, cache misses are detected.

The function is shown in Listing 2.1. It has the tag and the range of cache-set indices (iStart to iEnd) as parameters.

To find out if there is a cache hit or a cache miss, the function checks whether the tag of each cache analysis block can be found in the specified set and whether the valid bit for the found tag is set.

If the tag can be found and the valid bit is set, the block is already cached (cache hit) and no additional cycles are needed. Only the lru information has to be renewed.

In all other cases, the lru information has to be used to determine which tag has to be overwritten. After that, the new tag has to be written instead of the found old one, and the valid bit for this tag has to be set. The lru information has to be renewed as well. In the final step, the additional cycles are returned and added to the cycle correction counter.


2.4.6 Consideration of Task Switches

In modern embedded systems, software performance simulation has to handle task switching and multiple interrupts. Cooperative task scheduling can already be handled by the previously mentioned approach, since the presented cache model is able to cope with nonpreemptive task switches. Interrupts and preemptive task scheduling can be handled similarly, because task preemption is usually implemented by using software interrupts. Therefore, the incorporation of interrupts is discussed in the following.

Software interrupts had to be included in the SystemC model. This has been achieved by the automatic insertion of dedicated preemption points after cycle calculation. This approach provides an integration of different user-defined task scheduling policies, and a task switch generates a software interrupt. Since the cycle calculation is completed before a task switch is executed and a global cache and branch prediction model is used, no other changes are necessary. A minor deviation of the cycle count for certain processes can occur, because the actual task switch is carried out with a small delay caused by the projection of the task preemption at the binary-code level to the C/C++ source-code level. But, nevertheless, the cumulative cycle count is still correct. The accuracy can be increased by the insertion of the cycle calculation code after each C/C++ statement.

If the additional delay caused by the context switch itself has to be included, the (binary) code of the context switch routine can be treated like any other code.

2.4.7 Preemption of Software Tasks

For the modeling of unconditional time delays, SystemC provides the function wait(sc_time). The call of wait(Δt) by a SystemC thread at the simulation time t suspends the calling thread until the simulation time t + Δt is reached; after that, the thread continues its execution with the following instruction. The time Δt is independent of the number of other tasks active in the system at that time. Therefore, the wait function is suitable for the delay of hardware functionality, as this is inherently parallel. In contrast, software tasks can only be executed if they are allocated to a corresponding execution unit. This means that the execution of a software task will be suspended as soon as the execution unit is withdrawn by the operating system. In order to model the software timing behavior, two functions have to be used. The first function is the delay(int) function, as shown in Listing 2.2. As previously mentioned, this function is used for a fine-granular addition of time. The second one is the consume(sc_time) function that does a coarse-grained consumption of time of the accumulated delays. This function is an extension of the function wait(sc_time) with an appropriate condition as needed. Listing 2.3 shows such a consume(sc_time) function.

If a software task calls the consume function with a time value, T, as a parameter, the function decrements the time only if the calling software task is in the state RUNNING. If the execution unit is withdrawn by the RTOS scheduler by a change of the execution state, the decrementing of the time in the consume function is suspended. When the scheduler changes the state back to RUNNING, the software task can allocate an execution unit again, leading to a continuation of the decrementing of the time that was suspended before.
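This suspend/resume behavior of consume can be illustrated without SystemC by advancing simulation time in explicit unit ticks. The TaskState enum and the tick granularity are assumptions of this sketch, not the chapter's implementation:

```cpp
// Sketch of the consume idea: the accumulated delay of a software task is
// only burned down while the task owns an execution unit (state RUNNING).
enum class TaskState { RUNNING, SUSPENDED };

class Task {
public:
    void set_state(TaskState s) { state_ = s; }
    void consume(int t) { remaining_ += t; }    // accumulated delay to burn
    // One tick of simulation time passes for the whole system.
    void tick() {
        if (state_ == TaskState::RUNNING && remaining_ > 0) --remaining_;
    }
    bool done() const { return remaining_ == 0; }
private:
    TaskState state_ = TaskState::RUNNING;
    int remaining_ = 0;
};
```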

2.5 Experimental Results

In order to test the execution speed and the accuracy of the translated code, a few examples were compiled using a C compiler into object code for the Infineon TriCore processor [15]. This object code was also used to generate annotated SystemC code from the C code, as described in Section 2.4.1. As a reference, the execution speed and the cycle count of the TriCore code have been measured on a TriCore TC10GP evaluation board and on a TriCore ISS [16].

The examples consist of two filters (fir and ellip) and two programs that are part of audio-decoding routines (dpcm and subband).

FIGURE 2.9 Comparison of speed for the examples dpcm, fir, ellip, and subband on the TriCore evaluation board, annotated SystemC 1, annotated SystemC 2, and the TriCore ISS. (Copyright: ACM. Used with permission.)

Figure 2.9 shows the comparison of the execution speed of the generated code with the execution speed of the TriCore evaluation board and the ISS. The execution speed in this figure is represented in million TriCore instructions per second. The Athlon 64 processor running the SystemC code and the ISS had a clock rate of 2.4 GHz. The TriCore processor of the evaluation board ran at 48 MHz.

Using the annotated SystemC code, two different types of annotations have been used: the first one generates the cycles after the execution of each basic block; the second one adds cycles to a cycle counter after each basic block, and the cycles are only generated when it is necessary (e.g., when communication with the hardware takes place). The second type is much more efficient, as depicted in Figure 2.9.

The execution speed of the TriCore processor ranges from 36.8 to 50.8 million instructions per second, whereas the execution speed of the annotated SystemC model with immediate cycle generation ranges from 3.5 to 5.7 million simulated TriCore instructions per second. This means that the execution speed of the SystemC model is only about ten times slower than the speed of the real processor. The execution speed of the annotated SystemC code with on-demand cycle generation ranges from 11.2 to 149.9 million TriCore instructions per second.

In order to compare the SystemC execution speed with the execution speed of a conventional ISS, the same examples were run using the TriCore ISS. The result was an execution speed ranging from 1.5 to 2.4 million instructions per second. This means our approach delivers an execution speed increase of up to a factor of 91.

A comparison of the number of simulated cycles of the generated SystemC code, using branch prediction and cache simulation, with the number of executed cycles on the TriCore evaluation board is shown in Figure 2.10.

FIGURE 2.10 Comparison of cycle accuracy. (Copyright: ACM. Used with permission.)

The deviation of the cycle counts of the translated programs (with branch prediction and caches included) compared to the measured cycle count from the evaluation board ranges from 4% for the program fir to 7% for the program dpcm. This is in the same range as with a conventional ISS.

2.6 Outlook

As clock frequencies cannot be increased as readily as the number of cores, modern processor architectures exploit multiple cores to satisfy increasing computational demands. The different cores can share architectural resources such as data caches to speed up the access to common data. Therefore, access conflicts and coherency protocols have a potential impact on the runtimes of tasks executing on the cores.

The incorporation of multiple cores is directly supported by our SystemC approach. Parallel tasks can easily be assigned to different cores, and the code instrumentation with cycle information can be carried out independently. However, shared caches can have a significant impact on the number of executed cycles. This can be solved by the inclusion of a shared cache model that executes global cache coherence protocols, such as the MESI protocol. A cycle calculation after each C/C++ statement is strongly recommended here to increase the accuracy.

2.7 Conclusions

This chapter presented a methodology for the SystemC-based performance analysis of embedded systems. To obtain high accuracy with an acceptable runtime, a hybrid approach for a high-performance timing simulation of the embedded software was given. The approach shown was implemented in an automated design flow. The methodology is based on the generation of SystemC code out of the original C code and the back-annotation of the statically determined cycle information into the generated code. Additionally, the impact of data dependencies on the software runtime is handled analytically during simulation. Promising experimental results from the application of the implemented design flow were presented. These results show a high execution performance of the timed embedded software model as well as good accuracy. Furthermore, the created SystemC models representing the timed embedded software could be easily integrated into virtual SystemC prototypes because of the generated TLM interfaces.
