8.1 INTRODUCTION
Begin at a workstation level; for example, use C code on a PC. While code written in assembly (ASM) is processor-specific, C code can readily be ported from one platform to another. However, optimized ASM code runs faster than C and requires less memory space.

Before optimizing, make sure that the code is functional and yields correct results. After optimizing, the code can be so reorganized and resequenced that the optimization process makes it difficult to follow. One needs to realize that if a C-coded algorithm is functional and its execution speed is satisfactory, there is no need to optimize further.
After testing the functionality of your C code, transport it to the C6x platform. A floating-point implementation can be modeled first, then converted to a fixed-point implementation if desired. If the performance of the code is not adequate, use different compiler options to enable software pipelining (discussed later), reduce redundant loops, and so on. If the performance desired is still not achieved, you can use loop unrolling to avoid overhead in branching. This generally improves the execution speed but increases code size. You also can use word-wide optimization by loading/accessing 32-bit word (int) data rather than 16-bit half-word (short) data. You can then process lower and upper 16-bit data independently.
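As a rough, hedged illustration of these last two ideas (this is not code from the book), the following C fragment processes two 16-bit values per iteration, viewing each pair as one 32-bit word and handling the lower and upper halves independently. The function name, argument types, and the word-alignment assumption are made up for this sketch:

    /* Sketch only: word-wide (32-bit) access of packed 16-bit data, with the
       lower and upper halves processed independently. Assumes the short
       arrays hold 2*n values and are aligned on a word (4-byte) boundary.  */
    int dotp_wordwide(const short *a, const short *b, int n)
    {
        const int *ap = (const int *)a;   /* view the shorts as 32-bit words */
        const int *bp = (const int *)b;
        int i, suml = 0, sumh = 0;

        for (i = 0; i < n; i++)           /* one 32-bit load = two shorts    */
        {
            int wa = ap[i], wb = bp[i];
            suml += (short)wa * (short)wb;                  /* lower halves  */
            sumh += (short)(wa >> 16) * (short)(wb >> 16);  /* upper halves  */
        }
        return suml + sumh;               /* combine the two sums at the end */
    }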
If performance is still not satisfactory, you can rewrite the time-critical section of the code in linear assembly, which can be optimized by the assembler optimizer. The profiler can be used to determine the specific function(s) that need to be optimized further.

The final optimization procedure that we discuss is a software pipelining scheme to produce hand-coded ASM instructions [1,2]. It is important to follow the procedure associated with software pipelining to obtain an efficient and optimized code.
8.2 OPTIMIZATION STEPS
If the performance and results of your code are satisfactory after any particular step, you are done.
1. Program in C. Build your project without optimization.
2. Use intrinsic functions when appropriate, as well as the various optimization levels.
3. Use the profiler to determine/identify the function(s) that may need to be optimized further. Then convert these function(s) to linear ASM.
4. Optimize the code in ASM.
8.2.1 Compiler Options
When the optimizer is invoked, the following steps are performed. A C-coded program is first passed through a parser that performs preprocessing functions and generates an intermediate file (.if), which becomes the input to the optimizer. The optimizer generates an .opt file, which becomes the input to a code generator; the code generator performs further optimizations and generates an ASM file.
The options:

1. –o0 optimizes the use of registers.
2. –o1 performs a local optimization in addition to the optimizations performed by the previous option: –o0.
3. –o2 performs a global optimization in addition to the optimizations performed by the previous options: –o0 and –o1.
4. –o3 performs a file optimization in addition to the optimizations performed by the three previous options: –o0, –o1, and –o2.

The options –o2 and –o3 attempt to do software optimization.
8.2.2 Intrinsic C Functions
There are a number of available C intrinsic functions that can be used to increase the efficiency of code (see also Example 3.1):

1. int _mpy() has the equivalent ASM instruction MPY, which multiplies the 16 LSBs of a number by the 16 LSBs of another number.
2. int _mpyh() has the equivalent ASM instruction MPYH, which multiplies the 16 MSBs of a number by the 16 MSBs of another number.
3. int _mpylh() has the equivalent ASM instruction MPYLH, which multiplies the 16 LSBs of a number by the 16 MSBs of another number.
4. int _mpyhl() has the equivalent ASM instruction MPYHL, which multiplies the 16 MSBs of a number by the 16 LSBs of another number.
5. void _nassert(int) generates no code. It tells the compiler that the expression declared with the assert function is true. This conveys information to the compiler about alignment of pointers and arrays and of valid optimization schemes, such as word-wide optimization (see the short example following this list).
6. uint _lo(double) and uint _hi(double) obtain the low and high 32 bits of a double word, respectively (available on the C67x or C64x).
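As a brief, hedged example (not from the book), _nassert is commonly used to tell the compiler that pointers are aligned on a word boundary so that it may safely generate word-wide accesses. The function name and arrays below are hypothetical:

    /* Sketch only: asserting word (4-byte) alignment so the compiler may use
       word-wide loads; requires the TI C6x compiler for _nassert.          */
    int dotp_hint(const short *a, const short *b, int n)
    {
        int i, sum = 0;

        _nassert((int)a % 4 == 0);      /* a[] starts on a word boundary */
        _nassert((int)b % 4 == 0);      /* b[] starts on a word boundary */
        for (i = 0; i < n; i++)
            sum += a[i] * b[i];         /* two 16-bit loads can now be packed
                                           into one 32-bit (LDW) access     */
        return sum;
    }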
8.3 PROCEDURE FOR CODE OPTIMIZATION
1. Use instructions in parallel so that multiple functional units can be operated within the same cycle.
2. Eliminate NOPs or delay slots, placing code where the NOPs are.
3. Unroll the loop to avoid overhead with branching.
4. Use word-wide data to access a 32-bit word (int) in lieu of a 16-bit half-word (short).
5. Use software pipelining, illustrated in Section 8.5.
8.4 PROGRAMMING EXAMPLES USING CODE OPTIMIZATION TECHNIQUES
Several examples are developed to illustrate various techniques to increase the efficiency of code. Optimization using software pipelining is discussed in Section 8.5.

The dot product is used to illustrate the various optimization schemes. The dot product of two arrays can be useful for many DSP algorithms, such as filtering and correlation. The examples that follow assume that each array consists of 200 numbers. Several programming examples using mixed C and ASM code, which provide necessary background, were given in Chapter 3.
Example 8.1: Sum of Products with Word-Wide Data Access for Fixed-Point Implementation Using C Code (twosum)
Figure 8.1 shows the C code twosum.c, which obtains the sum of products of two arrays accessing 32-bit word data. Each array consists of 200 numbers. Separate sums of products of even and odd terms are calculated within the loop. Outside the loop, the final summation of the even and odd terms is obtained.
For a floating-point implementation, the function and the variables sum, suml, and sumh in Figure 8.1 are cast as float, in lieu of int:
float dotp(float a[ ], float b[ ])
{
    . . .
    return (sum);
}
FIGURE 8.1 C code for sum of products using word-wide data access for separate accumulation of even and odd sum of products terms (twosum.c).
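Figure 8.1 is not reproduced in full here. As a hedged sketch only (a reconstruction consistent with the description above, not the book's exact listing), the fixed-point version might look as follows; with separate even/odd accumulators and two products per pass, an optimizing compiler is free to fetch each pair of 16-bit values with a single word-wide (LDW) access:

    //Sketch of the idea in twosum.c (reconstruction, not the original figure)
    int dotp(short a[ ], short b[ ])
    {
        int suml = 0;                   //sum of products of even terms
        int sumh = 0;                   //sum of products of odd terms
        int sum, i;

        for (i = 0; i < 200; i += 2)    //two products per iteration
        {
            suml += a[i] * b[i];        //even-indexed terms
            sumh += a[i+1] * b[i+1];    //odd-indexed terms
        }
        sum = suml + sumh;              //final summation outside the loop
        return (sum);
    }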
Example 8.2: Separate Sum of Products with C Intrinsic Functions Using C Code (dotpintrinsic)
Figure 8.2 shows the C code dotpintrinsic.c to illustrate the separate sum of products using two C intrinsic functions, _mpy and _mpyh, which have the equivalent ASM instructions MPY and MPYH, respectively. Whereas the even and odd sums of products are calculated within the loop, the final summation is taken outside the loop and returned to the calling function.
Example 8.3: Sum of Products with Word-Wide Access for Fixed-Point Implementation Using Linear ASM Code (twosumlasmfix.sa)
Figure 8.3 shows the linear ASM code twosumlasmfix.sa, which obtains two separate sums of products for a fixed-point implementation using linear ASM code. It is not necessary to specify either the functional units or NOPs. Furthermore, symbolic names can be used for registers. The LDW instruction is used to load a 32-bit word-wide data value (which must be word-aligned in memory when using LDW). Lower and upper 16-bit products are calculated separately. The two ADD instructions accumulate separately the even and odd sums of products.
//dotpintrinsic.c Sum of products with C intrinsic functions using C

int dotp(int a[ ], int b[ ])            //wrapper implied by the text above
{
    int suml = 0, sumh = 0, i;

    for (i = 0; i < 100; i++)
    {
        suml = suml + _mpy(a[i], b[i]);     //lower 16-bit products
        sumh = sumh + _mpyh(a[i], b[i]);    //upper 16-bit products
    }
    return (suml + sumh);               //final summation outside the loop
}
FIGURE 8.2 Separate sum of products using C intrinsic functions (dotpintrinsic.c).
;twosumlasmfix.sa Sum of Products. Separate accum of even/odd terms
;With word-wide data for fixed-point implementation using linear ASM

         ADD    prodl, suml, suml   ;accum even terms
         ADD    prodh, sumh, sumh   ;accum odd terms
FIGURE 8.3 Separate sum of products using linear ASM code for fixed-point implementation (twosumlasmfix.sa).
Example 8.4: Sum of Products with Double-Word Load for Floating-Point Implementation Using Linear ASM Code (twosumlasmfloat)
Figure 8.4 shows the linear ASM code twosumlasmfloat.sa to obtain two separate sums of products for a floating-point implementation using linear ASM code. The double-word load instruction LDDW loads a 64-bit data value and stores it in a pair of registers. Each single-precision multiply instruction MPYSP performs a 32 × 32 multiplication. The sums of products of the lower and upper 32 bits are performed to yield a sum of both even and odd terms as 32 bits.
Example 8.5: Dot Product with No Parallel Instructions for Fixed-Point Implementation Using ASM Code (dotpnp)
Figure 8.5 shows the ASM code dotpnp.asm for the dot product with no instructions in parallel for a fixed-point implementation. A fixed-point implementation can
;twosumlasmfloat.sa Sum of products. Separate accum of even/odd terms
;Using double-word load LDDW for floating-point implementation

         LDDW   *bptr++, bi1:bi0    ;64-bit word bi0 and bi1
         MPYSP  ai0, bi0, prodl     ;lower 32-bit product
         MPYSP  ai1, bi1, prodh     ;higher 32-bit product
         ADDSP  prodl, suml, suml   ;accum 32-bit even terms
         ADDSP  prodh, sumh, sumh   ;accum 32-bit odd terms
FIGURE 8.4 Separate sum of products with LDDW using linear ASM code for floating-point implementation (twosumlasmfloat.sa).
;dotpnp.asm ASM Code with no-parallel instructions for fixed-point
FIGURE 8.5 ASM code with no parallel instructions for fixed-point implementation
(dotpnp.asm).
be performed with all C6x devices, whereas a floating-point implementation requires a C67x platform such as the C6711 DSK.

The loop iterates 200 times. With a fixed-point implementation, each pointer register, A4 and A8, increments to point at the next half-word (16 bits) in each buffer, whereas with a floating-point implementation, a pointer register increments the pointer to the next 32-bit word. The load, multiply, and branch instructions must use the D, M, and S units, respectively; the add and subtract instructions can use any unit (except M). The instructions within the loop consume 16 cycles per iteration. This yields 16 × 200 = 3200 cycles. Table 8.4 shows a summary of several optimization schemes for both fixed- and floating-point implementations.
Example 8.6: Dot Product with Parallel Instructions for Fixed-Point Implementation Using ASM Code (dotpp)
Figure 8.6 shows the ASM code dotpp.asm for the dot product with a fixed-point implementation with instructions in parallel. With code in lieu of NOPs, the number of cycles per iteration is reduced, since useful instructions occupy what would otherwise be delay slots (see Table 8.4 for a summary of cycle counts).
Example 8.7: Two Sums of Products with Word-Wide (32-bit) Data for Fixed-Point Implementation Using ASM Code (twosumfix)
Figure 8.7 shows the ASM code twosumfix.asm, which calculates two separate sums of products using word-wide access of data for a fixed-point implementation. The loop count is initialized to 100 (not 200), since two sums of products are obtained
;dotpp.asm ASM Code with parallel instructions for fixed-point
;branch occurs here
FIGURE 8.6 ASM code with parallel instructions for fixed-point implementation
(dotpp.asm).
per iteration. The instruction LDW loads a word or 32-bit data. The multiply instruction MPY finds the product of the lower 16 × 16 data, and MPYH finds the product of the upper 16 × 16 data. The two ADD instructions accumulate separately the even and odd sums of products. Note that an additional ADD instruction is needed outside the loop to accumulate A7 and B7. The instructions within the loop consume eight cycles, now using 100 iterations (not 200), to yield 8 × 100 = 800 cycles.

Example 8.8: Dot Product with No Parallel Instructions for Floating-Point Implementation Using ASM Code (dotpnpfloat)
Figure 8.8 shows the ASM code dotpnpfloat.asm for the dot product with a floating-point implementation using no instructions in parallel. The loop iterates 200 times. The single-precision floating-point instruction MPYSP performs a 32 × 32 multiply. Each MPYSP and ADDSP requires three delay slots. The instructions within the loop consume a total of 18 cycles per iteration (without including three NOPs associated with ADDSP). This yields a total of 18 × 200 = 3600 cycles. (See Table 8.4 for a summary of several optimization schemes for both fixed- and floating-point implementations.)
Example 8.9: Dot Product with Parallel Instructions for Floating-Point Implementation Using ASM Code (dotppfloat)
Figure 8.9 shows the ASM code dotppfloat.asm for the dot product with a floating-point implementation using instructions in parallel. The loop iterates 200
;twosumfix.asm ASM code for two sums of products with word-wide data
;for fixed-point implementation
         MPY    .M1x  A2,B2,A6      ;lower 16-bit product in A6
;branch occurs here
FIGURE 8.7 ASM code for two sums of products with 32-bit data for fixed-point implementation (twosumfix.asm).
times. By moving the SUB and B instructions up to take the place of some NOPs, the number of instructions within the loop is reduced to 10. Note that three additional NOPs would be needed outside the loop to retrieve the result from ADDSP. The instructions within the loop consume a total of 10 cycles per iteration. This yields a total of 10 × 200 = 2000 cycles.
Example 8.10: Two Sums of Products with Double-Word-Wide (64-bit) Data for Floating-Point Implementation Using ASM Code (twosumfloat)
Figure 8.10 shows the ASM code twosumfloat.asm, which calculates two separate sums of products using double-word-wide access of 64-bit data for a floating-point implementation. The loop count is initialized to 100, since two sums of products are
;dotpnpfloat.asm ASM with no parallel instructions for floating-point
FIGURE 8.8 ASM code with no parallel instructions for floating-point implementation
(dotpnpfloat.asm).
;dotppfloat.asm ASM Code with parallel instructions for floating-point
;branch occurs here
FIGURE 8.9 ASM code with parallel instructions for floating-point implementation
(dotppfloat.asm).
obtained per iteration. The instruction LDDW loads a 64-bit double-word data value into a register pair. The multiply instruction MPYSP performs a 32 × 32 multiply. The two ADDSP instructions accumulate separately the even and odd sums of products. The additional ADDSP instruction is needed outside the loop to accumulate A7 and B7. The instructions within the loop consume a total of 10 cycles, using 100 iterations (not 200), to yield a total of 10 × 100 = 1000 cycles.
8.5 SOFTWARE PIPELINING FOR CODE OPTIMIZATION
Software pipelining is a scheme to write efficient code in ASM so that all the functional units are utilized within one cycle. Optimization levels –o2 and –o3 cause the code generator to produce (or attempt to produce) software-pipelined code. There are three stages associated with software pipelining (a plain-C sketch of these stages follows the list):

1. Prolog (warm-up). This stage contains the instructions needed to build up the loop kernel (cycle).
2. Loop kernel (cycle). Within this loop, all instructions are executed in parallel. The entire loop kernel is executed in one cycle, since all the instructions within the loop kernel stage are in parallel.
3. Epilog (cool-off). This stage contains the instructions necessary to complete all iterations.
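As a rough illustration only (not the book's method, and without the instruction-level parallelism of real C6x code), the following plain-C dot product shows the same prolog/kernel/epilog structure: the prolog starts the first product, the kernel overlaps the accumulate of one iteration with the multiply of the next, and the epilog drains the last product. The function name and argument list are made up for this sketch:

    /* Sketch: prolog/kernel/epilog structure of a software-pipelined loop,
       expressed in plain C for illustration (real pipelining is done in ASM). */
    int dotp_sw_pipelined(const short *a, const short *b, int n)
    {
        int i, sum = 0;
        int prod = a[0] * b[0];      /* prolog: first product started          */

        for (i = 1; i < n; i++)      /* kernel: each pass both accumulates the */
        {                            /* previous product and forms the next one */
            sum += prod;
            prod = a[i] * b[i];
        }
        sum += prod;                 /* epilog: accumulate the final product   */
        return sum;
    }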
;twosumfloat.asm ASM Code for two sums of products for floating-point
LOOP     LDDW   .D1   *A4++,A3:A2   ;64-bit into register pair A2,A3
      || LDDW   .D2   *B4++,B3:B2   ;64-bit into register pair B2,B3
         MPYSP  .M1x  A2,B2,A6      ;lower 32-bit product in A6
         ADDSP  .L1   A6,A7,A7      ;accum even terms in A7
                                    ;branch occurs here
         ADDSP  .L1x  A7,B7,A4      ;final sum of even and odd terms
FIGURE 8.10 ASM code with two sums of products for floating-point implementation
(twosumfloat.asm).