Papamichalis, P. “Introduction to the TMS320 Family of Digital Signal Processors”
Digital Signal Processing Handbook
Ed. Vijay K. Madisetti and Douglas B. Williams
Boca Raton: CRC Press LLC, 1999
Introduction to the TMS320 Family
of Digital Signal Processors
Panos Papamichalis
Texas Instruments
77.1 Introduction
77.2 Fixed-Point Devices: TMS320C25 Architecture and Fundamental Features
77.11 Multiplier and ALU of the TMS320C30
77.12 Other Architectural Features of the TMS320C30
77.13 TMS320C30 Instruction Set
77.14 Other Generations and Devices in the TMS320 Family
References
This article discusses the architecture and the hardware characteristics of the TMS320 family of Digital Signal Processors. The TMS320 family includes several generations of programmable processors with several devices in each generation. Since the programmable processors are split between fixed-point and floating-point devices, both categories are examined in some detail. The TMS320C25 serves here as a simple example for the fixed-point processor family, while the TMS320C30 is used for the floating-point family.
77.1 Introduction
Since its introduction in 1982 with the TMS32010 processor, the TMS320 family of DSPs has been exceedingly popular. Different members of this family were introduced to address the existing needs for real-time processing, but then designers capitalized on the features of the devices to create solutions and products in ways never imagined before. In turn, these innovations fed the architectural and hardware configurations of newer generations of devices.
Digital Signal Processing encompasses a variety of applications, such as digital filtering, speech and audio processing, image and video processing, and control. All DSP applications share some common characteristics:
• The algorithms used are mathematically intensive. A typical example is the computation of an FIR filter, implemented as a sum of products. This operation involves a lot of multiplications combined with additions.
• DSP algorithms must typically run in real time: i.e., the processing of a segment of the arriving signal must be completed before the next segment arrives, or else data will be lost.
• DSP techniques are under constant development. This implies that DSP systems should be flexible to support changes and improvements in the state of the art. As a result, programmable processors have been the preferred way of implementation. In recent times, though, fixed-function devices have also been introduced to address high-volume consumer applications with low-cost requirements.
These needs are addressed in the TMS320 family of DSPs by using appropriate architecture, instruction sets, I/O capabilities, as well as the raw speed of the devices. However, it should be kept in mind that these features do not cover all the aspects describing a DSP device, and especially a programmable one. Availability and quality of software and hardware development tools (such as compilers, assemblers, linkers, simulators, hardware emulators, and development systems), application notes, third-party products and support, hot-line support, etc. play an important role in how easy it will be to develop an application on the DSP processor. The TMS320 family has very extensive such support, but its description goes beyond the scope of this article. The interested reader should contact the TI DSP hotline (Tel. 713-274-2320).
For the purposes of this article, two devices have been selected to be highlighted from the Texas Instruments TMS320 family of digital signal processors. One is the TMS320C25, a 16-bit, fixed-point DSP, and the other is the TMS320C30, a 32-bit, floating-point DSP. As a short-hand notation, they will be called ‘C25 and ‘C30, respectively. The choice was made so that both fixed-point and floating-point issues are considered.
There have been newer (and more sophisticated) generations added to the TMS320 family but, since the objective of this article is to be more tutorial, they will be discussed as extensions of the ‘C25 and the ‘C30. Such examples are other members of the ‘C2x and the ‘C3x generations, as well as the TMS320C5x generation (‘C5x for short) of fixed-point devices, and the TMS320C4x (‘C4x) of floating-point devices. Customizable and fixed-function extensions of this family of processors will also be discussed.
Texas Instruments, like all vendors of DSP devices, publishes detailed User’s Guides that explain at great length the features and the operation of the devices. Each of these User’s Guides is a thick book, so it is not possible (or desirable) to repeat all this information here. Instead, the objective of this article is to give an overview of the basic features for each device. If more detail is necessary for an application, the reader is expected to refer to the User’s Guides, which are easy to obtain from Texas Instruments.
77.2 Fixed-Point Devices: TMS320C25 Architecture and
Fundamental Features
The Texas Instruments TMS320C25 is a fast, 16-bit, fixed-point digital signal processor. The speed of the device is 10 MHz, which corresponds to a cycle time of 100 ns. Since the majority of the instructions execute in a single cycle, the figure of 100 ns also indicates how long it takes to execute one instruction. Alternatively, we can say that the device can execute 10 million instructions per second (MIPS). The actual signal from the external oscillator or crystal has a frequency four times higher, at 40 MHz. This frequency is then divided on-chip to generate the internal clock with a period of 100 ns. Figure 77.1 shows the relationship between the input clock CLKIN from the external oscillator, and the output clock CLKOUT. CLKOUT is the same as the clock of the device, and it is related to CLKIN by the equation CLKOUT = CLKIN/4. Note that in Fig. 77.1 the shape of the signal is idealized, ignoring rise and fall times.
FIGURE 77.1: Clock timing of the TMS320C25. CLKIN = external oscillator; CLKOUT = clock of the device.
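The clock relationships above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not TI code; the function name clock_summary is hypothetical, and the 50 MHz input used for the faster version is an assumption that the same divide-by-four applies to it.

```python
def clock_summary(clkin_hz, divider=4):
    """Derive CLKOUT, cycle time, and peak MIPS from the CLKIN frequency."""
    clkout_hz = clkin_hz / divider   # CLKOUT = CLKIN / 4, as stated in the text
    cycle_ns = 1e9 / clkout_hz       # duration of one machine cycle in ns
    peak_mips = clkout_hz / 1e6      # assumes one instruction per machine cycle
    return clkout_hz, cycle_ns, peak_mips

# 40 MHz is the input clock quoted in the text; 50 MHz is an assumed input
# that would yield the 80 ns cycle time of the faster version.
for clkin in (40e6, 50e6):
    clkout, cycle, mips = clock_summary(clkin)
    print(f"CLKIN={clkin / 1e6:.0f} MHz  CLKOUT={clkout / 1e6:.1f} MHz  "
          f"cycle={cycle:.0f} ns  peak={mips:.1f} MIPS")
```

The peak MIPS figure is an upper bound, since it holds only while every instruction executes in a single cycle.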
Newer versions of the TMS320C25 operate at higher frequencies. For instance, there is a spinoff that has a cycle time of 80 ns, resulting in a 12.5 MIPS operation. There are also slower (and cheaper) versions for applications that do not need this computational power.
Figure 77.2 shows in a simplified form the key features of the TMS320C25. The major parts of the DSP processor are the memory, the Central Processing Unit (CPU), the ports, and the peripherals. Each of these parts will be examined in more detail later. The on-chip memory consists of 544 words of RAM (read/write memory) and 4K words of ROM (read-only memory). In the notation used here, 1K = 1024 words, and 4K = 4 × 1024 = 4096 words. Each word is 16 bits wide and, when some memory size is given, it is measured in 16-bit words, and not in bytes (as is the custom in microprocessors). Of the 544 words of RAM, 256 words can be used as either program or data memory, while the rest is only data memory. All 4K of on-chip ROM is program memory. Overall, the device can address 64K words of data memory and 64K words of program memory. Except for what resides on-chip, the rest of the memory is external, supplied by the designer.
The CPU is the heart of the processor. Its most important feature, distinguishing it from the traditional microprocessors, is a hardware multiplier that is capable of performing a 16 × 16 bit multiplication in a single cycle. To preserve higher intermediate accuracy of results, the full 32-bit product is saved in a product register. The other important part of the CPU is the Arithmetic Logic Unit (ALU) that performs additions, subtractions, and logical operations. Again, for increased intermediate accuracy, there is a 32-bit accumulator to handle all the ALU operations.
All the arithmetic and logical functions are accumulator-based. In other words, these operations have two operands, one of which is always the accumulator. The result of the operation is stored in the accumulator.
Because of this approach, the form of the instructions is very simple, indicating only what the other operand is. This architectural philosophy is very popular but it is not universal. For instance, as is discussed later, the TMS320C30 takes a different approach, where there are several “accumulators” in what is called a register file.
Other components of the TMS320C25 CPU are several shifters to facilitate manipulation of the data and increase the throughput of the device by performing shifting operations in parallel with other functions. As part of the CPU, there are also eight auxiliary registers that can be used as memory pointers or loop counters. There are two status registers, and an 8-deep hardware stack. The stack is used to store the memory address where the program will continue execution after a temporary diversion to a subroutine.
FIGURE 77.2: Key architectural features of the TMS320C25.
To communicate with external devices, the TMS320C25 has 16 input and 16 output parallel ports. It also has a serial port that can serve the same purpose. The serial port is one of the peripherals that have been implemented on chip. Other peripherals include the interrupt mask, the global memory capability, and a timer. The above components of the TMS320C25 are examined in more detail below.
The device has 68 pins that are designated to perform certain functions, and to communicate with other devices on the same board. The names of the signals and the corresponding definitions appear in Table 77.1. The first column of the table gives the pin names. Note that a bar over the name indicates that the pin is in the active position when it is electrically low. For instance, if the pins take the voltage levels of 0 V and 5 V, a pin indicated with an overbar is asserted when it is set at 0 V. Otherwise, assertion occurs at 5 V. The second column indicates if the pin is used for input to the device or output from the device or both. The third column gives a description of the pin functionality.
Understanding the functionality of the device pins is as important as understanding the internal architecture because it provides the designer with the tools available to communicate with the external world. The DSP device needs to receive data and, often, instructions from external sources, and send the results back to the external world. Depending on the paths available for such transactions, the design of a program can take very different forms. Within this framework, it is up to the designer to generate implementations that are ingenious and elegant.
The TMS320C25 is programmed in its own assembly language. This assembly language consists of 133 instructions that perform general-purpose and DSP-specific functions. Familiarity with the instruction set and the device architecture are the two components of efficient program implementation. High-level-language compilers have also been developed that make the writing of programs an easier task. For the TMS320C25, there is a C compiler available. However, there is always a loss of efficiency when programming in high-level languages, and this may not be acceptable in computation-bound real-time systems. Besides, for complete understanding of the device it is necessary to consider the assembly language.
TABLE 77.1 Names and Functionality of the 68 pins of the TMS320C25
Signals       I/O/Z (a)   Definition
VCC           I           5-V supply pins
VSS           I           Ground pins
X1            O           Output from internal oscillator for crystal
X2/CLKIN      I           Input to internal oscillator from crystal or external clock
CLKOUT1       O           Master clock output (crystal or CLKIN frequency/4)
CLKOUT2       O           A second clock output signal
D15-D0        I/O/Z       16-bit data bus D15 (MSB) through D0 (LSB). Multiplexed between program, data, and I/O spaces.
A15-A0        O/Z         16-bit address bus A15 (MSB) through A0 (LSB)
PS, DS, IS    O/Z         Program, data, and I/O space select signals
R/W           O/Z         Read/write signal
STRB          O/Z         Strobe signal
INT2-INT0     I           External user interrupt inputs
MP/MC         I           Microprocessor/microcomputer mode select pin
MSC           O           Microstate complete signal
IACK          O           Interrupt acknowledge signal
READY         I           Data ready input. Asserted by external logic when using slower devices to indicate that the current bus transaction is complete.
BR            O           Bus request signal. Asserted when the TMS320C25 requires access to an external global data memory space.
XF            O           External flag output (latched software-programmable signal)
HOLD          I           Hold input. When asserted, the TMS320C25 goes into an idle mode and places the data, address, and control lines in the high impedance state.
HOLDA         O           Hold acknowledge signal
SYNC          I           Synchronization input
BIO           I           Branch control input. Polled by the BIOZ instruction.
DR            I           Serial data receive input
CLKR          I           Clock for receive input for serial port
FSR           I           Frame synchronization pulse for receive input
DX            O/Z         Serial data transmit output
CLKX          I           Clock for transmit output for serial port
FSX           I/O/Z       Frame synchronization pulse for transmit. Configurable as either an input or an output.
(a) I/O/Z denotes input/output/high-impedance state.
Note: The first column is the pin name; the second column indicates if it is an input or an output pin; the third column gives a description of the pin functionality.
A very important characteristic of the device is its Harvard architecture. In the Harvard architecture (see Fig. 77.3), the program and data memory spaces are separated and they are accessed by different buses. One bus accesses the program memory space to fetch the instructions, while another bus is used to bring operands from the data memory space and store the results back to memory. The objective of this approach is to increase the throughput by bringing instructions and data in parallel. An alternate philosophy is the von Neumann architecture. The von Neumann architecture (see Fig. 77.4) uses a single bus and a unified memory space. Unification of the memory space is convenient for partitioning it between program and data, but it presents a bottleneck since both data and program instructions must use the same path and, hence, they must be multiplexed. The Harvard architecture of multiple buses is used in digital signal processors because the increased throughput is of paramount importance in real-time systems.
The difference between the architectures is important because it influences the programming style. In the Harvard architecture, two memory locations can have the same address, as long as one of them is in the data space and the other is in the program space. Hence, a programmer who uses an address label must stay alert to which space that label refers to. Another restriction of the Harvard architecture is that the data memory cannot be initialized during loading because loading refers only to placing the program in memory (and the program memory is separate from the data memory). Data memory can be initialized during execution only. The programmer must incorporate such initialization in the program code. As will be seen later, such restrictions have been removed from the TMS320C30 while retaining the convenient feature of multiple buses.
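The two consequences described above (the same address existing in both spaces, and data memory being filled only by code that runs) can be pictured with a toy model. This is a minimal Python sketch; the class HarvardMemory and its method names are hypothetical and only illustrate the idea, not the device's actual loader or memory controller.

```python
class HarvardMemory:
    """Toy model of two independent 64K-word address spaces."""
    SIZE = 64 * 1024

    def __init__(self):
        self.program = [0] * self.SIZE   # reached through the program bus
        self.data = [0] * self.SIZE      # reached through the data bus

    def load_program(self, words, origin=0):
        """Model of loading: only the program space is written."""
        for offset, word in enumerate(words):
            self.program[origin + offset] = word & 0xFFFF

    def init_data(self, table, origin):
        """Stands in for initialization code executed at run time."""
        for offset, word in enumerate(table):
            self.data[origin + offset] = word & 0xFFFF


mem = HarvardMemory()
mem.load_program([0x1234, 0x5678], origin=0x0200)
mem.init_data([0x00AA], origin=0x0200)

# The same numeric address holds different contents in each space.
print(hex(mem.program[0x0200]), hex(mem.data[0x0200]))
```

An address such as 0200h is therefore ambiguous on its own; it becomes meaningful only together with the space it is used in.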
FIGURE 77.3: Simplified block diagram of the Harvard architecture.
FIGURE 77.4: Simplified block diagram of the von Neumann architecture.
Figure 77.5 shows a functional block diagram of the TMS320C25 architecture. The Harvard architecture of the device is immediately apparent from the separate program and data buses. What is not apparent is that the architecture has been modified to permit communication between the two buses. Through such communication, it is possible to transfer data between the program and data memory spaces. Then, the program memory space also can be used to store tables. The transfer takes place by using special instructions such as TBLR (Table Read), TBLW (Table Write), and BLKP (Block transfer from Program memory).
As shown in the block diagram, the program ROM is linked to the program bus, while data RAM blocks B1 and B2 are linked to the data bus. The RAM block B0 can be configured either as program or data memory (using the instructions CNFP and CNFD), and it is multiplexed with both buses. The different segments, such as the multiplier, the ALU, the memories, etc. are examined in more detail below.
77.3 TMS320C25 Memory Organization and Access
Besides the on-chip memory (RAM and ROM), the TMS320C25 can access external memory through the external bus. This bus consists of the 16 address pins A0-A15, and the 16 data pins D0-D15. The address pins carry the address to be accessed, while the data pins carry the instruction word or the operand, depending on whether program or data memory is accessed. The bus can access either program or data memory, the difference indicated by which of the pins PS and DS (with overbars) becomes active. The activation is done automatically when, during the execution, an instruction or a piece of data needs to be fetched. Since the address is 16 bits wide, the maximum memory space is 64K words for program and 64K words for data.
FIGURE 77.6: Memory maps for program and data memory of the TMS320C25.
The device starts execution after a reset signal, i.e., after the RS pin is pulled low for a short period of time. The execution always begins at program memory location 0, where there should be an instruction to direct the program execution to the appropriate location. This direction is accomplished by a branch instruction
B PROG
which loads the program counter with the program memory address that has the label PROG (or any other label you choose). Then, execution continues from the address PROG, where, presumably, a useful program has been placed.
It is clear that the program memory location 0 is very important, and you need to know where it is physically located. The TMS320C25 gives you the flexibility to use as location 0 either the first location of the on-chip ROM, or the first location of the external memory. In the first case, we say that the device operates in the microcomputer mode, while in the second one it is in the microprocessor mode. In the microprocessor mode, the on-chip ROM is ignored altogether. You can choose between the two modes by pulling the device pin MP/MC high or low. The microcomputer mode is useful for production purposes, while for laboratory and development work the microprocessor mode is used exclusively.
Figure 77.6 shows the memory configuration of the TMS320C25, where the microprocessor and microcomputer configurations of the program memory are depicted separately. The data memory is partitioned into 512 sections, called pages, of 128 words each. The reason for the partitioning is for addressing purposes, as will be discussed below. Memory boundaries of the 64K memory space are shown in both decimal and hexadecimal notation (hexadecimal notation indicated by an “h” or “H” at the end). Compare this map with the block diagram in Fig. 77.5.
As mentioned earlier, in two-operand operations, one of the operands resides in the accumulator, and the result is also placed in the accumulator. (The only exception is the multiplication operation examined later.) The other operand can either reside in memory or be part of the instruction. In the latter case, the value to be combined with the accumulator is explicitly specified in the instruction, and this addressing mode is called immediate addressing mode. In the TMS320C25 assembly language, the immediate addressing mode instructions are indicated by a “K” at the end of the instruction. For example, the instruction
ADDK 5
increments the contents of the accumulator by 5.
If the value to be operated upon resides in memory, there are two ways to access it: either by specifying the memory address directly (direct addressing) or by using a register that holds the address of that number (indirect addressing).
As a general rule, it is desirable to describe an instruction as briefly as possible so that the whole description can be held in one 16-bit word. Then, when the program is executed, only one word needs to be fetched before all the information from the instruction is available for execution. This is not always possible and there are two-word instructions as well, but the chip architects always strive to achieve one-word instructions. In the direct addressing mode, full description of a memory address would require a 16-bit word by itself because the memory space is 64K words. To reduce that requirement, the memory space is divided into 512 pages of 128 words each. An instruction using direct addressing contains the 7 bits indicating what word you want to access within a page. The page number (9 bits) is stored in a separate register (actually, part of a register), called the Data Page pointer (DP). You store the page number in the DP by using the instructions LDP (Load Data Page pointer) or LDPK (Load Data Page pointer immediate).
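The page arithmetic described above amounts to concatenating the 9-bit page number with the 7-bit offset carried in the instruction. A minimal Python sketch follows; the function names direct_address and split_address are hypothetical helpers used only for illustration.

```python
PAGE_BITS = 9      # width of the Data Page pointer (512 pages)
OFFSET_BITS = 7    # offset field in the instruction word (128 words per page)


def direct_address(dp, offset):
    """Combine a page number and an in-page offset into a 16-bit data address."""
    if not 0 <= dp < (1 << PAGE_BITS):
        raise ValueError("page number must fit in 9 bits")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset must fit in 7 bits")
    return (dp << OFFSET_BITS) | offset


def split_address(address):
    """Inverse operation: recover (page, offset) from a 16-bit data address."""
    return address >> OFFSET_BITS, address & ((1 << OFFSET_BITS) - 1)


print(hex(direct_address(6, 0)))   # first word of page 6
print(split_address(0xFFFF))       # last word of the last page
```

Because only the 7-bit offset travels with the instruction, a program that touches data on a different page has to reload the DP first, which is why related variables are usually kept on the same page.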
In the indirect addressing mode, the data memory address is held in a register that acts as a memory pointer. There are eight such registers available, called auxiliary registers, AR0-AR7. The auxiliary registers can also be used for other functions, such as loop counters, etc. To save bits in the instruction, the auxiliary register used as memory pointer is not indicated explicitly, but it is stored in a separate register (actually, part of a register), the auxiliary register pointer (ARP). In other words, there is the concept of the “current register”. In an operation using indirect addressing, the contents of the current auxiliary register point to the desired memory location. The current AR is specified by the contents of the ARP as shown in Fig. 77.7. In an instruction, indirect addressing is indicated by an asterisk.
FIGURE 77.7: Example of indirect addressing mode.
A “+” sign at the end of an instruction using indirect addressing means “after the present memory access, increment the contents of the current auxiliary register by 1”. This is done in parallel with the load-accumulator operation. The above autoincrementing of the auxiliary register is an optional operation that offers additional flexibility to the programmer, and it is not the only one available. The TMS320C25 has an auxiliary register arithmetic unit (ARAU, see Fig. 77.5) that can execute such operations in parallel with the CPU, and increase the throughput of the device in this way. Table 77.2 summarizes the different operations that can be done while using indirect addressing. As seen from this table, the contents of an auxiliary register can be incremented or decremented by 1, incremented or decremented by the contents of AR0, and incremented or decremented by AR0 in a bit-reversed fashion. The last operation is useful when doing Fast Fourier Transforms. The bit-reversed addressing is implemented by adding AR0 with reverse carry propagation, an operation explained in the TMS320C25 User’s Guide. Additionally, it is possible to load at the same time the ARP with a new value, thus saving an extra instruction.
TABLE 77.2 Operations That Can Be Performed in Parallel with Indirect Addressing
Notation        Operation
ADD *           No manipulation of AR or ARP
ADD *,Y         Y → ARP
ADD *+          AR(ARP) + 1 → AR(ARP)
ADD *+,Y        AR(ARP) + 1 → AR(ARP); Y → ARP
ADD *-          AR(ARP) - 1 → AR(ARP)
ADD *-,Y        AR(ARP) - 1 → AR(ARP); Y → ARP
ADD *0+         AR(ARP) + AR0 → AR(ARP)
ADD *0+,Y       AR(ARP) + AR0 → AR(ARP); Y → ARP
ADD *0-         AR(ARP) - AR0 → AR(ARP)
ADD *0-,Y       AR(ARP) - AR0 → AR(ARP); Y → ARP
ADD *BR0+       AR(ARP) + rcAR0 → AR(ARP)
ADD *BR0+,Y     AR(ARP) + rcAR0 → AR(ARP); Y → ARP
ADD *BR0-       AR(ARP) - rcAR0 → AR(ARP)
ADD *BR0-,Y     AR(ARP) - rcAR0 → AR(ARP); Y → ARP
Note: Y = 0, ..., 7 is the new “current” AR. AR(ARP) is the AR pointed to by the ARP. BR = bit reversed, rc = reverse carry.
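The reverse-carry addition behind the BR0+ entries of Table 77.2 can be imitated in software by reversing the bits of both operands, adding normally, and reversing the result. The following Python sketch is one way to model it, not the ARAU's actual circuitry; the function names are hypothetical, and it assumes a buffer of N words (N a power of two) that starts at an address that is a multiple of N, with AR0 holding N/2.

```python
def reverse_bits(value, width=16):
    """Return value with its lowest `width` bits in reversed order."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def reverse_carry_add(ar, ar0, width=16):
    """Add ar0 to ar with the carry propagating from MSB toward LSB."""
    mask = (1 << width) - 1
    total = (reverse_bits(ar, width) + reverse_bits(ar0, width)) & mask
    return reverse_bits(total, width)


def bit_reversed_addresses(base, n):
    """Addresses visited by n successive reverse-carry updates with AR0 = n/2.

    Assumes n is a power of two and base is a multiple of n.
    """
    addresses = []
    ar = base
    for _ in range(n):
        addresses.append(ar)
        ar = reverse_carry_add(ar, n // 2)
    return addresses


offsets = [a - 0x0200 for a in bit_reversed_addresses(0x0200, 8)]
print(offsets)
```

For an 8-point buffer the visited offsets follow the bit-reversed order that a radix-2 FFT needs, so no separate reordering pass over the data is required.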
77.4 TMS320C25 Multiplier and ALU
The heart of the TMS320C25 is the CPU consisting, primarily, of the multiplier and the arithmetic logic unit (ALU). The hardware multiplier can perform a 16 bit × 16 bit multiplication in a single machine cycle. This capability is probably the major distinguishing feature of digital signal processors because it permits high throughput in numerically intensive algorithms.
Associated with the multiplier, there are two registers that hold operands and results. The T-register (for temporary register) holds one of the two factors. The other factor comes from a memory location. Again, this construct, with one implied operand residing in the T-register, permits more compact instruction words. When multiplier and multiplicand (two 16-bit words) are multiplied together, the result is 32 bits long. In traditional microprocessors, this product would have been truncated to 16 bits, and presented as the final result. In DSP applications, though, this product is only an intermediate result in a long stream of multiply-adds, and if truncated at this point, too much computational noise would be introduced to the final result. To preserve higher final accuracy, the full 32-bit result is held in the P-register (for product register). This configuration is shown in Fig. 77.8, which depicts the multiplier and the ALU of the TMS320C25.
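The effect of truncating each product before accumulation can be seen with a small numerical experiment. This is a hedged Python sketch that uses plain integers in place of the P-register and accumulator; the function names are hypothetical, and overflow of a real 32-bit accumulator is deliberately not modeled.

```python
def sum_of_products_full(xs, hs):
    """Accumulate full-width products, keep the high 16-bit word at the end."""
    acc = 0
    for x, h in zip(xs, hs):
        acc += x * h              # full 32-bit product goes to the accumulator
    return acc >> 16              # discard the low word once, after the last add


def sum_of_products_truncated(xs, hs):
    """Cut every product down to its high 16-bit word before accumulating."""
    acc = 0
    for x, h in zip(xs, hs):
        acc += (x * h) >> 16      # low word is lost at every single step
    return acc


# 1000 small 16-bit operands: each product is 10000, well below 2**16.
xs = [100] * 1000
hs = [100] * 1000
print(sum_of_products_full(xs, hs), sum_of_products_truncated(xs, hs))
```

With these inputs every individual product vanishes when truncated, while the full-precision sum still yields a nonzero high word; this is the computational noise that the 32-bit P-register and accumulator are meant to avoid.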
Actually, the P-register is viewed as two 16-bit registers concatenated. This viewpoint is convenient if you need to save the product using the instructions SPH (store product high) and SPL (store product low). Otherwise, the product can operate on the accumulator, which is also 32 bits wide. The contents of the product register can be loaded on the accumulator, overwriting whatever was there, using the PAC (product to accumulator) instruction. It can also be added to or subtracted from the accumulator using the instructions APAC or SPAC.
FIGURE 77.8: Diagram of the TMS320C25 multiplier and ALU.
When moving the contents of the P-register to the accumulator, you can shift this number using the built-in shifters. For instance, you can shift the result left by 1 or 4 locations (essentially multiplying it by 2 or 16), or you can shift it right by 6 (essentially dividing it by 64). These operations are done automatically, without spending any extra machine cycles, simply by setting the appropriate product mode with the SPM instruction. Why would you want to do such shifting? The main purpose of the left shifts is to eliminate any extra sign bits that would appear in computations. The right shift scales down the result and permits accumulation of several products before you start worrying about overflowing the accumulator.
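The role of the left shift by 1 is easiest to see with fractional operands. In the sketch below, two 16-bit values are treated as fractions with 15 fractional bits (an interpretation assumed here for illustration), so their product carries two sign bits. The mode names in PRODUCT_SHIFTS are hypothetical labels for the three shifts named in the text, not SPM operand encodings.

```python
# Hypothetical labels for the shifts described in the text.
PRODUCT_SHIFTS = {"none": 0, "left1": 1, "left4": 4, "right6": -6}


def to_signed32(value):
    """Wrap an integer to a signed 32-bit two's-complement value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def shifted_product(a, b, mode):
    """Multiply two signed 16-bit values and apply the selected product shift."""
    product = a * b
    shift = PRODUCT_SHIFTS[mode]
    if shift >= 0:
        product <<= shift
    else:
        product >>= -shift        # arithmetic right shift keeps the sign
    return to_signed32(product)


def high_word(value32):
    """Upper 16 bits of a 32-bit value, as an unsigned 16-bit pattern."""
    return (value32 >> 16) & 0xFFFF


half = 0x4000                     # 0.5 with 15 fractional bits
print(hex(high_word(shifted_product(half, half, "none"))))
print(hex(high_word(shifted_product(half, half, "left1"))))
```

Without the shift, the high word of 0.5 × 0.5 reads as 0.125 when interpreted in the same fractional format; after the left shift by 1 it reads as 0.25, because the redundant sign bit has been removed.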
At this point, it is appropriate to discuss the data formats supported on the TMS320C25. This device, like most fixed-point processors, uses two’s-complement notation to represent negative numbers. In two’s-complement notation, to form the negative of a given number, you take the complement of that number and you add 1. The most significant bit (MSB, the left-most bit) of a positive number is zero, while the MSB of a negative number is one. In the ‘C25, the two’s-complement numbers are sign-extended, which means that, if the absolute value of the number is not large enough to fill all the bits of the word, there will be more than one sign bit.
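The complement-and-add-one rule and the idea of sign extension can be tried out directly. The following is a minimal Python sketch with hypothetical helper names; it works on 16-bit patterns held in ordinary Python integers.

```python
def negate16(word):
    """Two's-complement negation of a 16-bit pattern: complement, then add 1."""
    return (~word + 1) & 0xFFFF


def to_signed16(word):
    """Interpret a 16-bit pattern as a signed value (MSB set means negative)."""
    return word - 0x10000 if word & 0x8000 else word


def sign_extend_16_to_32(word):
    """Copy the sign bit of a 16-bit pattern into the upper 16 bits."""
    return word | 0xFFFF0000 if word & 0x8000 else word


minus_five = negate16(5)
print(f"{minus_five:016b}", to_signed16(minus_five))
print(f"{sign_extend_16_to_32(minus_five):032b}")
```

The printed pattern for -5 starts with a long run of ones: only the lowest of them is needed to mark the sign, and the rest are the extra sign bits mentioned above.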
As seen from Fig. 77.8, the multiplier path is not the only way to access the accumulator. Actually, the ALU and the accumulator support a wealth of arithmetic (ADD, SUB, etc.) and logical (OR, AND, XOR, etc.) instructions, in addition to load and store instructions for the accumulator (LAC, ZALH, SACL, SACH, etc.).
FIGURE 77.9: Partial memory configuration of the TMS320C25 after the CNFD and the CNFP instructions.
An interesting characteristic of the TMS320C25 architecture is the existence of several shifters that can perform such shifts in parallel with other operations. Except for the right shifter at the multiplier, all the other shifters are left shifters. An input shifter to the ALU and the accumulator can shift the input value to the left by up to 16 locations, while output shifters from the accumulator can shift either the high or the low part of the accumulator by up to 7 locations to the left.
A construct that appears very often in mathematical computations is the sum of products. Sums of products appear in the computation of dot products, in matrix multiplication, and in convolution sums for filtering, among other applications. Since it is important to carry out this computation as fast as possible for real-time operation, all digital signal processors have special instructions to speed up this particular function.
The TMS320C25 has the instruction LTA, which loads the T-register and, in parallel with that, adds the previous product (which already resides in the P-register) to the accumulator. LTS subtracts the product from the accumulator. Another instruction, LTD, does the same thing as LTA, but it also moves the value that was just loaded on the T-register to the next higher location in memory. This move realizes the delay line that is needed in filtering applications. LTA, when combined with the MPY instruction, can implement very efficiently the sum of products.
For even higher efficiency, there is a MAC instruction that combines LTA and MPY. An additional MACD instruction combines LTD and MPY. The increased efficiency is achieved by using both the data and the program buses to bring in the operands of the multiplication. The data coming from the data bus can be traced in memory by an AR, using indirect addressing. The data coming from the program bus are traced by the program counter (actually, the pre-fetch counter, PFC) and, hence, they must reside in consecutive locations of program memory. To be able to modify the data and then use it in such multiply-add operations, the TMS320C25 permits reconfiguration of block B0 in the on-chip memory. B0 can be configured either as program or as data memory, as shown in Fig. 77.9, using the CNFD and CNFP instructions.
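The combination of a multiply-add with a data move, as described for LTD and MACD, is what lets an FIR filter update its delay line while it computes. The Python sketch below imitates that pattern in software; the function fir_step and its argument layout are assumptions made for illustration, with delay[0] holding the newest sample and higher indices holding older ones.

```python
def fir_step(delay, coeffs, sample):
    """Compute one FIR output and age the delay line in the same pass.

    delay[k] holds the sample that arrived k steps ago; the list is
    updated in place, mimicking a move to the next higher location.
    """
    delay[0] = sample
    acc = 0
    # Walk from the oldest tap to the newest so that each value can be
    # copied one slot up after it has been used.
    for k in range(len(coeffs) - 1, -1, -1):
        acc += coeffs[k] * delay[k]          # multiply and accumulate
        if k + 1 < len(delay):
            delay[k + 1] = delay[k]          # data move realizing the delay
    return acc


coeffs = [1, 2, 3]
delay = [0, 0, 0]
outputs = [fir_step(delay, coeffs, x) for x in [1, 0, 0, 0]]
print(outputs)
```

Feeding a unit impulse through the filter returns the coefficients in order, which is a quick check that both the accumulation and the data move behave as intended.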
77.5 Other Architectural Features of the TMS320C25
The TMS320C25 has many interesting features and capabilities that can be found in the user’s guide [1]. Here, we present briefly only the most important of them.
The program counter is a 16-bit register, hidden from the user, which contains the address of the next instruction word to be fetched and executed. Occasionally, the program execution may be redirected, for instance, through a subroutine call. In this case, it is necessary to save the contents of the program counter so that the program flow continues from the correct instruction after the completion of the subroutine call. For this purpose, a hardware stack is provided to save and recover the contents of the program counter.
The hardware stack is a set of eight registers, of which only the top one is accessible to the user. Upon a subroutine call, the address after the subroutine call is pushed on the stack, and it is reinstated in the program counter when the execution returns from the subroutine call. The programmer has control over the stack by using the PUSH, PSHD, POP, and POPD instructions. The PUSH and POP operations push the accumulator on the stack or pop the top of the stack to the accumulator, respectively. PSHD and POPD do the same functions but with memory locations instead of the accumulator.
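The call and return mechanism described above can be pictured with a small model of an eight-level stack. This is a Python sketch under stated assumptions, not a description of the silicon: the class and function names are hypothetical, a call instruction is assumed to occupy two words, a push onto a full stack is assumed to drop the oldest entry, and popping an empty stack is treated as an error purely as a modeling choice.

```python
class HardwareStack:
    """Toy model of an 8-deep stack of 16-bit return addresses."""
    DEPTH = 8

    def __init__(self):
        self.levels = []                      # levels[0] is the top of stack

    def push(self, address):
        self.levels.insert(0, address & 0xFFFF)
        del self.levels[self.DEPTH:]          # assumption: oldest entry is lost

    def pop(self):
        if not self.levels:
            raise IndexError("pop from empty stack (modeling choice)")
        return self.levels.pop(0)


def call(pc, target, stack, instruction_words=2):
    """Save the address following the call, then continue at the target."""
    stack.push(pc + instruction_words)        # assumed two-word call instruction
    return target


def ret(stack):
    """Resume at the address saved by the matching call."""
    return stack.pop()


stack = HardwareStack()
pc = call(0x0100, 0x0400, stack)              # enter the subroutine
pc = ret(stack)                               # return to the caller
print(hex(pc))
```

The fixed depth is the practical point: subroutine calls and interrupts nested more than eight levels deep need the return addresses to be saved elsewhere, for example by moving stack entries to data memory.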
Occasionally the program execution in a processor must be interrupted in order to take care of urgent functions, such as receiving data from external sources. In these cases, a special signal goes to the processor, and an interrupt occurs. The interrupts can be internal or external. During an interrupt, the processor stops execution, wherever it may be, pushes the address of the next instruction on the stack, and starts executing from a predetermined location in memory. The interrupt approach is appropriate when there are functions or devices that need immediate attention.
On the TMS320C25, there are several internal and external interrupts, which are prioritized, i.e., when several of the interrupts occur at the same time, the one with the highest priority is executed first. Typically, the memory location where the execution is directed to during an interrupt contains a branch instruction. This branch instruction directs the program execution to an area in the program memory where an interrupt service routine exists. The interrupt service routine will perform the tasks that the interrupt has been designed for, and then return to the execution of the original program.
Besides the external hardware interrupts (for which there are dedicated pins on the device), there are internal interrupts generated by the serial port and the timer. The serial port provides direct communication with serial devices, such as codecs, serial analog-to-digital converters, etc. In these devices, the data are transmitted serially, one bit at a time, and not in parallel, which would require several parallel lines. When 16 bits have been input, the 16-bit word can be retrieved from the register DRR (data receive register). Conversely, to transmit a word, you put it in the DXR (data transmit register). These two registers occupy data memory locations 0 and 1, respectively, and they can be treated like any other memory location.
The timer consists of a period register and a timer register. At the beginning of the operation, the contents of the period register are loaded into the timer register, which is then decremented at every machine cycle. When the value of the timer register reaches zero, it generates a timer interrupt, the period register is loaded again into the timer register, and the whole operation is repeated.
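The reload-and-count-down behavior can be sketched in a few lines of Python. This is a simplified model, not a cycle-accurate one: the names `Timer`, `prd`, and `tim` are illustrative, and the exact machine cycle in which the real device reloads the timer register is an assumption here.

```python
class Timer:
    """Simplified period/timer register pair (not cycle-accurate)."""

    def __init__(self, period):
        if period < 1:
            raise ValueError("period must be at least 1 in this sketch")
        self.prd = period      # period register
        self.tim = period      # timer register, loaded from the period register
        self.interrupts = 0    # number of timer interrupts generated

    def tick(self):
        """Advance one machine cycle; return True if an interrupt fires."""
        self.tim -= 1
        if self.tim == 0:
            self.interrupts += 1
            self.tim = self.prd   # reload and start over
            return True
        return False


def interrupt_cycles(period, cycles):
    """List the cycle numbers (1-based) at which the timer interrupts."""
    timer = Timer(period)
    return [c for c in range(1, cycles + 1) if timer.tick()]
```

With a period of 4, this model interrupts at cycles 4, 8, 12, and so on, so the period register sets the sampling or scheduling interval.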
77.6 TMS320C25 Instruction Set
The TMS320C25 has an instruction set consisting of 133 instructions. Some of these assembly language instructions perform general purpose operations, while others are more specific to DSP applications. This section discusses examples of instructions selected from different groups. For a detailed description of each instruction, the reader is referred to the TMS320C25 User’s Guide [1].
Each instruction is represented by one or two 16-bit words. Part of the instruction is a unique code identifying the operation to be performed, while the rest of the instruction contains information on the operation. For instance, this additional information determines if direct or indirect addressing is used, if there is a shift of the operand, what is the address of the operand, etc. In the case of two-word instructions, the second word is typically a 16-bit constant or program memory address. As should be obvious, a two-word instruction takes longer to execute because it has to fetch two words, and it should be avoided if the same operation could be accomplished with a single-word instruction.
For example, if you want to load the accumulator with the contents of the memory location 3FH, shifting it to the left by 8 bit positions at the same time, you can write the instruction
LAC 3FH,8
Below, some of the more typical instructions are listed, and the ones that have an important interpretation are discussed. It is a good idea to review carefully the full set of instructions so that you know what tools you have available to implement any particular construct. The instructions are grouped here by functionality.
The accumulator and memory reference instructions involve primarily the ALU and the accumulator. Note that there is a symmetry in the instruction set. The addition instructions have counterparts for subtraction, the direct and indirect-addressing instructions have complementary immediate instructions, and so on.
ABS   Absolute value of accumulator
ADD   Add to accumulator with shift
ADDH  Add to high accumulator
ADDK  Add to accumulator short immediate
AND   Logical AND with accumulator
LAC   Load accumulator with shift
SACH  Store high accumulator with shift
SACL  Store low accumulator with shift
SUB   Subtract from accumulator with shift
SUBC  Subtract conditionally
ZAC   Zero accumulator
ZALH  Zero low accumulator and load high accumulator
Operations involving the accumulator have versions affecting both the high part and the low part of the accumulator. This capability gives additional flexibility in scaling, logical operations, and double-precision arithmetic.
For example, let location A contain a 16-bit word that you want to scale down by dividing by 16, and store the result in B. The following instructions perform this operation:
LAC  A,12   ; Load ACC with A shifted left by 12 bit positions
SACH B      ; Store ACCH to B: B = A/16
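To see why a left shift by 12 followed by a store of the high half divides by 16, it may help to model the two instructions in Python. The sketch below assumes a 32-bit accumulator, sign extension of the 16-bit operand when it is loaded, and no additional shift on the store; the function names are illustrative and simply mirror the mnemonics.

```python
def to_signed16(word):
    """Interpret the low 16 bits of word as a two's-complement value."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def lac(word, shift):
    """Model of LAC: sign-extend a 16-bit word, shift left, keep 32 bits."""
    return (to_signed16(word) << shift) & 0xFFFFFFFF


def sach(acc):
    """Model of SACH with no shift: return the upper 16 bits of the accumulator."""
    return (acc >> 16) & 0xFFFF


def scale_down_by_16(a):
    """B = A/16, computed as in the two-instruction sequence LAC A,12 / SACH B."""
    return to_signed16(sach(lac(a, 12)))
```

Shifting left by 12 and then discarding the low 16 bits amounts to a net right shift by 4, so the result is A/16 rounded toward minus infinity in this model.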
The auxiliary registers and data page pointer instructions deal with loading, storing, and modifying the auxiliary registers and the data page pointer. Note that the auxiliary registers and the ARP can also be modified during operations using indirect addressing. Since this last approach has the advantage of making the modifications in parallel with other operations, it is the most common method of AR modification.
LAR   Load auxiliary register
LARP  Load auxiliary register pointer
LDP   Load data memory page pointer
MAR   Modify auxiliary register
SAR   Store auxiliary register
The multiplier instructions are more specific to signal processing applications.
APAC  Add P-register to accumulator
LT    Load T-register
LTD   Load T-register, accumulate previous product, and move data
MAC   Multiply and accumulate
MACD  Multiply and accumulate with data move
MPY   Multiply
MPYK  Multiply immediate
PAC   Load accumulator with P-register
SQRA  Square and accumulate
Note that the instructions that perform multiplication and accumulation at the same time do not accumulate the present product but the result of an earlier multiplication. This result is found in the P-register. The square and accumulate function, SQRA, is a special case of the multiplication that appears often enough to prompt the inclusion of this specific instruction.
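The one-product delay between multiplication and accumulation can be illustrated with a small Python model of a sum of products. It is only a sketch: the variables `t`, `p`, and `acc` stand in for the T-register, the P-register, and the accumulator, the data move performed by LTD and MACD is not modeled, and the function name is hypothetical.

```python
def sum_of_products(coeffs, samples):
    """Return (accumulator before the final add, accumulator after it).

    Each pass adds the PREVIOUS product to the accumulator and then forms
    a new product, so one extra add (the role of APAC) is needed at the end.
    """
    acc = 0   # accumulator
    p = 0     # P-register: product not yet accumulated
    for h, x in zip(coeffs, samples):
        acc += p      # accumulate the product of the previous pass
        t = x         # load the T-register with the current sample
        p = t * h     # multiply: the new product waits in the P-register
    before_final_add = acc
    acc += p          # final add of the last product
    return before_final_add, acc
```

For coefficients [1, 2, 3] and samples [4, 5, 6], the accumulator holds 14 after the loop and 32 only after the final add, which is why a loop of this kind is followed by one more accumulate step.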
The branch instructions correspond to the GOTO instruction of high-level languages. They redirect the flow of the execution either unconditionally or depending on some previous result.
B     Branch unconditionally
BANZ  Branch on auxiliary register non zero
BGEZ  Branch if accumulator >= 0
CALA  Call with subroutine address in the accumulator
CALL  Call subroutine
RET   Return from subroutine
The CALL and RET instructions go together because the first one pushes the return address on the stack, while the second one pops the address from the stack into the program counter. The BANZ instruction is very helpful in loops where an AR is used as a loop counter. BANZ tests the AR, modifies it, and branches to the indicated address if the tested value was not zero.
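A loop closed by BANZ can be sketched in Python as follows. The model assumes that the AR is tested before it is modified, that the modification is a decrement by one, and that the AR is 16 bits wide; the function name and the `body` callback are illustrative.

```python
def run_banz_loop(initial_ar, body):
    """Run body once per pass of a loop that ends with a BANZ-style test.

    Returns the number of passes executed.
    """
    ar = initial_ar & 0xFFFF
    passes = 0
    while True:
        body(passes)                  # the loop body
        passes += 1
        branch = ar != 0              # test the AR first...
        ar = (ar - 1) & 0xFFFF        # ...then modify it (decrement assumed)
        if not branch:
            break                     # AR was zero: fall through
    return passes
```

Under these assumptions, loading the AR with N - 1 yields N passes through the loop, for example 4 passes for an initial value of 3.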
The I/O operations are, probably, among the most important in terms of final system configuration, because they help the device interact with the rest of the world. Two instructions that perform that function are the IN and OUT instructions.
BLKD  Block move from data memory to data memory
IN    Input data from port
OUT   Output data to port
TBLR  Table read
TBLW  Table write
The IN and OUT instructions read from or write to the 16 input and the 16 output ports of the TMS320C25. Any transfer of data goes to or from a specified memory location. The BLKD instruction permits movement of data from one memory location to another without going through the accumulator. To make such a movement efficient, though, it is recommended to use BLKD with a repeat instruction, in which case every data move takes only one cycle.
The TBLR and TBLW instructions represent a modification to the Harvard architecture of the device. Using them, data can be moved between the program and the data spaces. In particular, if any tables have been stored in the program memory space, they can be moved to data memory before they are used. That is how the terminology of the instructions originated.
Some other instructions include:
DINT  Disable interrupts
EINT  Enable interrupts
IDLE  Idle until interrupt
RPT   Repeat instruction as specified by data memory value
RPTK  Repeat instruction as specified by immediate value
77.7 Input/Output Operations of the TMS320C25
During program execution on a digital signal processor, the data is moved between the different memory locations, on-chip and off-chip, as well as between the accumulator and the memory locations. This movement is necessary for the execution of the algorithm that is implemented on the processor. However, there is also a need to communicate with the external world in order to receive data that will be processed, and return the processed results.
Devices communicate with the external world through their external memory or through the serial and parallel ports. Such a communication can be achieved, for instance, by sharing the external memory. Most often, the communication with the external world takes place through the external parallel or serial ports that the device has. Some devices may have ports of only one kind, serial or parallel, but most modern processors have both types. The two kinds of ports differ in the way in which the bits are read. In a parallel port, there is a physical line (and a processor pin) dedicated to every bit of a word. For example, if the processor reads in words that are 16 bits wide, as is the case with the TMS320C25, it has 16 lines available to read a whole word in a single operation. Typically, the same pins that are used for accessing external memory are also used for I/O.
The TMS320C25 has 16 input and 16 output ports that are accessed with the IN and OUT instructions. These instructions transfer data between memory locations and the I/O port specified.
77.8 Subroutines, Interrupts, and Stack on the TMS320C25
When writing a large program, it is advisable to structure it in a modular fashion. Such modularity is achieved by segmenting the program in small, self-contained tasks that are encoded as separate routines. Then, the overall program can be simply a sequence of calls to these subroutines, possibly with some “glue” code. Constructing the program as a sequence of subroutines has the advantage that it produces a much more readable algorithm that can greatly help in debugging and maintaining it. Furthermore, each subroutine can be debugged separately, which is far easier than trying to uncover programming errors in a “spaghetti-code” program.
Typically, the subroutine is called during the program execution with an instruction such as
CALL SUBRTN
where SUBRTN is the address where the subroutine begins. In this example, SUBRTN would be the label of the first instruction of the subroutine. The assembler and the linker resolve what the actual value is. Calling a subroutine has the following effects:
• Increments the program counter (PC) by one and pushes its contents on the top of the stack (TOS). The TOS now contains the address of the instruction to be executed after returning from the subroutine.
• Loads the address SUBRTN on the PC.
• Starts execution from where the PC is pointing at (i.e., from location SUBRTN).
At the end of the subroutine execution, a return instruction (RET) will pop the contents of the top of the stack into the program counter, and the program will continue execution from that location.
The stack is a set of memory locations where you can store data, such as the contents of the PC. The difference from regular memory is that the stack keeps track of the location where the most recent data was stored. This location is the TOS. The stack is implemented either in hardware or software. The TMS320C25 has a hardware stack that is eight locations deep. When a piece of data is put (“pushed”) on the stack, everything already there is moved down by one location. Notice that the contents of the last location (bottom of the stack) are lost. Conversely, when a piece of data is retrieved from the stack (it is “popped”), all the other locations are moved up by one location. Pushing and popping always occur at the top of the stack.
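The behavior of an eight-deep hardware stack, including the loss of the bottom entry on overflow, can be modeled in Python as shown below. This is a sketch under stated assumptions: the class and function names are illustrative, the value that fills the bottom location after a pop (here a copy of the old bottom entry) is an assumption, and `call` takes as its `pc` argument the PC value that the text says is incremented by one before being pushed.

```python
class HardwareStack:
    """Fixed-depth stack in which every push and pop shifts all entries."""

    DEPTH = 8

    def __init__(self):
        self.slots = [0] * self.DEPTH   # slots[0] is the top of the stack (TOS)

    def push(self, value):
        # Everything moves down by one location; the bottom entry is lost.
        self.slots = [value] + self.slots[:-1]

    def pop(self):
        top = self.slots[0]
        # Everything moves up by one location (bottom refill is an assumption).
        self.slots = self.slots[1:] + [self.slots[-1]]
        return top


def call(stack, pc, target):
    """Push the return address (pc + 1) and return the new PC."""
    stack.push(pc + 1)
    return target


def ret(stack):
    """Pop the return address into the PC."""
    return stack.pop()
```

Pushing nine values onto this model silently discards the first one, which is why subroutine calls and interrupts nested deeper than the hardware stack allows cannot return correctly unless the program saves stack entries itself.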
The interrupt is a special case of subroutine. The TMS320C25 supports interrupts generated either internally or from external hardware. An interrupt causes a redirection of the program execution in order to accomplish a task. For instance, data may be present at an input port, and the interrupt forces the processor to go and “service” this port (inputting the data). As another example, an external D/A converter may need a sample from the processor, and it uses an interrupt to indicate to the DSP device that it is ready to receive the data. As a result, when the processor is interrupted, it “knows” by the nature of the interrupt that it has to go and do a specific task, and it does just that.
The designated task is performed by the interrupt service routine (ISR). An ISR is like a subroutine; the only differences are in the way it is accessed and in the functions performed upon return. When an interrupt occurs, the program execution is automatically redirected to a specific memory location associated with that interrupt. As explained earlier, the TMS320C25 continues execution from a specified memory location which, typically, contains a branch instruction to the actual location of the interrupt service routine.
The return from the interrupt service routine, like in a subroutine, pops the top of the stack to the program counter. However, it has the additional effect of re-enabling the interrupts. This is necessary because when an interrupt is serviced, the first thing that happens is that all interrupts are disabled to avoid confusion from additional interrupts. Re-enabling is done explicitly in the TMS320C25 (by using the EINT instruction).
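The sequence of events around an interrupt (choose the highest-priority request, save the return address, disable further interrupts, jump to the vector, and later re-enable and return) can be sketched in Python. Everything named here is an assumption for illustration: the class, the convention that a lower number means higher priority, and the vector addresses used in the usage example are placeholders rather than values taken from the device documentation.

```python
class InterruptController:
    """Toy model of prioritized, maskable interrupts."""

    def __init__(self, vectors):
        # vectors maps an interrupt name to (priority, vector address);
        # a lower priority number wins (a convention chosen for this sketch).
        self.vectors = vectors
        self.pending = set()
        self.enabled = True
        self.stack = []          # return addresses

    def request(self, name):
        self.pending.add(name)

    def step(self, pc):
        """Return the next PC, taking a pending interrupt if allowed."""
        if self.enabled and self.pending:
            name = min(self.pending, key=lambda n: self.vectors[n][0])
            self.pending.remove(name)
            self.stack.append(pc)    # address of the next instruction
            self.enabled = False     # all interrupts are disabled on entry
            return self.vectors[name][1]
        return pc

    def eint(self):
        """Explicitly re-enable interrupts, as the ISR does before returning."""
        self.enabled = True

    def ret(self):
        """Pop the return address into the PC."""
        return self.stack.pop()
```

A usage example: with `InterruptController({"INT0": (1, 2), "TINT": (4, 24)})` and both interrupts requested at once, the first `step` goes to the INT0 vector, and the timer request is taken only after `eint()` has been called.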
77.9 Introduction to the TMS320C30 Digital Signal Processor
The Texas Instruments TMS320C30 is a floating-point processor that has some commonalities with the TMS320C25, but that also has a lot of differences. The differences are due more to the fact that the TMS320C30 is a newer processor than that it is a floating-point processor. The TMS320C30 is a fast, 32-bit, digital signal processor that can handle both fixed-point and floating-point operations. The speed of the device is 16.7 MHz, which corresponds to a cycle time of 60 ns. Since the majority of the instructions execute in a single cycle (after the pipeline is filled), the figure of 60 ns also indicates how long it takes to execute one instruction. Alternatively, we can say that the device can execute 16.7 MIPS. Another figure of merit is based on the fact that the device can perform a floating-point multiplication and addition in a single cycle. Then, it is said that the device has a (maximum) throughput of 33 million floating-point operations per second (MFLOPS).
The actual signal from the external oscillator or crystal has a frequency twice that of the internal device speed, at 33.3 MHz (and period of 30 ns). This frequency is then divided on-chip to generate the internal clock with a period of 60 ns. Newer versions of the TMS320C30 and other members of the ‘C3x generation operate at higher frequencies.
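The speed figures quoted above all follow from the 60-ns cycle time, as the short Python calculation below shows. The function and parameter names are illustrative; the only inputs are the cycle time, the two floating-point operations per cycle, and the divide-by-two clock stated in the text.

```python
def c30_rates(cycle_time_ns=60.0, flops_per_cycle=2, clock_divider=2):
    """Derive the quoted speed figures from the instruction cycle time.

    Returns (MIPS, MFLOPS, external clock in MHz).
    """
    mips = 1e3 / cycle_time_ns               # one instruction per cycle
    mflops = mips * flops_per_cycle          # multiply and add in one cycle
    input_clock_mhz = mips * clock_divider   # oscillator runs at twice the rate
    return mips, mflops, input_clock_mhz
```

For the default values this gives roughly 16.7 MIPS, 33.3 MFLOPS, and a 33.3-MHz input clock; the MFLOPS figure is a peak value that assumes a multiplication and an addition are issued in every cycle.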
Figure 77.10 shows in a simplified form the key features of the TMS320C30. The major parts of the DSP processor are the memory, the CPU, the peripherals, and the direct memory access (DMA) unit. Each of these parts will be examined in more detail later in this article. The on-chip memory consists of 2K words of RAM and 4K words of ROM. There is also a 64-word long program cache. Each word is 32 bits wide, and the memory sizes for the TMS320C30 are measured in 32-bit words, and not in bytes. The memory (RAM or ROM) can be used to store either program instructions or data. This presents a departure from the practice of separating the two spaces that the TMS320C25 uses, combining features of a von Neumann architecture with a Harvard architecture. Overall, the device can address 16 M words of memory through two external buses. Except for what resides on-chip, the rest of the memory is external, supplied by the designer.
The CPU is the heart of the processor. It has a hardware multiplier that is capable of performing a multiplication in a single cycle. The multiplication can be between two 32-bit floating-point numbers, or between two integers. To achieve a higher intermediate accuracy of results, the product of two floating-point numbers is saved as a 40-bit result. In integer multiplication, two 24-bit numbers are