Chapter 10 DSP System Implementation
This chapter is concerned with the implementation of digital signal processing systems. Its purpose is to make the algorithm designer aware of the strong interaction between algorithm design and architecture design.

Digital signal processing systems are an assembly of heterogeneous hardware components. The functionality is implemented in both hardware and software subsystems. A brief overview of DSP hardware technology is given in Section 10.1. Design time and cost become increasingly more important than chip cost.

Hardware-software co-design is examined in Section 10.2. Section 10.3 is devoted to quantization issues. In Sections 10.4 to 10.8 an ASIC (application-specific integrated circuit) design of a fully digital receiver is discussed. We describe the design flow of the project, the receiver structure, and the decision making for its building blocks. The last two sections are bibliographical notes on Viterbi and Reed-Solomon decoders.
10.1 Digital Signal Processing Hardware

Digital signal processing systems are an assembly of heterogeneous subsystems. The functionality is implemented in both hardware and software subsystems. Commonly found hardware blocks are shown in Figure 10-1.
There are two basic types of processors available to the designer: a programmable general-purpose digital signal processor (DSP) or a microprocessor.
Figure 10-1 Hardware Components of a Digital Signal Processing System

A general-purpose DSP is a software-programmable integrated circuit used for speech coding, modulation, channel coding, detection, equalization, and associated modem tasks such as frequency, symbol timing, and phase synchronization, as well as amplitude control. Moreover, a DSP is the preferred choice when flexibility and the ability to add new features with minimum redesign and re-engineering are required. Microprocessors are usually used to implement protocol stacks, system software, and interface software. Microprocessors are better suited to perform the non-repetitive, control-oriented input/output operations as well as all housekeeping chores.
ASICs are used for various purposes. They are utilized for high-throughput tasks in the areas of digital filtering, synchronization, equalization, and channel decoding. An ASIC is also often used to provide the glue logic to interface components. In some systems, the complete digital receiver is implemented as an ASIC coupled with a microcontroller. ASICs have historically been used because of their lower power dissipation per function. In certain applications, like spread-spectrum communications, digital receiver designs require at least partial ASIC solutions in order to execute wideband processing functions such as despreading and code synchronization. This is primarily because the chip-rate processing steps cannot be supported by current general-purpose DSPs.

Over the last few years, as manufacturers have brought to market first- and second-generation digital cellular and cordless solutions, programmable general-purpose digital signal processors have slowly been transformed into "accelerator-assisted DSP-microcontroller" hybrids. This transformation is a result of the severe pressure to reduce power consumption. As firmware solutions become finalized, cycle-hungry portions of algorithms (e.g., equalizers) are "poured into silicon" using various VLSI architectural ideas. This has given rise to, for example, new DSPs with hardware accelerators for Viterbi decoding, vectorized processing, and specialized domain functions. The combination of programmable processor cores with custom data-path accelerators within a single chip offers numerous advantages: performance improvements due to time-critical computations implemented in accelerators, reduced power consumption, faster internal communication between hardware and software, field programmability due to the programmable cores, and lower total system cost due to a single DSP chip solution. Such core-based ASIC solutions are especially attractive for portable applications typically found in digital cellular and cordless telephony, and they are likely to remain the prevailing solution for the foreseeable future.

If a processor is designed by jointly optimizing the architecture, the instruction set, and the programs for the application, one speaks of an application-specific integrated processor (ASIP). The applications may range from a small number of different algorithms to an entire signal processing application. A major drawback of ASIPs is that they require an elaborate support infrastructure which is economically justifiable only in large-volume applications.
The decision to implement an algorithm in software or as a custom data path (accelerator) depends on many issues.

Figure 10-2 Complexity versus Signal Bandwidth Plot

Seen from a purely computational-power point of view, algorithms can be categorized according to two parameters: signal bandwidth and number of operations per sample. The first parameter is a measure of the real-time processing requirement of the algorithm. The second is a measure of the complexity of the algorithm. In Figure 10-2 we have plotted complexity versus bandwidth on a double-logarithmic scale. A straight line in the graph corresponds to a processing device that performs a given number of instructions per second. Applications in the upper-right corner require massive parallel processing and pipelining and are the exclusive domain of ASICs. In contrast, in the lower-left corner the signal bandwidth is much smaller than the clock rate of a VLSI chip. Hence, hardware resources can be shared, and the programmable processor is almost always the preferred choice. For the region between these two extremes, resource sharing is possible using either a processor or an ASIC. There are no purely computational-power arguments in favor of either of the two solutions. The choice depends on other issues such as time-to-market and the capability profile of the design team, to mention two examples.

The rapid advance of microelectronics is illustrated by Figure 10-3. The complexity of VLSI circuits (measured in number of gates) increases tenfold every 6 years. This pattern has been observed for memory components and general-purpose processors over the last 20 years and appears to hold also for the DSP. The performance measured in MOPS (millions of operations per second) is related to the chip clock frequency, which follows a similar pattern. The complexity of software implemented in consumer products increases tenfold every 4 years [1].
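As a rough illustration of the computational-power argument above, the required throughput can be estimated as the product of signal bandwidth (sample rate) and operations per sample, and compared against a processor's capability. All numbers in this sketch are hypothetical.

```python
def required_mops(signal_bandwidth_hz, ops_per_sample):
    """Rough throughput estimate behind Figure 10-2: operations per second
    needed to keep up with the sample stream, expressed in millions (MOPS)."""
    return signal_bandwidth_hz * ops_per_sample / 1e6

# Hypothetical processor capability and workloads, for illustration only
dsp_capability_mops = 400.0
for bw, ops in [(8e3, 500), (1e6, 200), (40e6, 300)]:
    need = required_mops(bw, ops)
    verdict = "programmable DSP feasible" if need < dsp_capability_mops else "ASIC / massive parallelism"
    print(f"{bw/1e6:7.3f} MHz x {ops:4d} ops/sample -> {need:9.1f} MOPS: {verdict}")
```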
Figure 10-3 Complexity of VLSI Circuits
10.2 Hardware-Software Co-Design

The functionality in a DSP system is implemented in both hardware and software subsystems. But even within the software portions there is diversity: control-oriented processes (protocols) have different characteristics than data-flow-oriented processes (e.g., filtering). A DSP system design therefore not only mixes hardware design with software design but also mixes design styles within each of these categories.

One can distinguish between two opposing philosophies for system-level design [2]. One is the unified approach, which seeks a consistent semantics for the specification of the complete system. The other is a heterogeneous approach, which seeks to combine semantically disjoint subsystems. For the foreseeable future the latter appears to be the feasible approach. In the heterogeneous approach, for example, the design automation tool for modeling and analysis of algorithms is tightly coupled with tools for hardware and software implementation. This makes it possible to explore algorithm/architecture trade-offs in a joint optimization process, as will be discussed in the design case study of Section 10.4.
The partitioning of the functionality into hardware and software subsystems is guided by a multitude of (often conflicting) goals. For example, a software implementation is more flexible than a hardware implementation, because changes in the specification are possible in any design phase. On the negative side is the higher power consumption compared to an ASIC solution, which is a key issue in battery-operated terminals. Also, for higher volumes an ASIC is more cost effective.
Design cost and time become increasingly more important than chip processing costs. In many markets product life cycles will be very short. To compete successfully, companies will need to be able to turn system concepts into silicon quickly. This puts a high priority on computerized design methodology and tools in order to increase the productivity of engineering design teams.
10.3 Quantization and Number Representation

In this section we discuss the effect of finite word lengths in digital signal processing. There are two main issues to be addressed. First, whenever sampling was considered so far, we assumed that the samples were known with infinite precision, which of course is impossible. Each sample must therefore be approximated by a finite binary word. The process by which a real number is converted to a finite binary word is called quantization.

Second, when in digital processing the result of an operation contains more bits than can be handled by the process downstream, the word length must be reduced. This can be done either by rounding, truncation, or clipping.

In what follows we assume that the reader is familiar with the basics of binary arithmetic. As a refresher we suggest Sections 9.0 to 9.2 of the book by Oppenheim and Schafer [3].
A quantizer is a zero-memory nonlinear device whose output x_out is related to the input x_in according to

x_out = q_i    if  x_i ≤ x_in < x_{i+1}    (10-1)

where q_i is an output number that identifies the input interval [x_i, x_{i+1}).

Uniform quantization is the most widely used law in digital signal processing and the only one discussed here.
All uniform quantizer characteristics have the staircase form shown in Figure 10-4. They differ in the number of levels, the limits of operation, and the location of the origin.

Figure 10-4 Uniform Quantizer Characteristic with b = 3 Bits. Rounding to the nearest level is employed. The binary numbers are interpreted as 2's complement.
Every quantizer has a finite range that extends between the limits x_min and x_max. Any input value exceeding these limits is clipped. The output levels q_i may be interpreted either as integers or as binary fractions (i.e., with a step size Δq of 1 or 2^{-b}).
A quantizer exhibits small-scale nonlinearity within its individual steps and large-scale nonlinearity if operated in the saturation range. The amplitude of the input signal has to be controlled to avoid severe distortion of the signal from either nonlinearity. The joint operation of the analog-to-digital (A/D) converter and the AGC is of crucial importance to the proper operation of any receiver. The input amplitude control of the A/D converter is known as loading adjustment.
Quantizer characteristics can be categorized as possessing midstep or midriser staircases, according to their properties in the vicinity of zero input. Each type has its own advantages and drawbacks, and both are encountered extensively in practice.
In Figure 10-4 a midstep characteristic with an even number of levels, L = 2^3 = 8, is shown. The characteristic is obtained by rounding the input value to the nearest quantization level. A 2's complement representation of the binary numbers is used in this example. The characteristic exhibits a dead zone at the origin. When an error detector possesses such a dead zone, the feedback loop tends toward instability. This characteristic is thus to be avoided in such applications. The characteristic is asymmetric, since the number -1 is represented but not the number +1. If the quantizer operates in both saturation modes, it will therefore produce a nonzero mean output despite the fact that the input signal has zero mean. The characteristic can easily be made symmetric by omitting the most negative value.
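A minimal sketch of the rounding (midtread) characteristic of Figure 10-4, using an illustrative normalization in which the codes represent multiples of 1/4; it reproduces the dead zone and the asymmetry discussed above.

```python
def midtread_quantize(x, bits=3, step=0.25):
    """Rounding (midtread) quantizer as in Figure 10-4: b-bit two's complement code,
    output value = code * step. Note the dead zone around zero and the asymmetry:
    -1 is representable, +1 is not."""
    code = round(x / step)                                        # round to the nearest level
    code = max(-2 ** (bits - 1), min(2 ** (bits - 1) - 1, code))  # saturate at the range limits
    return code, code * step

for x in (-1.2, -0.1, 0.1, 0.9):
    code, value = midtread_quantize(x)
    print(f"{x:+.2f} -> code {code:+d}, value {value:+.2f}")
```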
A different characteristic is obtained by truncation. Truncation is the operation which chooses the largest integer less than or equal to x_in/Δx. For example, for x_in/Δx = 0.8 we obtain INT(0.8) = 0, but for x_in/Δx = -3.4 we obtain INT(-3.4) = -4. In Figure 10-5 the characteristic obtained by truncation is shown. A 2's complement representation of the binary numbers is used.

This characteristic is known as an offset quantizer.³ Since it is no longer symmetric, it will bias the output signal even for small signal amplitudes. This leads to difficulties in applications where a zero-mean output is required for a zero-mean input signal.
A midriser characteristic is readily obtained from the truncated characteristic in Figure 10-5 by increasing the word length of the quantizer output by 1 bit and setting the LSB (least significant bit) to 1 for all words (Figure 10-6).

³ In practical A/D converters the origin can be shifted by adjusting an input offset voltage.
Figure 10-5 Offset Quantizer Employing Truncation. b = 3; binary numbers interpreted as 2's complement.
Notice that the extra bit need not be produced in the physical A/D converter but can be added in the digital processor after A/D conversion.

The midriser characteristic is symmetric and has no dead zone around zero input. A feedback loop whose error detector has a discontinuous step at zero will dither about the step, which is preferable to the limit cycles induced by a dead zone.
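A minimal sketch of the truncation-plus-appended-LSB construction described above, again with an illustrative normalization; note the symmetric, dead-zone-free output levels at odd multiples of half a step.

```python
import math

def truncating_quantize(x, bits=3, step=0.25):
    """Offset quantizer of Figure 10-5: truncate (floor) to the next lower level."""
    code = math.floor(x / step)
    return max(-2 ** (bits - 1), min(2 ** (bits - 1) - 1, code))

def midriser_quantize(x, bits=3, step=0.25):
    """Midriser characteristic of Figure 10-6: truncate, then extend the word by one
    bit whose value is always 1, i.e. output word = 2*code + 1 (half-step levels)."""
    code = truncating_quantize(x, bits, step)
    word = 2 * code + 1              # appended LSB is fixed to 1
    return word, word * step / 2

for x in (-0.05, 0.05, 0.3):
    word, value = midriser_quantize(x)
    print(f"{x:+.2f} -> word {word:+d}, value {value:+.3f}")
```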
Figure 10-6 Midriser Characteristic Obtained by Adding an Additional Bit to the Quantizer Output
Figure 10-7 Quantizer Characteristic Obtained by Magnitude Truncation
The effect of the number representation on the quantizer characteristic is illustrated in Figure 10-7. In some applications it is advantageous to employ a sign-magnitude representation.

Figure 10-7 shows the characteristic obtained by truncation of the magnitude of the input signal. This is no longer a uniform quantizer: the center interval has double width. When the result of an operation contains more bits than can be handled downstream, the result must be shortened. The effect of rounding, truncation, or clipping is different for the various number representations.

The resulting quantization characteristic is analogous to that obtained earlier, with the exception that now both input and output are discretized. However, quantizing discrete values is more susceptible to causing biases than quantizing continuous values; it should therefore be performed even more carefully.
10.4 ASIC Design Case Study
In this case study we describe the design of a complete receiver chip for digital video broadcasting over satellite (DVB-S) [4]. The symbol rate of DVB is on the order of 40 Msymbols/s. The chip was realized in 0.5 μm CMOS technology with a (maximum) clock frequency of 88 MHz. The complexity of the operations and the symbol rate locate it in the upper-right corner of the complexity versus bandwidth plot of Figure 10-2. We outline the design flow of the project, the receiver structure, and the rationale of the decision making for its building blocks.

10.4.1 Implementation Loss

In an ASIC realization the chip area is, to a first approximation, proportional to the word length. Since the cost of a chip increases rapidly with area, choosing the quantization parameters is a major task.
The finite word length representation of numbers in a properly designed digital receiver ideally has the same effect as an additional white noise term. The resulting decrease of the signal-to-noise ratio is called the implementation loss. A second degradation with respect to perfect synchronization is caused by the variance of the synchronization parameter estimates and was called detection loss (see Chapter 7). The sum of these two losses, D_total, is the decrease of the signal-to-noise ratio with respect to a receiver with perfect synchronization and perfect implementation. It is exemplarily shown in Figure 10-8 for an 8-PSK trellis-coded modulation [5]. The left curve shows the BER for a system with perfect synchronization and infinite-precision arithmetic, while the dotted line shows the experimental results. It is seen that the experimental curve is indeed approximately obtained by shifting the perfect-system performance curve by D_total to the right.

Quantization is a nonlinear operation. It exhibits small-scale nonlinearity in its individual steps and large-scale nonlinearity in the saturation range. Its effect depends on the specific algorithm; it cannot be treated in general terms.
Figure 10-8 Loss D_total of the Experimental Receiver DIRECS (BER versus SNR (E_s/N_0); curves: trellis code, unquantized, and DIRECS experimental results)
In a digital receiver the performance measure of interest is the bit error rate. We are allowed to nonlinearly distort the signal as long as the processed signal represents a sufficient statistic for detection of acceptable accuracy. For this reason, quantization effects in digital receivers are distinctly different from those in other areas of digital signal processing (such as audio signal processing), which require a virtually quantization-error-free representation of the analog signal.
10.4.2 Design Methodology
At this level of complexity, system simulation is indispensable to evaluate the performance characteristics of the system with respect to given design alternatives. The design framework should provide the designer with a flexible and efficient environment to explore the alternatives and trade-offs on different levels of abstraction. This comprises investigations on the

- structural level, e.g., joint or separate carrier and timing synchronization
- algorithmic level, e.g., various estimation algorithms
- implementation level, e.g., architectures and word lengths

There is a strong interaction between these levels of abstraction. The principal task of a system engineer is to find a compromise between implementation complexity and system performance. Unfortunately, the complexity of the problem prevents a formalization of this optimization problem. Thus, practical system design is partly based on rough complexity estimates and experience, particularly at the structural level.

Typically a design engineer works hierarchically to cope with the problems of a complex system design. In a first step, structural alternatives are investigated. The next step is to transform the design into a model that can be used for system simulation. Based on this simulation, algorithmic alternatives and their performance are evaluated. At first this can be done without word-length considerations and may already lead to modifications of the system structure. The third step comprises developing the actual implementation, which requires a largely fixed structure in order to obtain complexity estimates of sufficient accuracy. At this step, bit-true modeling of all imperfections due to limited word lengths is indispensable to assess the final system performance.
10.4.3 Digital Video Broadcast Specification
The main points of the DVB standard are summarized in Table 10-1.

Table 10-1 Outline of the DVB Standard

Modulation: QPSK with root-raised cosine pulses (excess bandwidth α = 0.35) and Gray encoding
Convolutional channel coding: punctured inner code, rates R = 1/2 to 7/8
Example symbol rates: 20 ... 44 Msymbols/s

The data rate is not specified as a single value but is suggested to lie within the range of 18 to 68 Mb/s. The standard defines a concatenated coding scheme consisting of an inner convolutional code and an outer Reed-Solomon (RS) block code.
Figure 10-9 displays the bit error rate versus E_b/N_0 after the convolutional decoder for the two code rates R = 1/2 and R = 7/8 under the assumption of perfect synchronization and perfect convolutional decoder implementation. The output of the outer RS code is supposed to be quasi-error-free (one error per hour). The standard specifies a BER of 2 × 10^-4 at E_b/N_0 = 4.2 dB for R = 1/2 and at E_b/N_0 = 6.15 dB for code rate R = 7/8. This leaves a margin of 1 dB (see Figure 10-9) for the implementation loss of the complete receiver. This loss must also take into account the degradation due to the analog front end (AGC, filter, oscillator for down conversion). In the actual implementation the total loss was equally split into 0.5 dB for the analog front end and 0.5 dB for the digital part.
10.4.4 Receiver Structure
Figure 10-10 gives a structural overview of the receiver. A/D conversion is done at the earliest possible point of the processing chain. The costly analog parts are thus reduced to the minimum radio-frequency components. Down conversion and sampling are accomplished by free-running oscillators.

In contrast to analog receivers, where down conversion and phase recovery are performed simultaneously by a PLL, the two tasks are separated in a digital receiver. The received signal is first down converted to baseband with a free-running oscillator at approximately the carrier frequency f_c. This leaves a small residual normalized frequency offset Ω.

The digital part consists of the timing and phase synchronization units, the Viterbi and RS decoders, frame synchronizer, convolutional deinterleaver, descrambler, and the MPEG decoder for the video data. A microcontroller interacts via an I2C bus with the functional building blocks. It controls the acquisition process of the synchronizers and is used to configure the chip.
10.4.5 Input Quantization
The input signal to the A/D converter comes from a noncoherent AGC (see Volume 1, p. 278); the signal-to-noise ratio is unknown to the A/D converter. We must consider both large-scale and small-scale quantization effects.

An A/D converter can be viewed as a series connection of a soft limiter and a quantizer with an infinite number of levels. We denote the normalized overload level of the limiter by

V(ρ_i) = C_c(ρ_i) / √(P_s)

with

C_c(ρ_i): threshold of the soft limiter
P_s: signal power; P_n: noise power
ρ_i = P_s / P_n: signal-to-noise ratio of the input signal
Threshold level C_c, interval width Δx, and word length b are related by (Figure 10-11)

C_c + Δx = 2^{b-1} Δx    (10-4)
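Solving (10-4) for the quantization step gives a handy design relation; as a worked example, for the word lengths considered here:

\[
\Delta x = \frac{C_c}{2^{b-1}-1}, \qquad b = 3 \;\Rightarrow\; \Delta x = \frac{C_c}{3}, \qquad b = 4 \;\Rightarrow\; \Delta x = \frac{C_c}{7}.
\]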
Two problems arise:

1. What is the optimum overload level V(ρ_i)?
2. Since the signal-to-noise ratio is unknown, the sensitivity of the receiver performance with respect to a mismatch of the overload level V(ρ_i) must be determined.
Figure 10-11 A/D Conversion Viewed as Series Connection of Soft Limiter and Infinitely Extended Quantizer (b: number of bits, here b = 3; C_c: soft limiter threshold)
It is instructive to first consider the simple example of a sinusoidal signal plus Gaussian noise. We determine V(ρ_i) subject to the optimization criterion (the selection of this criterion will be justified later on)

E[(Q(x | C_c, b) - x)²] → min    (10-5)

with Q(x | C_c, b) the uniform midriser quantizer characteristic with parameters (C_c, b). In eq. (10-5) we are thus looking for the uniform quantizer characteristic which minimizes the quadratic error between the input signal and the quantizer output. The result of the optimization task is shown in Figure 10-12. In this figure the overload level is plotted versus ρ_i with the word length b as parameter. With P_s = A²/2, A the amplitude of the sinusoidal signal, we obtain for high SNR

V(ρ_i) = √2 · C_c(ρ_i)/A,    ρ_i ≫ 1    (10-6)

For large word length b we expect that the useful signal passes the limiter undistorted, i.e., C_c(ρ_i)/A ≈ 1 and V(ρ_i) ≈ √2. For a 4-bit quantization we obtain V(ρ_i) ≈ 1.26, which is close to √2. The value of V(ρ_i) decreases with decreasing word length. The overload level increases at low ρ_i: for a sufficiently fine quantization, V(ρ_i) becomes larger than the amplitude of the useful signal in order to pass larger amplitude values due to noise.
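A minimal Monte-Carlo sketch of the optimization criterion (10-5), assuming a unit-power sinusoid in Gaussian noise and a simple midriser quantizer whose step size follows (10-4); the function names, sample counts, and the quantizer model are illustrative only.

```python
import numpy as np

def midriser_quantize(x, clip_level, bits):
    """Uniform midriser quantizer with overload level clip_level and word length bits."""
    dx = clip_level / (2 ** (bits - 1) - 1)          # step size from (10-4)
    idx = np.floor(x / dx)
    idx = np.clip(idx, -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return (idx + 0.5) * dx                          # reconstruction at mid-step levels

def mse_for_overload(v, rho_db, bits, n=200_000, rng=np.random.default_rng(0)):
    """Monte-Carlo estimate of E[(Q(x)-x)^2] for a unit-power sinusoid in Gaussian noise."""
    ps = 1.0                                         # normalized signal power
    pn = ps / 10 ** (rho_db / 10)                    # noise power from the SNR rho_i
    phase = rng.uniform(0, 2 * np.pi, n)
    x = np.sqrt(2 * ps) * np.sin(phase) + rng.normal(0, np.sqrt(pn), n)
    cc = v * np.sqrt(ps)                             # overload level V = C_c / sqrt(P_s)
    return np.mean((midriser_quantize(x, cc, bits) - x) ** 2)

# Sweep the normalized overload level for b = 4 and a high input SNR,
# mimicking the kind of optimization behind Figure 10-12.
vs = np.linspace(0.5, 3.0, 51)
errors = [mse_for_overload(v, rho_db=20.0, bits=4) for v in vs]
print("optimum V ~", vs[int(np.argmin(errors))])    # the text above reports ~1.26 for b = 4
```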
We return to the optimality criterion of eq. (10-5), which was selected in order not to discard information prematurely. In the present case this implies passing the input signal amplitude undistorted to the ML decoder. It is well known that the ML decoder requires soft-decision inputs for optimum performance; the bit error rate increases rapidly for hard-quantized inputs. We thus expect that minimizing the quadratic error of eq. (10-5) preserves the soft-decision information needed by the decoder.
Figure 10-12 Optimum Normalized Overload Level V(ρ_i) for a Sinusoidal Signal plus Gaussian Noise. The parameter is the word length b.

Figure 10-13 Small-Scale Effects of Input Quantization (trellis-encoded 8-PSK modulation)
The small-scale effects of input quantization are shown exemplarily in Figure 10-13 for an 8-PSK modulation over an additive Gaussian noise channel [5]. From this figure we conclude that a 4-bit quantization is sufficient and a 5-bit quantization is practically indistinguishable from an infinite-precision representation.

The bit error performance of Figure 10-13 assumes an optimum overload factor V(ρ_i). To determine the sensitivity of the bit error rate to a deviation from the optimum value, a computer experiment was carried out. The result is shown in Figure 10-14. The clipping level is normalized to the square root of the signal power, √(P_s). The BER in Figure 10-14 is plotted for the two smallest values of E_b/N_0. The input signal-to-noise ratio, E_s/N_0, is related to E_b/N_0 via eq. (3-33):

(10-7)
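As a rough illustration only (an assumption, not necessarily the book's eq. (3-33)): for QPSK with 2 bits per symbol, inner code rate R and outer RS rate 188/204, the conversion would read

\[
\frac{E_s}{N_0} = \log_2(M)\, R \,\frac{188}{204}\,\frac{E_b}{N_0}
\;\Rightarrow\;
\left.\frac{E_s}{N_0}\right|_{\mathrm{dB}} \approx \left.\frac{E_b}{N_0}\right|_{\mathrm{dB}} + 2.1\ \mathrm{dB}
\quad (M = 4,\ R = 7/8).
\]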
Figure 10-14 BER versus Normalized Clipping Level (legend: E_b/N_0 = 5.4 dB, T_s/T = 0.4002, ΩT = 0.0, R = 7/8)
For both input values of E_s/N_0 the results are plotted for zero and maximum residual frequency offset, |ΩT|_max = 0.1213. The sampling rate is T_s/T = 0.4002. The BER is minimal for a normalized clipping level larger than 1. A design point for the normalized clipping level C_c/√(P_s) (eq. (10-8)) was selected. The results show a strong asymmetry with respect to the sign of the deviation from the optimum value. Clipping, C_c/√(P_s) < (C_c/√(P_s))_opt, strongly increases the BER, since the ML receiver is fed with hard-quantized input signals, which degrades its performance. The opposite case, C_c/√(P_s) > (C_c/√(P_s))_opt, is far less critical, since it only increases the quantization noise by increasing the resolution Δx (eq. (10-9)).
Figure 10-15 Synchronizer Structure
10.4.6 Timing and Phase Synchronizer Structure
The first step in the design process is the selection of a suitable synchronizer structure. Timing and phase synchronizers are separated (see Figure 10-15), which avoids interaction between these units. From a design point of view this separation is also advantageous, since it eases performance analysis and thus reduces design time and test complexity. An error feedback structure was chosen for both units for the following reasons: video broadcasting data is transmitted as a continuous stream, so only an initial acquisition process, which is not time-critical, has to be performed. For tracking purposes, error feedback structures are well suited and realizable with reasonable complexity. Among the candidate algorithms initially considered for timing recovery was the square and filter algorithm (Section 5.4). The algorithm works independently of the phase. It delivers an unambiguous estimate, requires no acquisition unit, and is simple to implement. This ease of implementation, however, exists only for a known nominal ratio of T/T_s = 4. Since the ratio T/T_s is only known to lie in the interval [2, 2.5], the square and filter algorithm is ruled out for this application.
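For reference, a minimal sketch of the square-and-filter timing estimator of Section 5.4, assuming the convenient case of exactly four samples per symbol (T/T_s = 4); it is this reliance on a fixed, known oversampling ratio that rules the algorithm out here.

```python
import numpy as np

def square_and_filter_timing(x, samples_per_symbol=4):
    """Feedforward timing estimate from the squared envelope (sketch of the
    square-and-filter idea; assumes exactly 4 samples per symbol)."""
    k = np.arange(len(x))
    # Spectral component of |x|^2 at the symbol rate 1/T = 1/(4 Ts)
    c = np.sum(np.abs(x) ** 2 * np.exp(-1j * 2 * np.pi * k / samples_per_symbol))
    # Normalized timing estimate in [-0.5, 0.5) symbol periods
    return -np.angle(c) / (2 * np.pi)
```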
10.4.7 Digital Phase-Locked Loop (DPLL) for Phase Synchronization
The detailed block diagram of the DPLL is shown in Figure 10-16. In this figure the input word length of the individual blocks and the output truncation operations are shown. The word lengths of the DPLL are found by bit-true computer simulation. The notation used is summarized in Figure 10-17 below.
Trang 18Ki=3:15
Ks=8:13
Figure lo-16 Block Diagram of the DPLL for Carrier Phase Synchronization
be within min 5 K,., S max
Figure lo-17 Notation Used in Figure lo-16
We next discuss the functional building blocks in some detail. The incoming signal is multiplied by a rotating phasor exp[j(Ω̂ nT/2 + θ̂)] by means of a CORDIC algorithm [6,7], subsequently filtered in the matched filter, and decimated to symbol rate. One notices that the matched filter is placed inside the closed loop. This is required to achieve a sufficiently large SNR at the phase error detector input. The loop filter has a proportional plus integral path. The output rate of the loop filter is doubled to 2/T by repeating each value, which is subsequently accumulated in the NCO. The accumulator is the digital equivalent of the integrator of the VCO in an analog PLL. The modulo-2π reduction (shown as a separate block) of the accumulator is automatically performed by the adder if a 2's complement number representation is used. The DPLL is brought into lock by applying a sweep value to the accumulator in the loop filter. The sweep value and the closing of the loop after lock detection are controlled by the acquisition control block.
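A toy numerical sketch (not the chip's actual parameters) of the proportional-plus-integral loop filter driving an NCO accumulator; the wraparound of the fixed-width accumulator is what realizes the modulo-2π reduction for free, as described above. The gains, word length, and the stand-in phase detector are assumptions for illustration.

```python
import numpy as np

ACC_BITS = 16                      # illustrative accumulator word length
FULL_SCALE = 1 << ACC_BITS         # 2^16 accumulator states span one full turn (2*pi)

def loop_filter(err, integrator, kp=0.01, ki=0.001):
    """Proportional-plus-integral loop filter (gains are illustrative)."""
    integrator += ki * err
    return kp * err + integrator, integrator

def nco_update(acc, increment):
    """NCO accumulator: the adder wraps around like a two's-complement register,
    which automatically implements the modulo-2*pi phase reduction."""
    return (acc + increment) % FULL_SCALE

def acc_to_phase(acc):
    """Map the accumulator word to a phase in [0, 2*pi)."""
    return 2 * np.pi * acc / FULL_SCALE

# Example: track a small constant frequency offset with a stand-in phase detector
true_phase, acc, integ = 0.3, 0, 0.0
for _ in range(2000):
    err = np.angle(np.exp(1j * (true_phase - acc_to_phase(acc))))   # wrapped phase error
    ctrl, integ = loop_filter(err, integ)
    acc = nco_update(acc, int(round(ctrl * FULL_SCALE / (2 * np.pi))))
    true_phase += 0.001                                             # residual frequency offset
print("residual phase error [rad]:", err)
```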
Matched Filter
The complex-valued matched filter is implemented as two equivalent real FIR filters with identical coefficients. To determine the number of taps and the word length of the filter coefficients, Figure 10-18 is helpful. The lower part shows the number of coefficients which can be represented for a given numerical value of the center tap. As an example, assume a center value of h_0 = 15, which can be represented by a 5-bit word in 2's complement representation. From Figure 10-18 it follows that the number of nonzero coefficients is then nine. Increasing the word length of h_0 to 6 bits, the maximum number of coefficients is nine for h_0 ≤ 21 and 14 for 21 < h_0 ≤ 31. The quadratic approximation error is shown in the upper part of Figure 10-18, again as a function of the center tap value. The error decays slowly for values h_0 > 15, while the number of filter taps and the word length increase rapidly. As a compromise between accuracy and complexity, the design point of Figure 10-18 and Table 10-2 was chosen.

In the following we explain the detailed considerations that lead to the hardware implementation of the matched filter. This example serves to illustrate the close interaction between algorithm and architecture design found in DSP systems.

A suitable architecture for an FIR filter is the transposed direct form (Figure 10-19). Since the coefficients are fixed, the area-intensive multipliers can be replaced by sequences of shift-and-add operations. As we see in Figure 10-20, each "1" of a coefficient requires an adder. Thus, we encounter an additional constraint on the system design: the coefficients should be chosen such that the number of "1"s is minimized.
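A behavioral sketch of the transposed direct form mentioned above, in floating point for clarity; in the ASIC, the products for the fixed coefficients would be realized as shift-and-add networks rather than general multipliers.

```python
def fir_transposed(x, b):
    """Transposed direct-form FIR filter: the current input sample feeds every tap
    at once, and a chain of registers holds the running partial sums."""
    s = [0.0] * (len(b) - 1)        # partial-sum registers between the taps
    y = []
    for xn in x:
        y.append(b[0] * xn + (s[0] if s else 0.0))
        for i in range(len(s) - 1):
            s[i] = b[i + 1] * xn + s[i + 1]
        if s:
            s[-1] = b[-1] * xn
    return y

# With fixed coefficients, each product b[k]*xn becomes a few shifts and adds in hardware.
print(fir_transposed([1, 0, 0, 0], [1, 2, 3]))   # impulse response: [1, 2, 3, 0]
```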
Figure 10-18 Matched Filter Approximation (quantization results for the matched filter at 2/T, plotted versus the value of the center tap)

Figure 10-19 FIR Filter in Transposed Direct Form
Figure 10-20 Replacement of Multipliers by Shift-and-Add Operations

The number of 1's can be further reduced by changing the number representation of the coefficients. In filters with variable coefficients (e.g., in the interpolator filter) this can be performed by Booth encoding [8]. For fixed-coefficient filters, each sequence of 1's (without the most significant bit) in a 2's complement number can be described in a canonical signed digit (CSD) representation with at most two nonzero digits:

Σ_{n=N}^{M} 1·2^n = 1·2^{M+1} - 1·2^N    (10-10)

This leads to the matched filter coefficients in CSD format shown in Table 10-2.
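A small sketch of a CSD (non-adjacent form) conversion consistent with the identity (10-10); the routine and the example value are illustrative, not the coefficients of Table 10-2.

```python
def to_csd(value, bits):
    """Convert an integer to canonical signed digit (CSD) form: a list of digits
    in {-1, 0, +1}, LSB first, with no two adjacent nonzero digits."""
    digits = []
    x = value
    while x != 0:
        if x & 1:
            d = 2 - (x & 3)          # +1 if x mod 4 == 1, -1 if x mod 4 == 3
            x -= d
        else:
            d = 0
        digits.append(d)
        x >>= 1
    digits += [0] * max(0, bits - len(digits))
    return digits

# Example: 15 = 16 - 1 needs only two nonzero digits in CSD instead of four ones
print(to_csd(15, 6))   # [-1, 0, 0, 0, 1, 0]
```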
As a result, we need nine adders to implement a branch of the matched filter. For these adders different implementation alternatives exist. The simplest is the carry ripple adder (Figure 10-21). It consists of full adder cells. This adder requires the smallest area but is the slowest one, since the critical path consists of the entire carry path.
Table 10-2 Matched Filter Coefficients (coefficient, numerical value, 2's complement, and canonical signed digit representation)

Figure 10-21 Carry Ripple versus Carry Save Adder (full adder and half adder cells; carry save representation)
By choosing an alternative number representation of the intermediate results, we obtain a speedup at a slightly increased area: the carry output of each adder is fed into the inputs of the following filter tap. This carry save format is a redundant number representation, since each numerical value can be expressed by different binary representations. Now, the critical path consists of one full adder cell only.
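A bit-level sketch of the carry-save idea: intermediate results are kept as a redundant (sum, carry) pair, so each new addition passes through only one full-adder level; a single conventional addition (the vector merging adder mentioned below) resolves the final value. Word width and operands are illustrative.

```python
def carry_save_add(sum_bits, carry_bits, addend, width=16):
    """Add 'addend' to a number held in redundant carry-save form (sum, carry).
    Each bit position is a full adder; no carry ripples across positions."""
    mask = (1 << width) - 1
    new_sum = (sum_bits ^ carry_bits ^ addend) & mask          # full-adder sum bits
    new_carry = (((sum_bits & carry_bits) |
                  (sum_bits & addend) |
                  (carry_bits & addend)) << 1) & mask          # full-adder carries, shifted left
    return new_sum, new_carry

# Accumulate a few operands; only at the end is a conventional (vector-merging) add needed
s, c = 0, 0
for operand in (13, 7, 25, 2):
    s, c = carry_save_add(s, c, operand)
print((s + c) & 0xFFFF)   # 47: the final add merges the redundant representation
```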
The word length of the intermediate results increases from left to right in Figure 10-19. This implies a growing size of the adders. By reordering the partial products, or "bitplanes" [9, 10, 11], in such a way that the smallest intermediate results are added first, the increase of word length is kept as small as possible. Thus, the silicon area is minimized. Taking into account that the requirements on processing speed and the silicon technology allow placing three bitplanes between two register slices, we get the structural block diagram of the matched filter depicted in Figure 10-22.
Figure 10-23 Detailed Block Diagram of the Matched Filter

An additional advantage of the reordering of the bitplanes can be seen in Figure 10-23, which shows the structure in detail. It consists of full and half adder cells and of registers. The carry overflow correction cells [10, 12] are required to reduce the word length of the intermediate results in carry save representation. The vector merging adder (VMA) converts the filter output from carry save back to 2's complement representation. Due to the early processing of the bitplanes with the smallest numerical coefficient values, the least significant bits are computed first and may be truncated without side effects. The word length of the VMA is decreased as well.
Phase Error Detector
A decision-directed detector with hard-quantized decisions is used (see Section 5.8):

g(θ_0 - θ̂) = Im[ â_n* z_n(ε̂) e^{-jθ̂} ]    (10-11)

with

â_n = sign{Re[z_n(ε̂) e^{-jθ̂}]} + j sign{Im[z_n(ε̂) e^{-jθ̂}]}    (10-12)

Inserting the hard-quantized symbols â_n into the previous equation, we obtain

g(θ_0 - θ̂) = Re(â_n) Im[z_n(ε̂) e^{-jθ̂}] - Im(â_n) Re[z_n(ε̂) e^{-jθ̂}]    (10-13)
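A compact sketch of the detector equations (10-11)-(10-13) for QPSK; numpy is used purely for illustration, whereas the hardware computes the same sign-and-multiply operations with adders.

```python
import numpy as np

def dd_phase_error(z_n, theta_hat):
    """Decision-directed phase error, eq. (10-13): hard decisions (signs of the
    de-rotated matched-filter output) multiplied with the quadrature components."""
    y = z_n * np.exp(-1j * theta_hat)                    # de-rotate by the phase estimate
    a_hat = np.sign(y.real) + 1j * np.sign(y.imag)       # hard-quantized decision, eq. (10-12)
    return a_hat.real * y.imag - a_hat.imag * y.real     # eq. (10-13)

# Example: a QPSK symbol received with a +0.1 rad phase offset gives a positive error
print(dd_phase_error((1 + 1j) * np.exp(1j * 0.1), theta_hat=0.0))
```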