Fig 42 The clean speech obtained at the output of our proposed ANC (Fig 30) by reducing
the time scale
8 The ultra high speed LMS algorithm implemented on parallel architecture
There are many problems that require enormous computational capacity to solve, and therefore the success of computational science in accurately describing and modelling the real world has helped to fuel the ever increasing demand for cheap computing power. Scientists are eager to find ways to test the limits of theories, using high performance computing to allow them to simulate more realistic systems in greater detail. Parallel computing offers a way to address these problems in a cost effective manner: it deals with the development of programs where multiple concurrent processes cooperate in the fulfilment of a common task. Finally, in this section we will develop the theory of the parallel computation of the widely used least-mean-square (LMS) algorithm¹, so named by its originators, Widrow and Hoff (1960) [2].
8.1 The spatial radix-r factorization
This section is devoted to proving that a discrete signal can be decomposed into r partial signals whose statistical properties remain invariant.
¹ M. Jaber, “Method and apparatus for enhancing processing speed for performing a least mean square operation by parallel processing”, US Patent No. 7,533,140, 2009.
Given a discrete signal x_n of length N, its radix-r decomposition yields the r partial signals x_l[n] = x_{rn+l}, for l = 0, 1, …, r − 1 and n = 0, 1, …, N/r − 1; this is the product of the identity matrix of size r by r sets of vectors of size N/r, where the l-th element of the n-th product is stored into the memory address location rn + l. The mean of x_n is

E[x] = \frac{1}{N}\sum_{n=0}^{N-1} x_n,   (101)

which can be factorized as

E[x] = \frac{1}{N}\sum_{l=0}^{r-1}\sum_{n=0}^{N/r-1} x_{rn+l} = \frac{1}{r}\sum_{l=0}^{r-1} E[x_l].

Similarly to the mean, the variance of the signal x_n can be expressed in terms of the variances of its r partial signals according to

\sigma_x^2 = \frac{1}{N}\sum_{n=0}^{N-1}\big(x_n - E[x]\big)^2 = \frac{1}{r}\sum_{l=0}^{r-1}\sigma_{x_l}^2.
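As a quick numerical check of this invariance, the following sketch (a minimal illustration, assuming the partial signals x_l[n] = x_{rn+l} defined above; the signal, its length N and the radix r are arbitrary choices) splits a random signal into r partial signals and compares the global mean and variance with the values recovered from the partial signals.

```python
import numpy as np

# Numerical check of the radix-r decomposition described above, assuming the
# partial signals are x_l[n] = x[r*n + l] (illustrative signal and parameters).
N, r = 1024, 4
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=N)

partials = [x[l::r] for l in range(r)]         # r partial signals of length N/r

# The global mean is exactly the average of the r partial means (Eq. (101) factorized).
print("global mean          :", x.mean())
print("average partial mean :", np.mean([p.mean() for p in partials]))

# The global variance equals the average of the partial second moments taken
# about the global mean; each individual partial variance is also close to it.
print("global variance      :", x.var())
print("from partial signals :", np.mean([np.mean((p - x.mean()) ** 2) for p in partials]))
print("individual partial variances:", [round(p.var(), 3) for p in partials])
```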
8.2 The parallel implementation of the least squares method
The method of least squares assumes that the best-fit curve of a given type is the curve that has the minimal sum of the squared deviations (least square error) from a given set of data. Suppose that the N data points are (x_0, y_0), (x_1, y_1), …, (x_{N−1}, y_{N−1}), where x is the independent variable and y is the dependent variable. The fitting curve d has the deviation (error) σ from each data point, i.e., σ_0 = d_0 − y_0, σ_1 = d_1 − y_1, …, σ_{N−1} = d_{N−1} − y_{N−1}, which can be re-ordered over the r subdivided data sets as σ_{rn+j_0} = d_{rn+j_0} − y_{rn+j_0}, for j_0 = 0, 1, …, r − 1 and n = 0, 1, …, N/r − 1.
According to the method of least squares, the best fitting curve has the property that the sum of the squared deviations of each subdivided data set,

\sum_{n=0}^{N/r-1}\sigma_{rn+j_0}^2,

is a minimum for j_0 = 0, 1, …, r − 1. In order to pick the line which best fits the data, we need a criterion to determine which linear estimator is the “best”. The sum of square errors (also called the mean square error (MSE)) is a widely utilized performance criterion:
J = \frac{1}{2N}\sum_{n=0}^{N-1}\sigma_n^2 = \sum_{j_0=0}^{r-1} J_{j_0}, \qquad J_{j_0} = \frac{1}{2N}\sum_{n=0}^{N/r-1}\sigma_{rn+j_0}^2,

where J_{j_0} is the partial MSE applied on the subdivided data.
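The decomposition of the MSE into partial MSEs can be checked numerically. The sketch below is illustrative only: it assumes the subdivision by residue class modulo r and the 1/(2N) scaling used above, fits a line to a hypothetical noisy data set with numpy.polyfit, and verifies that the total criterion J equals the sum of the r partial criteria J_{j0}.

```python
import numpy as np

# Illustration of J = sum of the partial MSEs J_{j0}, assuming the data are
# subdivided by residue class modulo r (hypothetical noisy linear data set).
N, r = 240, 3
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, N)
y = 2.5 * x + 0.7 + 0.1 * rng.normal(size=N)

# Least-squares line fit d_n = w*x_n + b and its deviations.
w, b = np.polyfit(x, y, 1)
d = w * x + b
sigma = d - y

# Total MSE and the r partial MSEs computed on the subdivided data.
J = np.sum(sigma ** 2) / (2 * N)
J_parts = [np.sum(sigma[j0::r] ** 2) / (2 * N) for j0 in range(r)]

print(f"J = {J:.6f}, sum of partial MSEs = {sum(J_parts):.6f}")
```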
Our goal is to minimize J analytically, which according to Gauss can be done by taking its partial derivatives with respect to the unknowns and equating the resulting equations to zero:

\frac{\partial J_{j_0}}{\partial b} = 0, \qquad \frac{\partial J_{j_0}}{\partial w} = 0,

for j_0 = 0, 1, …, r − 1.
For a linear estimator with several weights, the deviation of each subdivided data set is formed from a weighted combination of the inputs, and the criterion becomes

J = \frac{1}{2N}\sum_{n=0}^{N-1}\Big(d_n - \sum_{k} w_k\, x_{n,k}\Big)^2 = \sum_{j_0=0}^{r-1} J_{j_0}, \qquad J_{j_0} = \frac{1}{2N}\sum_{n=0}^{N/r-1}\Big(d_{rn+j_0} - \sum_{k} w_{j_0,k}\, x_{rn+j_0,k}\Big)^2,

where J_{j_0} is the partial MSE applied on the subdivided data.
The solution to the extreme (minimum) of this equation can be found in exactly the same way as before, that is, by taking the derivatives of J_{j_0} with respect to the unknowns (w_k) and equating the result to zero.
Instead of solving equations 110 and 111 analytically, a gradient adaptive system can be used, in which the derivative is estimated with the difference operator. This estimation is given by

\nabla J_{j_0} = \frac{\partial J_{j_0}}{\partial w} \approx \frac{\Delta J_{j_0}}{\Delta w},

where in this case the bias b is set to zero.
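Such a gradient adaptive system can be sketched in a few lines. The example below is a minimal illustration, with the bias b set to zero as stated above; the data, the weight value and the perturbation Δw are arbitrary choices. It estimates ∂J/∂w with the difference operator and compares the result with the analytic derivative.

```python
import numpy as np

# Difference-operator estimate of the derivative of a partial MSE J(w),
# with the bias b set to zero (illustrative data and step sizes).
rng = np.random.default_rng(2)
x = rng.normal(size=200)
d = 1.8 * x + 0.05 * rng.normal(size=200)   # hypothetical desired signal

def J(w):
    """Partial MSE of the linear estimator y = w*x (b = 0)."""
    eps = d - w * x
    return np.mean(eps ** 2) / 2.0

w, dw = 0.3, 1e-4
grad_estimate = (J(w + dw) - J(w)) / dw          # difference operator
grad_analytic = -np.mean((d - w * x) * x)        # exact derivative of J

print(f"difference estimate: {grad_estimate:.6f}")
print(f"analytic derivative: {grad_analytic:.6f}")
```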
8.3 Search of the performance surface with steepest descent
The method of steepest descent (also known as the gradient method) is the simplest example
of a gradient based method for minimizing a function of several variables [12] In this section we
will be elaborating the linear case
Since the performance surfaces for the linear case implemented in parallel are r paraboloids, each of which has a single minimum, an alternate procedure to find the best value of the coefficient w_{j_0} is to search the performance surface in parallel instead of computing the best coefficient analytically by Eq. 110. The search for the minimum of a function can be done efficiently using a broad class of methods that use gradient information. The gradient has two main advantages for search:
The gradient can be computed locally.
The gradient always points in the direction of maximum change.
If the goal is to reach the minimum in each parallel segment, the search must be in the direction opposite to the gradient. So, the overall method of search can be stated in the following way:
Start the search with an arbitrary initial weight w_{j_0}(0), where the iteration is denoted by the index in parentheses (Fig 43). Then compute the gradient of the performance surface at w_{j_0}(0), and modify the initial weight proportionally to the negative of the gradient at w_{j_0}(0). This changes the operating point to w_{j_0}(1). Then compute the gradient at the new position w_{j_0}(1), and apply the same procedure again, i.e.

w_{j_0}(k+1) = w_{j_0}(k) - \eta\,\nabla J_{j_0}(k),   (113)

where \nabla J_{j_0}(k) denotes the gradient of the performance surface at the k-th iteration of the j_0-th parallel segment. The constant η is used to maintain stability in the search by ensuring that the operating point does not move too far along the performance surface. This search procedure is called the steepest descent method (Fig 43).
Fig 43 The search using the gradient information [13]
If one traces the path of the weights from iteration to iteration, intuitively we see that if the constant η is small, eventually the best value for the coefficient w* will be found. Whenever w > w*, we decrease w, and whenever w < w*, we increase w.
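A one-weight steepest descent iteration over a parabolic performance surface follows directly from this description. The sketch below is illustrative only; the quadratic J(w), the location of the optimum w*, the step size η and the initial weight are arbitrary choices.

```python
# Steepest descent on a single parabolic performance surface J(w) = (w - w*)^2,
# using w(k+1) = w(k) - eta * dJ/dw (illustrative constants).
w_star = 1.5          # location of the minimum
eta = 0.1             # step size / learning rate
w = -2.0              # arbitrary initial weight w(0)

for k in range(25):
    grad = 2.0 * (w - w_star)   # gradient of the performance surface at w(k)
    w = w - eta * grad          # move opposite to the gradient
    if k % 5 == 0:
        print(f"iteration {k:2d}: w = {w:.5f}")

print(f"final w = {w:.5f}, optimum w* = {w_star}")
```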
8.4 The radix-r parallel LMS algorithm
Based on what was proposed in [2], using the instantaneous value of the gradient as the estimator for the true quantity, i.e., dropping the summation in equation 108 and then taking the derivative with respect to w, yields

\nabla J_{j_0}(k) \approx \frac{\partial}{\partial w_{j_0}}\Big(\frac{1}{2}\,\varepsilon_{j_0}^2(k)\Big) = -\,\varepsilon_{j_0}(k)\, x_{j_0}(k).   (114)
What this equation tells us is that an instantaneous estimate of the gradient is simply the product of the input to the weight times the error at iteration k. This means that the gradient can be estimated with one multiplication per weight. This is the gradient estimate that leads to the famous Least Mean Square (LMS) algorithm (Fig 44).
If the estimator of Eq. 114 is substituted in Eq. 113, the steepest descent equation becomes

w_{j_0}(k+1) = w_{j_0}(k) + \eta\,\varepsilon_{j_0}(k)\, x_{j_0}(k).

This equation is the r-parallel LMS algorithm which, used as a predictive filter, is illustrated in Figure 45. The small constant η is called the step size or the learning rate.
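A minimal sketch of this update rule is given below. It assumes the per-segment form w_{j0}(k+1) = w_{j0}(k) + η ε_{j0}(k) x_{j0}(k) with the data subdivided by residue class modulo r; the signal model, the step size and the radix are illustrative choices, not the authors' settings.

```python
import numpy as np

# r-parallel LMS with one adaptive weight per segment, assuming the update
# w_j0 <- w_j0 + eta * e_j0 * x_j0 on data subdivided modulo r (illustrative).
rng = np.random.default_rng(3)
r, eta, n_samples = 2, 0.05, 4000
w_true = 0.9                                   # unknown gain to be identified
x = rng.normal(size=n_samples)
d = w_true * x + 0.01 * rng.normal(size=n_samples)

w = np.zeros(r)                                # one weight per parallel segment
for k in range(n_samples):
    j0 = k % r                                 # segment handling this sample
    e = d[k] - w[j0] * x[k]                    # instantaneous error
    w[j0] += eta * e * x[k]                    # LMS update for that segment

print("converged weights per segment:", np.round(w, 4))
```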
Fig 45 r-Parallel LMS Algorithm Used in a Predictive Filter
8.5 Simulation results
The notion of a mathematical model is fundamental to sciences and engineering. In the class of applications dealing with identification, an adaptive filter is used to provide a linear model that represents the best fit (in some sense) to an unknown signal. The widely used LMS algorithm is an extremely simple and elegant algorithm that is able to minimize the external cost function by using local information available to the system parameters. Because of its computational burden, and in order to speed up the process, this chapter has presented an efficient way to compute the LMS algorithm in parallel, where it follows from the simulation results that the stability of our models relies on the stability of our r parallel adaptive filters. It follows from figures 47 and 48 that the stability of the r parallel LMS filters (in this case r = 2) has been achieved, and the convergence performance of the overall model is illustrated in figure 49. The complexity of the proposed method is reduced by a factor of r in comparison to the direct method illustrated in figure 46. Furthermore, the simulation result of the channel equalization is illustrated in figure 50, in which the blue curve represents our parallel implementation (2 LMS filters implemented in parallel) compared to the conventional method, whose curve is shown in a different colour.
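The two-branch arrangement behind Figures 47-49 can be sketched as follows. This is an illustrative reconstruction only: the filter order, the step size and the test signal are assumptions rather than the authors' simulation settings. The input is split into its even and odd partial signals, each partial signal drives its own LMS predictor, and the two partial error sequences are interleaved to reconstruct the overall error signal.

```python
import numpy as np

def lms_predictor(x, order=4, eta=0.01):
    """One-step LMS predictor: predict x[n] from the previous `order` samples."""
    w = np.zeros(order)
    err = np.zeros(len(x))
    for n in range(order, len(x)):
        u = x[n - order:n][::-1]       # most recent samples first
        y = w @ u                      # predicted sample
        err[n] = x[n] - y
        w += eta * err[n] * u          # LMS weight update
    return err

# Illustrative test signal: a sinusoid in white noise (not the authors' data).
rng = np.random.default_rng(4)
n = np.arange(8000)
x = np.sin(2 * np.pi * 0.01 * n) + 0.05 * rng.normal(size=len(n))

# r = 2 parallel branches on the even and odd partial signals.
e0 = lms_predictor(x[0::2])
e1 = lms_predictor(x[1::2])

# Interleave the partial errors to reconstruct the overall error signal.
e = np.empty(len(x))
e[0::2], e[1::2] = e0, e1
print("error power, first vs last quarter:",
      np.mean(e[: len(e)//4] ** 2), np.mean(e[-len(e)//4:] ** 2))
```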
Fig 47 Simulation Result of the first partial LMS Algorithm
Fig 48 Simulation Result of the second partial LMS Algorithm
Fig 49 Simulation Result of the Overall System (reconstructed error signal)
References
[1] S. Haykin, Adaptive Filter Theory, Prentice-Hall, Englewood Cliffs, NJ, 1991.
[2] B. Widrow and S. D. Stearns, Adaptive Signal Processing, Prentice-Hall, 1985.
[3] K. Mayyas and T. Aboulnasr, “A Robust Variable Step Size LMS-Type Algorithm: Analysis and Simulations”, IEEE, 1995, pp. 1408-1411.
[4] T. Aboulnasr and K. Mayyas, “Selective Coefficient Update of Gradient-Based Adaptive Algorithms”, IEEE, 1997, pp. 1929-1932.
[5] E. Bjarnason, “Analysis of the Filtered-X LMS Algorithm”, IEEE, 1993, pp. III-511 to III-514.
[6] E. A. Wan, “Adjoint LMS: An Efficient Alternative to the Filtered-X LMS and Multiple Error LMS Algorithms”, Oregon Graduate Institute of Science & Technology, Department of Electrical Engineering and Applied Physics, P.O. Box 91000, Portland, OR 97291.
[7] B. Farhang-Boroujeny, Adaptive Filters: Theory and Applications, Wiley, 1999.
[8] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series, New York: Wiley, 1949. ISBN 0-262-73005-7.
[9] M. Jaber, “Noise Suppression System with Dual Microphone Echo Cancellation”, US Patent No. 6,738,482.
[10] M. Jaber, “Voice Activity Detection Algorithm for Voiced/Unvoiced Decision and Pitch Estimation in a Noisy Speech Feature Extraction”, US Patent Application No. 60/771,167, 2007.
[11] M. Jaber and D. Massicotte, “A Robust Dual Predictive Line Acoustic Noise Canceller”, International Conference on Digital Signal Processing (DSP 2009), Santorini, Greece, 2009.
[12] M. Jaber and D. Massicotte, “A New FFT Concept for Efficient VLSI Implementation: Part I – Butterfly Processing Element”, 16th International Conference on Digital Signal Processing (DSP’09), Santorini, Greece, 5-7 July 2009.
[13] J. C. Principe, W. C. Lefebvre and N. R. Euliano, Neural Systems: Fundamentals through Simulation, 1996.
An LMS Adaptive Filter Using Distributed Arithmetic - Algorithms and Architectures
Kyo Takahashi¹, Naoki Honma² and Yoshitaka Tsunekawa²
¹Iwate Industrial Research Institute
(… & Koizumi, 1988). Therefore, implementations of very high order adaptive filters are required. In order to satisfy these requirements, highly-efficient algorithms and architectures are desired. The adaptive filter is generally constructed by using multipliers, adders, memories, and so on, whereas a structure without multipliers has been proposed.
The LMS adaptive filter using distributed arithmetic can be realized with adders and memories but without multipliers, that is, it can be achieved with a small amount of hardware. Distributed Arithmetic (DA) is an efficient calculation method for the inner product of constant vectors, and it has been used in DCT realizations. Furthermore, it is suitable for the time-varying coefficient vector of the adaptive filter. Cowan and others proposed a Least Mean Square (LMS) adaptive filter using the DA on an offset binary coding (Cowan & Mavor, 1981; Cowan et al., 1983). However, it was found that the convergence speed of this method is extremely degraded (Tsunekawa et al., 1999). This degradation results from an offset bias added to an input signal coded on the offset binary coding. To overcome this problem, an update algorithm generalized with 2’s complement representation has been proposed (Tsunekawa et al., 1999), and the convergence condition has been analyzed (Takahashi et al., 2002). Effective architectures for the LMS adaptive filter using the DA have also been proposed (Tsunekawa et al., 1999; Takahashi et al., 2001). The LMS adaptive filter using distributed arithmetic is denoted DA-ADF. The DA is applied to the output calculation, i.e., the inner product of the input signal vector and the coefficient vector. The output signal is obtained by the shift and addition of the partial-products specified by the bit patterns of the N-th order input signal vector. This process is performed from LSB to MSB at every sampling instance, where B indicates the word length. The B partial-products used to obtain the output signal are updated from LSB to MSB as well. There exist 2^N partial-products, and the set including all the partial-products is called the Whole Adaptive Function Space (WAFS). Furthermore, the DA-ADF using a multi-memory block structure that uses a divided WAFS (MDA-ADF) (Wei & Lou, 1986; Tsunekawa et al., 1999) and the MDA-ADF using a half-memory algorithm based on the pseudo-odd symmetry property of the WAFS (HMDA-ADF) have been proposed (Takahashi et al., 2001). The divided WAFS is denoted DWAFS.
In this chapter, a new algorithm and an effective architecture for the MDA-ADF are discussed. The objectives are improvements of the MDA-ADF while permitting an increase in the amount of hardware and power dissipation. The convergence properties of the new algorithm are evaluated by computer simulations, and the efficiency of the proposed VLSI architecture is also discussed.
The output signal of an adaptive filter is represented as

y(k) = \sum_{i=0}^{N-1} w_i(k)\, s(k-i),   (1)

where s(k) is the filter input signal and w_i(k) is the i-th tap coefficient of the adaptive filter. Widrow’s LMS algorithm (Widrow et al., 1975) is represented as

\mathbf{w}(k+1) = \mathbf{w}(k) + 2\mu\, e(k)\, \mathbf{s}(k),

where \mathbf{w}(k) = [w_0(k), …, w_{N-1}(k)]^T, \mathbf{s}(k) = [s(k), …, s(k-N+1)]^T, and e(k), μ and d(k) are the error signal, the step-size parameter and the desired signal, respectively. The step-size parameter determines the convergence speed and the accuracy of the estimation. The error signal is obtained by

e(k) = d(k) - y(k).
The fundamental structure of the LMS adaptive filter is shown in Fig 1. The filter input signal s(k) is fed into the delay-line and shifted to the right at every sampling instance. The taps of the delay-line provide the delayed input signals corresponding to the depth of the delay elements. The tap outputs are multiplied by the corresponding coefficients, and the sum of these products is the output of the LMS adaptive filter. The error signal is defined as the difference between the desired signal and the filter output signal. The tap coefficients are updated using the products of the input signals and the scaled error signal.
Fig 1 Fundamental Structure of the 4-tap LMS adaptive filter
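The structure of Fig 1 can be sketched directly from Eq. (1), the error definition and the Widrow update given above. The example below is a minimal system-identification simulation with an assumed unknown 4-tap system and an illustrative step size; it is not the authors' code.

```python
import numpy as np

# A 4-tap LMS adaptive filter following the structure of Fig. 1:
# output y(k) per Eq. (1), error e(k) = d(k) - y(k), and the Widrow update
# w(k+1) = w(k) + 2*mu*e(k)*s(k). Unknown system and constants are illustrative.
N, mu, n_samples = 4, 0.01, 5000
rng = np.random.default_rng(5)

h = np.array([0.6, -0.3, 0.2, 0.1])        # assumed unknown system to identify
s_in = rng.normal(size=n_samples)          # filter input signal s(k)
d = np.convolve(s_in, h)[:n_samples]       # desired signal from the unknown system

w = np.zeros(N)                            # adaptive tap coefficients
s = np.zeros(N)                            # delay line (tap input vector)
for k in range(n_samples):
    s = np.roll(s, 1)                      # shift the delay line by one sample
    s[0] = s_in[k]                         # newest input sample enters the line
    y = w @ s                              # filter output, Eq. (1)
    e = d[k] - y                           # error signal
    w = w + 2 * mu * e * s                 # Widrow LMS coefficient update

print("identified taps:", np.round(w, 3))
print("true taps      :", h)
```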
3 LMS adaptive filter using distributed arithmetic
In the following discussions, the fundamentals of the DA on the 2’s complement representation and the derivation of the DA-ADF are explained. The degradation of the convergence property and the drastic increase of the amount of hardware in the DA-ADF are the serious problems for its higher order implementation. As the solutions to overcome these problems, the multi-memory block structure and the half-memory algorithm based on the pseudo-odd symmetry property of the WAFS are explained.
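To see why the amount of hardware grows so quickly, consider the memory needed to hold the partial-products. The sketch below uses simple arithmetic only: a single WAFS needs 2^N words, and the multi-memory block structure is assumed here to split the N inputs into M blocks of N/M inputs each (an assumption about the partitioning, not a figure taken from this chapter).

```python
# Rough memory comparison: a single WAFS needs 2**N words, while dividing the
# N inputs into M blocks (assumed N/M inputs per block) needs M * 2**(N/M) words.
def wafs_words(N: int) -> int:
    return 2 ** N

def divided_wafs_words(N: int, M: int) -> int:
    assert N % M == 0, "assume the taps divide evenly into blocks"
    return M * 2 ** (N // M)

for N in (8, 16, 32):
    print(f"N={N:2d}: WAFS={wafs_words(N):>14,} words, "
          f"divided (M=4)={divided_wafs_words(N, 4):>8,} words")
```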
3.1 Distributed arithmetic
The DA is an efficient calculation method for an inner product by a table lookup method (Peled & Liu, 1974). Now, let’s consider the inner product

y = \mathbf{a}^T \mathbf{v} = \sum_{i=1}^{N} a_i v_i,   (6)

where \mathbf{a} = [a_1, a_2, …, a_N]^T is a constant vector and \mathbf{v} = [v_1, v_2, …, v_N]^T is an input vector, and the 2’s complement representation of each element

v_i = -v_i^0 + \sum_{k=1}^{B-1} v_i^k\, 2^{-k}.   (9)
In Eq. (9), v_i^k indicates the k-th bit of v_i, i.e., 0 or 1. By substituting Eq. (9) into Eq. (6), we obtain

y = -\sum_{i=1}^{N} a_i v_i^0 + \sum_{k=1}^{B-1}\Big(\sum_{i=1}^{N} a_i v_i^k\Big) 2^{-k}.   (10)

Eq. (10) indicates that the inner product y is obtained as the weighted sum of the partial-products. The first term of the right side is weighted by −1, i.e., the sign bit, and the following terms are weighted by 2^{-k}. Fig 2 shows the fundamental structure of the FIR filter using the DA (DA-FIR). The function table is realized using a Read Only Memory (ROM), and the right-shift and addition operation is realized using an adder and a register. The ROM holds the partial-products determined in advance by the tap coefficient vector and addressed by the bit-pattern of the input signal vector. From the above discussion, the operation time depends only on the word length B, not on the number of terms N, fundamentally. This means that the output latency depends only on the word length B. The FIR filter using the DA can be implemented without multipliers, that is, it is possible to reduce the amount of hardware.
Fig 2 Fundamental structure of the FIR filter using distributed arithmetic
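The bit-serial mechanism of Fig 2 can be modelled in software. The example below is an illustrative sketch, not a hardware description: it assumes B-bit fractional 2's complement inputs in [-1, 1), precomputes the 2^N-entry function table from the constant coefficients as in Eq. (10), and forms the inner product by shift-and-add over the B bit-slices, with the sign-bit slice weighted by -1.

```python
import numpy as np

def da_inner_product(a, v, B=16):
    """Distributed-arithmetic inner product of a constant vector a with input v.

    Each v_i is quantized to a B-bit fractional 2's complement value in [-1, 1);
    the partial-products are read from a 2**N entry function table (the ROM).
    """
    N = len(a)
    # Quantize: v_i = -b0 + sum_{k=1..B-1} b_k * 2**-k  (bits as in Eq. (9)).
    q = np.clip(np.round(np.asarray(v) * 2 ** (B - 1)),
                -2 ** (B - 1), 2 ** (B - 1) - 1).astype(np.int64)
    codes = q & ((1 << B) - 1)                 # B-bit 2's complement codes

    # Function table: entry `addr` holds the sum of a_i over the set bits of addr.
    table = [sum(a[i] for i in range(N) if (addr >> i) & 1)
             for addr in range(1 << N)]

    # Bit-serial shift-and-add over the B bit-slices; the sign slice is weighted by -1.
    y = 0.0
    for k in range(B):
        bits = (codes >> (B - 1 - k)) & 1      # k-th bit of every v_i
        addr = int(np.sum(bits << np.arange(N)))
        y += (-1.0 if k == 0 else 2.0 ** (-k)) * table[addr]
    return y

a = [0.25, -0.5, 0.125, 0.75]                  # constant coefficient vector
v = [0.3, -0.2, 0.9, -0.6]                     # input vector (illustrative values)
print("DA result     :", da_inner_product(a, v))
print("direct product:", float(np.dot(a, v)))
```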
3.2 Derivation of LMS adaptive algorithm using distributed arithmetic
The derivation of the LMS algorithm using the DA on the 2’s complement representation is as follows. The N-th order input signal vector in Eq. (1) is defined as

\mathbf{s}(k) = [\,s(k),\; s(k-1),\; …,\; s(k-N+1)\,]^T.
In Eq. (12) and Eq. (13), an address matrix A(k), which is determined by the bit pattern of the input signal vector, is represented. P(k) is a subset of the WAFS including the elements specified by the row vectors (access vectors) of the address matrix. Now, multiplying both sides by A^T(k), Eq. (4) becomes
To overcome this problem, the simplification of the term A^T(k)A(k)F in Eq. (21) has also been achieved on the 2’s complement representation (Tsunekawa et al., 1999). By using the relation