Yagle, A.E., "Fast Matrix Computations," Digital Signal Processing Handbook, Ed. Vijay K. Madisetti and Douglas B. Williams, Boca Raton: CRC Press LLC, 1999.
Fast Matrix Computations

Andrew E. Yagle
University of Michigan

10.1 Introduction
10.2 Divide-and-Conquer Fast Matrix Multiplication
    Strassen Algorithm • Divide-and-Conquer • Arbitrary Precision Approximation (APA) Algorithms • Number Theoretic Transform (NTT) Based Algorithms
10.3 Wavelet-Based Matrix Sparsification
    Overview • The Wavelet Transform • Wavelet Representations of Integral Operators • Heuristic Interpretation of Wavelet Sparsification
References
10.1 Introduction
This chapter presents two major approaches to fast matrix multiplication. We restrict our attention to matrix multiplication, excluding matrix addition and matrix inversion, since matrix addition admits no fast algorithm structure (save for the obvious parallelization), and matrix inversion (i.e., solution of large linear systems of equations) is generally performed by iterative algorithms that require repeated matrix-matrix or matrix-vector multiplications. Hence, matrix multiplication is the real problem of interest.
The first is the divide-and-conquer strategy made possible by Strassen's [1] remarkable reformulation of non-commutative $2 \times 2$ matrix multiplication. We also present the APA (arbitrary precision approximation) algorithms, which improve on Strassen's result at the price of approximation, and a recent result that reformulates matrix multiplication as convolution and applies number theoretic transforms. The second approach is to use a wavelet basis to sparsify the representation of Calderon-Zygmund operators as matrices. Since electromagnetic Green's functions are Calderon-Zygmund operators, this has proven to be useful in solving integral equations in electromagnetics. The sparsified matrix representation is used in an iterative algorithm to solve the linear system of equations associated with the integral equations, greatly reducing the computation. We also present some new insights that make the wavelet-induced sparsification seem less mysterious.
10.2 Divide-and-Conquer Fast Matrix Multiplication
10.2.1 Strassen Algorithm
It is not obvious that there should be any way to perform matrix multiplication other than using the definition of matrix multiplication, for which multiplying two $N \times N$ matrices requires $N^3$ multiplications and additions ($N$ for each of the $N^2$ elements of the resulting matrix). However, in 1969 Strassen [1] made the remarkable observation that the product of two $2 \times 2$ matrices
$$\begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{pmatrix} \begin{pmatrix} b_{1,1} & b_{1,2} \\ b_{2,1} & b_{2,2} \end{pmatrix} = \begin{pmatrix} c_{1,1} & c_{1,2} \\ c_{2,1} & c_{2,2} \end{pmatrix} \qquad (10.1)$$

may be computed using only seven multiplications (fewer than the obvious eight), as
$$\begin{aligned}
m_1 &= (a_{1,2} - a_{2,2})(b_{2,1} + b_{2,2}); &\quad m_2 &= (a_{1,1} + a_{2,2})(b_{1,1} + b_{2,2}); \\
m_3 &= (a_{1,1} - a_{2,1})(b_{1,1} + b_{1,2}); &\quad m_4 &= (a_{1,1} + a_{1,2})\,b_{2,2}; \\
m_5 &= a_{1,1}(b_{1,2} - b_{2,2}); &\quad m_6 &= a_{2,2}(b_{2,1} - b_{1,1}); \\
m_7 &= (a_{2,1} + a_{2,2})\,b_{1,1}; \\
c_{1,1} &= m_1 + m_2 - m_4 + m_6; &\quad c_{1,2} &= m_4 + m_5; \\
c_{2,1} &= m_6 + m_7; &\quad c_{2,2} &= m_2 - m_3 + m_5 - m_7.
\end{aligned} \qquad (10.2)$$
A vital feature of (10.2) is that it is non-commutative, i.e., it does not depend on the commutative property of multiplication. This can be seen easily by noting that each of the $m_i$ is the product of a linear combination of the elements of $A$ by a linear combination of the elements of $B$, in that order, so that it is never necessary to use, say, $a_{2,2} b_{2,1} = b_{2,1} a_{2,2}$. We note there exist commutative algorithms for $2 \times 2$ matrix multiplication that require even fewer operations, but they are of little practical use.
The significance of noncommutativity is that the noncommutative algorithm (10.2) may be applied as is to block matrices. That is, if the $a_{i,j}$, $b_{i,j}$, and $c_{i,j}$ in (10.1) and (10.2) are replaced by block matrices, (10.2) is still true. Since matrix multiplication can be subdivided into block submatrix operations (i.e., (10.1) is still true if $a_{i,j}$, $b_{i,j}$, and $c_{i,j}$ are replaced by block matrices), this immediately leads to a divide-and-conquer fast algorithm.
10.2.2 Divide-and-Conquer
To see this, consider the $2^n \times 2^n$ matrix multiplication $AB = C$, where $A$, $B$, $C$ are all $2^n \times 2^n$ matrices. Using the usual definition, this requires $(2^n)^3 = 8^n$ multiplications and additions. But if $A$, $B$, $C$ are subdivided into $2^{n-1} \times 2^{n-1}$ blocks $a_{i,j}$, $b_{i,j}$, $c_{i,j}$, then $AB = C$ becomes (10.1), which can be implemented with (10.2) since (10.2) does not require the products of subblocks of $A$ and $B$ to commute. Thus the $2^n \times 2^n$ matrix multiplication $AB = C$ can actually be implemented using only seven matrix multiplications of $2^{n-1} \times 2^{n-1}$ subblocks of $A$ and $B$. And these subblock multiplications can in turn be broken down by using (10.2) to implement them as well. The end result is that the $2^n \times 2^n$ matrix multiplication $AB = C$ can be implemented using only $7^n$ multiplications, instead of $8^n$.
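The recursion can be stated directly in code. The following Python/NumPy sketch (ours, not from the original chapter) implements (10.2) on blocks, recursing down to scalars; it assumes the matrix size is a power of two, and a practical implementation would switch to the ordinary algorithm below some block size rather than recursing all the way down.

```python
import numpy as np

def strassen(A, B):
    """Multiply two 2^n x 2^n matrices using the seven-product recursion (10.2)."""
    N = A.shape[0]
    if N == 1:                       # base case: a scalar product
        return A * B
    h = N // 2                       # split into the 2 x 2 block form of (10.1)
    a11, a12, a21, a22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    b11, b12, b21, b22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The seven block products of (10.2); the order of the factors is preserved,
    # so noncommutativity of the blocks is never a problem.
    m1 = strassen(a12 - a22, b21 + b22)
    m2 = strassen(a11 + a22, b11 + b22)
    m3 = strassen(a11 - a21, b11 + b12)
    m4 = strassen(a11 + a12, b22)
    m5 = strassen(a11, b12 - b22)
    m6 = strassen(a22, b21 - b11)
    m7 = strassen(a21 + a22, b11)
    C = np.empty_like(A)
    C[:h, :h] = m1 + m2 - m4 + m6    # c11
    C[:h, h:] = m4 + m5              # c12
    C[h:, :h] = m6 + m7              # c21
    C[h:, h:] = m2 - m3 + m5 - m7    # c22
    return C

A = np.random.randint(0, 10, (8, 8))
B = np.random.randint(0, 10, (8, 8))
assert np.array_equal(strassen(A, B), A @ B)
```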
The computational savings grow as the matrix size increases. For $n = 5$ ($32 \times 32$ matrices) the savings is about 50%. For $n = 12$ ($4096 \times 4096$ matrices) the savings is about 80%. The savings as a fraction can be made arbitrarily close to unity by taking sufficiently large matrices. Another way of looking at this is to note that $N \times N$ matrix multiplication requires $O(N^{\log_2 7}) = O(N^{2.807}) < N^3$ multiplications using Strassen.
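These figures are easy to check: after $n$ levels of recursion the ratio of multiplications to the $8^n$ of the definition is $(7/8)^n$, so the fraction saved is $1 - (7/8)^n$:

```python
for n in (5, 12):
    saved = 1 - (7 / 8) ** n    # fraction of multiplications saved relative to 8^n
    print(f"n = {n:2d} ({2**n} x {2**n}): about {saved:.0%} saved")
# n =  5 (32 x 32): about 49% saved
# n = 12 (4096 x 4096): about 80% saved
```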
Of course, we are not limited to subdividing into $2 \times 2 = 4$ subblocks. Fast non-commutative algorithms for $3 \times 3$ matrix multiplication requiring only $23 < 3^3 = 27$ multiplications were found by exhaustive search in [2] and [3]; 23 is the fewest known to date. Repeatedly subdividing $AB = C$ into $3 \times 3 = 9$ subblocks computes a $3^n \times 3^n$ matrix multiplication in $23^n < 27^n$ multiplications; $N \times N$ matrix multiplication then requires $O(N^{\log_3 23}) = O(N^{2.854})$ multiplications, so this is not quite as good as using (10.2). A fast noncommutative algorithm for $5 \times 5$ matrix multiplication requiring only $102 < 5^3 = 125$ multiplications was found in [4]; this also seems to be optimal. Using this algorithm, $N \times N$ matrix multiplication requires $O(N^{\log_5 102}) = O(N^{2.874})$ multiplications, so this is even worse. Of course, the idea is to write $N = 2^a 3^b 5^c$ for some $a, b, c$ and subdivide into $2 \times 2 = 4$ subblocks $a$ times, then subdivide into $3 \times 3 = 9$ subblocks $b$ times, etc. The total number of multiplications is then $7^a 23^b 102^c < 8^a 27^b 125^c = N^3$.
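For example, $N = 30 = 2 \cdot 3 \cdot 5$ (so $a = b = c = 1$) uses $7 \cdot 23 \cdot 102 = 16{,}422$ multiplications in place of $8 \cdot 27 \cdot 125 = 30^3 = 27{,}000$.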
Note that we have not mentioned additions. Readers familiar with nesting fast convolution algorithms will know why; we now review why reducing multiplications is much more important than reducing additions when nesting algorithms. The reason is that at each nesting stage (reversing the divide-and-conquer to build up algorithms for multiplying large matrices from (10.2)), each scalar addition is replaced by a matrix addition (which requires $N^2$ additions for $N \times N$ matrices), and each scalar multiplication is replaced by a matrix multiplication (which requires $N^3$ multiplications and additions for $N \times N$ matrices). Although we are reducing $N^3$ to about $N^{2.8}$, it is clear that each multiplication will produce more multiplications and additions as we nest than each addition will. So reducing the number of multiplications from eight to seven in (10.2) is well worth the extra additions incurred. In fact, the number of additions is also $O(N^{2.807})$.
The design of these base algorithms has been based on the theory of bilinear and trilinear forms. The review paper [5] and book [6] of Pan are good introductions to this theory. We note that reducing the exponent of $N$ in $N \times N$ matrix multiplication is an area of active research. This exponent has been reduced to below 2.5; a known lower bound is two. However, the resulting algorithms are too complicated to be useful.
10.2.3 Arbitrary Precision Approximation (APA) Algorithms
APA algorithms are noncommutative algorithms for $2 \times 2$ and $3 \times 3$ matrix multiplication that require even fewer multiplications than the Strassen-type algorithms, but at the price of requiring longer wordlengths. Proposed by Bini [7], the APA algorithm for multiplying two $2 \times 2$ matrices is this:
$$\begin{aligned}
p_1 &= (a_{2,1} + \epsilon a_{1,2})(b_{2,1} + \epsilon b_{1,2}); \\
p_2 &= (-a_{2,1} + \epsilon a_{1,1})(b_{1,1} + \epsilon b_{1,2}); \\
p_3 &= (a_{2,2} - \epsilon a_{1,2})(b_{2,1} + \epsilon b_{2,2}); \\
p_4 &= a_{2,1}(b_{1,1} - b_{2,1}); \\
p_5 &= (a_{2,1} + a_{2,2})\,b_{2,1}; \\
c_{1,1} &= (p_1 + p_2 + p_4)/\epsilon - \epsilon\,(a_{1,1} + a_{1,2})\,b_{1,2}; \\
c_{2,1} &= p_4 + p_5; \\
c_{2,2} &= (p_1 + p_3 - p_5)/\epsilon - \epsilon\,a_{1,2}(b_{1,2} - b_{2,2}).
\end{aligned} \qquad (10.3)$$
If we now let $\epsilon \to 0$, the second terms in (10.3) become negligible next to the first terms, and so they need not be computed. Hence, three of the four elements of $C = AB$ may be computed using only five multiplications. $c_{1,2}$ may be computed using a sixth multiplication, so that, in fact, two $2 \times 2$ matrices may be multiplied to arbitrary accuracy using only six multiplications. The APA $3 \times 3$ matrix multiplication algorithm requires 21 multiplications. Note that APA algorithms improve on the exact Strassen-type algorithms ($6 < 7$, $21 < 23$).
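As a concrete illustration (ours, not from the chapter), the following sketch implements the five-multiplication scheme with the $\epsilon$ placement reconstructed in (10.3), simply dropping the $O(\epsilon)$ correction terms; in floating point the returned entries differ from the exact ones by $O(\epsilon)$.

```python
def apa_2x2(A, B, eps=2.0**-20):
    """Bini-type APA product (10.3) as reconstructed above: returns c11, c21, c22
    from only five multiplications, dropping the O(eps) correction terms."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    p1 = (a21 + eps * a12) * (b21 + eps * b12)
    p2 = (-a21 + eps * a11) * (b11 + eps * b12)
    p3 = (a22 - eps * a12) * (b21 + eps * b22)
    p4 = a21 * (b11 - b21)
    p5 = (a21 + a22) * b21
    c11 = (p1 + p2 + p4) / eps    # off by eps * (a11 + a12) * b12
    c21 = p4 + p5                 # exact
    c22 = (p1 + p3 - p5) / eps    # off by eps * a12 * (b12 - b22)
    return c11, c21, c22

print(apa_2x2([[2.0, 4.0], [3.0, 5.0]], [[9.0, 8.0], [7.0, 6.0]]))
# ~ (46.0000458, 62.0, 54.0000076); the exact product is [[46, 40], [62, 54]]
```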
The APA algorithms are often described as being numerically unstable, due to roundoff error as $\epsilon \to 0$. We believe that an electrical engineering perspective on these algorithms puts them in a light different from that of the mathematical perspective. In fixed-point implementation, the computation $AB = C$ can be scaled to operations on integers, and the $p_i$ can be bounded. Then it is easy to set $\epsilon$ to a sufficiently small (negative) power of two to ensure that the second terms in (10.3) do not overlap the first terms, provided that the wordlength is long enough. Thus, the reputation for instability is undeserved. However, the requirement of large wordlengths to be multiplied seems also to have escaped notice; this may be a more serious problem in some architectures.
The divide-and-conquer and resulting nesting of APA algorithms work the same way as for the Strassen-type algorithms. $N \times N$ matrix multiplication using (10.3) requires $O(N^{\log_2 6}) = O(N^{2.585})$ multiplications, which improves on the $O(N^{2.807})$ multiplications using (10.2). But the wordlengths are longer.
A design methodology for fast matrix multiplication algorithms by grouping terms has been proposed in a series of papers by Pan (see References [5] and [6]). While this has proven quite fruitful, the methodology of grouping terms becomes somewhat ad hoc.
10.2.4 Number Theoretic Transform (NTT) Based Algorithms
An approach similar in flavor to the APA algorithms, but more flexible, has been taken recently in [8]. First, matrix multiplication is reformulated as a linear convolution, which can be implemented as the multiplication of two polynomials using the z-transform. Second, the variable $z$ is scaled, producing a scaled convolution, which is then made cyclic. This aliases some quantities, but they are separated by a power of the scaling factor. Third, the scaled convolution is computed using pseudo-number-theoretic transforms. Finally, the various components of the product matrix are read off of the convolution, using the fact that the elements of the product matrix are bounded. This can be done without error if the scaling factor is sufficiently large.

This approach yields algorithms that require the same number of multiplications as APA, or fewer, for $2 \times 2$ and $3 \times 3$ matrices. The multiplicands are again sums of scaled matrix elements, as in APA. However, the design methodology is quite simple and straightforward, and the reason why the fast algorithm exists is now clear, unlike for the APA algorithms. Also, the integer computations inherent in this formulation make possible the engineering insights into APA noted above.
We reformulate the product of two $N \times N$ matrices as the linear convolution of a sequence of length $N^2$ and a sparse sequence of length $N^3 - N + 1$. This results in a sequence of length $N^3 + N^2 - N$, from which the elements of the product matrix may be obtained. For convenience, we write the linear convolution as the product of two polynomials. This result (of [8]) seems to be new, although a similar result is briefly noted in ([3], p. 197). Define
$$\left(\sum_{i=0}^{N-1}\sum_{j=0}^{N-1} a_{i,j}\, x^{\,i+jN}\right)\left(\sum_{i=0}^{N-1}\sum_{j=0}^{N-1} b_{N-1-i,\,j}\, x^{\,iN+jN^2}\right) = \sum_{i=0}^{N^3+N^2-N-1} c_i\, x^{\,i};$$
$$c_{i,j} = c_{N^2-N+i+jN^2}, \qquad 0 \le i, j \le N-1 \qquad (10.4)$$

(the matrix indices in (10.4) run from 0, so that $a_{0,0}$ is the top-left element; elsewhere we index from 1). Note that the coefficients of all three polynomials are read off of the matrices $A$, $B$, $C$ column-by-column (each column of $B$ is reversed), and the result is noncommutative. For example, the $2 \times 2$ matrix multiplication (10.1) becomes
$$\left(a_{1,1} + a_{2,1} x + a_{1,2} x^2 + a_{2,2} x^3\right)\left(b_{2,1} + b_{1,1} x^2 + b_{2,2} x^4 + b_{1,2} x^6\right)$$
$$= * + *x + c_{1,1} x^2 + c_{2,1} x^3 + *x^4 + *x^5 + c_{1,2} x^6 + c_{2,2} x^7 + *x^8 + *x^9, \qquad (10.5)$$
where $*$ denotes an irrelevant quantity. In (10.5) substitute $x = sz$ and take the result $\bmod (z^6 - 1)$. This gives
$$\left(a_{1,1} + a_{2,1} sz + a_{1,2} s^2 z^2 + a_{2,2} s^3 z^3\right)\left((b_{2,1} + b_{1,2} s^6) + b_{1,1} s^2 z^2 + b_{2,2} s^4 z^4\right)$$
$$= (* + c_{1,2} s^6) + (*s + c_{2,2} s^7)z + (c_{1,1} s^2 + *s^8)z^2 + (c_{2,1} s^3 + *s^9)z^3 + *z^4 + *z^5 \pmod{z^6 - 1}. \qquad (10.6)$$
If $|c_{i,j}|, |*| < s^6$, then the $*$ and $c_{i,j}$ may be separated without error, since both are known to be integers. If $s$ is a power of two, $c_{1,2}$ may be obtained by discarding the $6 \log_2 s$ least significant bits in the binary representation of $* + c_{1,2} s^6$. The polynomial multiplication $\bmod (z^6 - 1)$ can be computed using number-theoretic transforms [9] with six multiplications. Hence, $2 \times 2$ matrix multiplication requires six multiplications. Similarly, $3 \times 3$ matrices may be multiplied using 21 multiplications. Note these are the same numbers required by the APA algorithms, the quantities multiplied are again sums of scaled matrix elements, and the results are again sums in which one quantity is partitioned from another quantity which is of no interest.
However, this approach is more flexible than the APA approach (see [8]). As an extreme case, setting $z = 1$ in (10.5) computes a $2 \times 2$ matrix multiplication using ONE (very long wordlength) multiplication! For example, using $s = 100$,
$$\begin{pmatrix} 2 & 4 \\ 3 & 5 \end{pmatrix} \begin{pmatrix} 9 & 8 \\ 7 & 6 \end{pmatrix} = \begin{pmatrix} 46 & 40 \\ 62 & 54 \end{pmatrix} \qquad (10.7)$$
becomes the single scalar multiplication
$$(5{,}040{,}302)(8{,}000{,}600{,}090{,}007) = 40{,}325{,}440{,}634{,}862{,}462{,}114. \qquad (10.8)$$
This is useful in optical computing architectures for multiplying large numbers.
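The single-multiplication trick of (10.7) and (10.8) is easy to reproduce with Python's arbitrary-precision integers. The helper below is our illustration, not code from [8]; it assumes nonnegative matrix entries small enough that every base-$s$ digit of the product stays below $s$.

```python
def one_mult_2x2(A, B, s=100):
    """Reproduce (10.7)-(10.8): ONE big-integer product yields a 2 x 2 matrix
    product, provided every base-s digit of the product stays below s."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    pa = a11 + a21 * s + a12 * s**2 + a22 * s**3      # A read column-by-column
    pb = b21 + b11 * s**2 + b22 * s**4 + b12 * s**6   # columns of B, each reversed
    prod = pa * pb                                    # the single multiplication
    digit = lambda d: (prod // s**d) % s              # base-s digit number d
    return [[digit(2), digit(6)], [digit(3), digit(7)]]  # c11 c12 / c21 c22 per (10.5)

print(one_mult_2x2([[2, 4], [3, 5]], [[9, 8], [7, 6]]))  # [[46, 40], [62, 54]]
```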
10.3 Wavelet-Based Matrix Sparsification
10.3.1 Overview
A common application of solving large linear systems of equations is the solution of integral equations arising in, say, electromagnetics. The integral equation is transformed into a linear system of equations using Galerkin's method, so that the entries in the matrix and in the vectors of knowns and unknowns are coefficients of basis functions used to represent the continuous functions in the integral equation. Intelligent selection of the basis functions results in a sparse (mostly zero entries) system matrix. The sparse linear system of unknowns is then usually solved using an iterative algorithm, which is where the sparseness becomes an advantage (iterative algorithms require repeated multiplication of the system matrix by the current approximation to the vector of unknowns).
Recently, wavelets have been recognized as a good choice of basis function for a wide variety of applications, especially in electromagnetics. This is true because in electromagnetics the kernel of the integral equation is a 2-D or 3-D Green's function for the wave equation, and these are Calderon-Zygmund operators. Using wavelets as basis functions makes the matrix representation of the kernel drop off rapidly away from the main diagonal, more rapidly than direct discretization of the integral equation would produce.
Here we quickly review the wavelet transform as a representation of continuous functions and show how it sparsifies Calderon-Zygmund integral operators. We also provide some insight into why this happens and present some alternatives that make the sparsification less mysterious. We present our results in terms of continuous (integral) operators, rather than discrete matrices, since this is the proper presentation for applications, and also since similar results can be obtained for the explicitly discrete case.

10.3.2 The Wavelet Transform
We will not attempt to present even an overview of the rich subject of wavelets. The reader is urged to consult the many papers and textbooks (e.g., [10]) now being published on the subject. Instead, we restrict our attention to aspects of wavelets essential to sparsification of matrix operator representations.
The wavelet transform of an $L^2$ function $f(x)$ is defined as
$$f_i(n) = 2^{i/2} \int_{-\infty}^{\infty} f(x)\, \psi(2^i x - n)\, dx; \qquad f(x) = \sum_i \sum_n f_i(n)\, \psi(2^i x - n)\, 2^{i/2}, \qquad (10.9)$$
where $\{\psi(2^i x - n),\ i, n \in \mathbf{Z}\}$ is a complete orthonormal basis for $L^2$. That is, $L^2$ (the space of square-integrable functions) is spanned by dilations (scalings) and translations of a wavelet basis function $\psi(x)$. Constructing this $\psi(x)$ is nontrivial, but has been done extensively in the literature.
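For concreteness, the simplest example (the Haar wavelet, one of the families listed below) is
$$\psi(x) = \begin{cases} 1, & 0 \le x < \tfrac{1}{2}, \\ -1, & \tfrac{1}{2} \le x < 1, \\ 0, & \text{otherwise}, \end{cases}$$
whose dilations and translations $\psi(2^i x - n)$ do form a complete orthonormal basis for $L^2$, though smoother constructions are preferred in most applications.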
Since the summations must be truncated to finite intervals in practice, we define the wavelet scaling function $\phi(x)$, whose translations on a given scale span the space spanned by the wavelet basis function $\psi(x)$ at all translations and at all scales coarser than the given scale. Then we can write
$$f(x) = 2^{I/2} \sum_n c_I(n)\, \phi(2^I x - n) + \sum_{i=I}^{\infty} \sum_n f_i(n)\, \psi(2^i x - n)\, 2^{i/2};$$
$$c_I(n) = 2^{I/2} \int_{-\infty}^{\infty} f(x)\, \phi(2^I x - n)\, dx. \qquad (10.10)$$
So the projection $c_I(n)$ of $f(x)$ on the scaling function $\phi(x)$ at scale $I$ replaces the projections $f_i(n)$ on the basis function $\psi(x)$ at scales coarser (smaller) than $I$. The scaling function $\phi(x)$ is orthogonal to its translations, but (unlike the basis function $\psi(x)$) it is not orthogonal between scales. Truncating the summation at the upper end approximates $f(x)$ at the resolution defined by the finest (largest) scale $i$; this is somewhat analogous to truncating a Fourier series expansion and neglecting the high-frequency components.
We also define the 2-D wavelet transform of $f(x, y)$ as
$$f_{i,j}(m,n) = 2^{i/2}\, 2^{j/2} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x,y)\, \psi(2^i x - m)\, \psi(2^j y - n)\, dx\, dy;$$
$$f(x,y) = \sum_{i,j,m,n} f_{i,j}(m,n)\, \psi(2^i x - m)\, \psi(2^j y - n)\, 2^{i/2}\, 2^{j/2}. \qquad (10.11)$$
However, it is more convenient to use the 2-D counterpart of (10.10), which is
$$c_I(m,n) = 2^I \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\, \phi(2^I x - m)\, \phi(2^I y - n)\, dx\, dy;$$
$$f^1_i(m,n) = 2^i \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\, \phi(2^i x - m)\, \psi(2^i y - n)\, dx\, dy;$$
$$f^2_i(m,n) = 2^i \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\, \psi(2^i x - m)\, \phi(2^i y - n)\, dx\, dy;$$
$$f^3_i(m,n) = 2^i \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\, \psi(2^i x - m)\, \psi(2^i y - n)\, dx\, dy;$$
$$\begin{aligned}
f(x,y) = &\sum_{m,n} c_I(m,n)\, \phi(2^I x - m)\, \phi(2^I y - n)\, 2^I \\
&+ \sum_{i=I}^{\infty} \sum_{m,n} f^1_i(m,n)\, \phi(2^i x - m)\, \psi(2^i y - n)\, 2^i \\
&+ \sum_{i=I}^{\infty} \sum_{m,n} f^2_i(m,n)\, \psi(2^i x - m)\, \phi(2^i y - n)\, 2^i \\
&+ \sum_{i=I}^{\infty} \sum_{m,n} f^3_i(m,n)\, \psi(2^i x - m)\, \psi(2^i y - n)\, 2^i.
\end{aligned} \qquad (10.12)$$
Once again the projection $c_I(m,n)$ on the scaling function at scale $I$ replaces all projections on the basis functions at scales coarser than $I$.
Examples of wavelets for which scaling and basis functions have been constructed include the Haar, Battle-Lemarie, Paley-Littlewood, Meyer, and Daubechies wavelets.
An important property of the wavelet basis function $\psi(x)$ is that its first $k$ moments can be made zero, for any integer $k$ [10]:
$$\int_{-\infty}^{\infty} x^m\, \psi(x)\, dx = 0, \qquad m = 0, 1, \ldots, k - 1. \qquad (10.13)$$
10.3.3 Wavelet Representations of Integral Operators
We wish to use wavelets to sparsify the $L^2$ integral operator $K(x, y)$ in
$$g(x) = \int_{-\infty}^{\infty} K(x, y)\, f(y)\, dy. \qquad (10.14)$$
A common situation: (10.14) is an integral equation with known kernel $K(x, y)$ and known $g(x)$, in which the goal is to compute an unknown function $f(y)$. Often the kernel $K(x, y)$ is the Green's function (spatial impulse response) relating an observed wave field or signal $g(x)$ to an unknown source field or signal $f(y)$.
For example, the Green's function for Laplace's equation in free space is
$$G(r) = -\frac{1}{2\pi} \log r \ \ (\text{2-D}); \qquad G(r) = \frac{1}{4\pi r} \ \ (\text{3-D}), \qquad (10.15)$$
where $r$ is the distance separating the points of source and observation. Now consider a line source in
an infinite 2-D homogeneous medium, with observations made along the same line. The observed field strength $g(x)$ at position $x$ is
$$g(x) = -\frac{1}{2\pi} \int_{-\infty}^{\infty} \log|x - y|\, f(y)\, dy, \qquad (10.16)$$
where $f(y)$ is the source strength at position $y$.
Using Galerkin's method, we expand $f(y)$ and $g(x)$ as in (10.9) and $K(x, y)$ as in (10.11). Using the orthogonality of the basis functions yields
$$\sum_j \sum_n K_{i,j}(m, n)\, f_j(n) = g_i(m), \qquad (10.17)$$
where $K_{i,j}(m,n)$ denotes the 2-D wavelet coefficients (10.11) of $K(x,y)$. Expanding $f(y)$ and $g(x)$ as in (10.10) and $K(x, y)$ as in (10.12) leads to another system of equations, which is notationally difficult to write out in general, but can clearly be done in individual applications.
We note here that the entries in the system matrix in this latter case can be rapidly generated using the fast wavelet algorithm of Mallat (see [10]).
The point of using wavelets is as follows. $K(x, y)$ is a Calderon-Zygmund operator if
$$\left|\frac{\partial^k}{\partial x^k} K(x, y)\right| + \left|\frac{\partial^k}{\partial y^k} K(x, y)\right| \le \frac{C_k}{|x - y|^{k+1}} \qquad (10.18)$$
for some $k \ge 1$. Note in particular that the Green's functions in (10.15) are Calderon-Zygmund operators. Then the representation (10.12) of $K(x, y)$ has the property [11]
$$|f^1_i(m,n)| + |f^2_i(m,n)| + |f^3_i(m,n)| \le \frac{C_k}{1 + |m - n|^{k+1}}, \qquad |m - n| > 2k, \qquad (10.19)$$
if the wavelet basis function $\psi(x)$ has its first $k$ moments zero (10.13).
This means that using wavelets satisfying (10.13) sparsifies the matrix representation of the kernel $K(x, y)$. For example, a direct discretization of the 3-D Green's function in (10.15) decays as $1/|m - n|$ as one moves away from the main diagonal $m = n$ in its matrix representation. Using wavelets, however, we can attain the much faster decay rate $1/(1 + |m - n|^{k+1})$ far away from the main diagonal. By neglecting matrix entries less than some threshold (typically 1% of the largest entry), a sparse and mostly banded matrix is obtained. This greatly speeds up the following matrix computations:
1. Multiplication by the matrix, for solving the forward problem of computing the response to a given excitation (as in (10.16));
2. Fast solution of the linear system of equations, for solving the inverse problem of reconstructing the source from a measured response (solving (10.16) as an integral equation). This is typically performed using an iterative algorithm such as the conjugate gradient method. Sparsification is essential for convergence in a reasonable time.
A typical sparsified matrix from an electromagnetics application is shown in Figure 6 of [12]. Battle-Lemarie wavelet basis functions were used to sparsify the Galerkin method matrix in an integral equation for planar dielectric millimeter-wave waveguides, and a 1% threshold was applied (see [12] for details). Note that the matrix is not only sparse but (mostly) banded.
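The effect is easy to reproduce numerically. In the following sketch (ours; the kernel $1/(1 + |x - y|)$, the grid size, and the use of a discrete Haar transform as a stand-in for the smoother wavelets above are all illustrative assumptions), a 2-D orthonormal Haar transform is applied to a Calderon-Zygmund-like matrix and the entries surviving a 1% threshold are counted:

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal discrete Haar transform of size n (n a power of two)."""
    if n == 1:
        return np.array([[1.0]])
    H = haar_matrix(n // 2)
    top = np.kron(H, [1.0, 1.0]) / np.sqrt(2.0)                # scaling (average) rows
    bot = np.kron(np.eye(n // 2), [1.0, -1.0]) / np.sqrt(2.0)  # wavelet (detail) rows
    return np.vstack([top, bot])

n = 256
x = np.arange(n, dtype=float)
K = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))  # Calderon-Zygmund-like kernel
W = haar_matrix(n)
Kw = W @ K @ W.T                                   # 2-D wavelet representation of K
for name, M in (("original", K), ("wavelet", Kw)):
    kept = np.mean(np.abs(M) > 0.01 * np.abs(M).max())  # 1% threshold, as in [12]
    print(f"{name:8s}: {kept:.1%} of entries kept")
```

Since the Haar wavelet has only one vanishing moment ($k = 1$), the decay improves only from $1/|m - n|$ to roughly $1/|m - n|^2$, yet far fewer entries survive the threshold in the wavelet domain; wavelets with more vanishing moments do better still.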
10.3.4 Heuristic Interpretation of Wavelet Sparsification
Why does this sparsification happen? Considerable insight can be gained using (10.13). Let $\hat\psi(\omega)$ be the Fourier transform of the wavelet basis function $\psi(x)$. Since the first $k$ moments of $\psi(x)$ are zero by (10.13), and the $m$th derivative of $\hat\psi(\omega)$ at $\omega = 0$ is proportional to the $m$th moment of $\psi(x)$, the power series of $\hat\psi(\omega)$ around $\omega = 0$ begins with the $\omega^k$ term:
$$\hat\psi(\omega) \approx C \omega^k \quad \text{for small } |\omega|. \qquad (10.20)$$
This shows that for small $|\omega|$, taking the wavelet transform of $f(x)$ is roughly equivalent to taking the $k$th derivative of $f(x)$. This is confirmed by the fact that many wavelet basis functions bear a striking resemblance to the impulse responses of regularized differentiators. Since $K(x, y)$ is assumed to be a Calderon-Zygmund operator, its $k$th derivatives in $x$ and $y$ drop off as $1/|x - y|^{k+1}$. Thus, it is not surprising that the wavelet transform of $K(x, y)$, which roughly takes $k$th derivatives, should drop off as $1/|m - n|^{k+1}$. Of course there is more to it, but this is why it happens.
It is not surprising that $K(x, y)$ can be sparsified by taking advantage of its derivatives being small. To see a more direct way of accomplishing this, apply integration by parts to (10.14) and take the partial derivative with respect to $x$. This gives
$$\frac{dg(x)}{dx} = -\int_{-\infty}^{\infty} \frac{\partial}{\partial x}\frac{\partial}{\partial y} K(x, y) \left(\int_{-\infty}^{y} f(y')\, dy'\right) dy, \qquad (10.21)$$
which will likely sparsify a smooth $K(x, y)$. Of course, higher derivatives can be used until a condition like (10.18) is reached. The operations of integrating $f(y)$ $k$ times and integrating $\partial^k g/\partial x^k$ $k$ times (to get $g(x)$) can be accomplished using $nk \ll n^2$ additions, so considerable savings can result. This is different from using wavelets, but in the same spirit.
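A discrete version of this trick can be checked in a few lines. The sketch below (ours, for illustration, with the same assumed kernel as before) performs one summation by parts: the differenced kernel decays as $1/|x - y|^2$ rather than $1/|x - y|$, the running sum of $f$ costs only $n$ additions, and the discrete identity holds exactly:

```python
import numpy as np

n = 256
x = np.arange(n, dtype=float)
K = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))  # smooth CZ-like kernel (assumed)
f = np.random.randn(n)

F = np.cumsum(f)                   # running integral of f: only n additions
dK = np.diff(K, axis=1)            # ~ dK/dy, decays as 1/|x - y|^2 instead of 1/|x - y|

# Discrete summation by parts: K @ f == K[:, -1] * F[-1] - dK @ F[:-1], exactly.
g = K[:, -1] * F[-1] - dK @ F[:-1]
assert np.allclose(g, K @ f)

# The payoff: a 1% threshold keeps far fewer entries of dK than of K.
print(np.mean(np.abs(dK) > 0.01 * np.abs(dK).max()),
      np.mean(np.abs(K) > 0.01 * np.abs(K).max()))
```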
References
[1] Strassen, V., Gaussian elimination is not optimal, Numer. Math., 13: 354–356, 1969.
[2] Laderman, J.D., A noncommutative algorithm for multiplying 3 × 3 matrices using 23 multiplications, Bull. Am. Math. Soc., 82: 127–128, 1976.
[3] Johnson, R.W. and McLoughlin, A.M., Noncommutative bilinear algorithms for 3 × 3 matrix multiplication, SIAM J. Comput., 15: 595–603, 1986.
[4] Makarov, O.M., A noncommutative algorithm for multiplying 5 × 5 matrices using 102 multiplications, Inform. Proc. Lett., 23: 115–117, 1986.
[5] Pan, V., How can we speed up matrix multiplication?, SIAM Rev., 26(3): 393–415, 1984.
[6] Pan, V., How to Multiply Matrices Faster, Springer-Verlag, New York, 1984.
[7] Bini, D., Capovani, M., Lotti, G., and Romani, F., O(n^2.7799) complexity for matrix multiplication, Inform. Proc. Lett., 8: 234–235, 1979.
[8] Yagle, A.E., Fast algorithms for matrix multiplication using pseudo number theoretic transforms, IEEE Trans. Signal Process., 43: 71–76, 1995.
[9] Nussbaumer, H.J., Fast Fourier Transforms and Convolution Algorithms, Springer-Verlag, Berlin, 1982.
[10] Daubechies, I., Ten Lectures on Wavelets, SIAM, Philadelphia, PA, 1992.
[11] Beylkin, G., Coifman, R., and Rokhlin, V., Fast wavelet transforms and numerical algorithms I, Comm. Pure Appl. Math., 44: 141–183, 1991.
[12] Sabetfakhri, K. and Katehi, L.P.B., Analysis of integrated millimeter wave and submillimeter wave waveguides using orthonormal wavelet expansions, IEEE Trans. Microwave Theor. Technol., 42: 2412–2422, 1994.