Assume that g_i(x) = 1 (hence g_k(x) = 0 for k ≠ i); update expert i based on the output error.
Update the gating network so that g_i(x) is even closer to unity.
Alternatively, a batch training method can be adopted:
1. Apply a clustering algorithm to cluster the set of training samples into n clusters. Use the membership information to train the gating network.
2. Assign each cluster to an expert module and train the corresponding expert module.
3. Fine-tune the performance using gradient-based learning.
Note that the function of the gating network is to partition the feature space into largely disjoint regions and assign each region to an expert module. In this way, an individual expert module only needs to learn a subregion of the feature space and is likely to yield better performance. By combining n expert modules under the gating network, the overall performance is expected to improve.
Figure 1.19 shows an example using the batch training method presented above. The dots are the training and testing samples. The circles are the cluster centers that represent individual experts. These cluster centers are found by applying the k-means clustering algorithm to the training samples. The gating network output is proportional to the inverse of the squared distance from each sample to all three cluster centers. The output value is normalized so that the sum equals unity. Each expert module implements a simple linear model (a straight line in this example). We did not implement the third step, so the results are obtained without fine-tuning. The corresponding MATLAB m-files are moedemo.m and moegate.m.
Figure 1.19 Illustration of a mixture-of-experts network using the batch training method.
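The MATLAB demos are not reproduced here; the following Python sketch of steps 1 and 2 above (k-means clustering, one linear expert per cluster, and inverse-squared-distance gating) uses synthetic data and made-up helper names, and omits the fine-tuning step, so it only illustrates the procedure rather than replicating moedemo.m.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data with three roughly linear regimes (illustrative only).
x = np.sort(rng.uniform(-3.0, 3.0, 300))
y = np.piecewise(x, [x < -1, (x >= -1) & (x < 1), x >= 1],
                 [lambda t: 2 * t + 3, lambda t: -t, lambda t: 0.5 * t - 1.5])
y += 0.1 * rng.standard_normal(x.shape)

# Step 1: k-means clustering of the inputs defines the gating regions.
def kmeans_1d(data, k, iters=50):
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
        centers = np.array([data[labels == j].mean() if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return centers, labels

centers, labels = kmeans_1d(x, k=3)

# Step 2: fit one linear expert (a straight line) per cluster by least squares.
experts = []
for j in range(3):
    A = np.column_stack([x[labels == j], np.ones(np.sum(labels == j))])
    experts.append(np.linalg.lstsq(A, y[labels == j], rcond=None)[0])  # (slope, intercept)

# Gating output: proportional to inverse squared distance to each center, normalized to sum to one.
def gate(xq):
    g = 1.0 / ((xq[:, None] - centers[None, :]) ** 2 + 1e-12)
    return g / g.sum(axis=1, keepdims=True)

def predict(xq):
    expert_out = np.stack([w[0] * xq + w[1] for w in experts], axis=1)
    return (gate(xq) * expert_out).sum(axis=1)

print("training MSE:", np.mean((predict(x) - y) ** 2))
```

The gating here is heuristic; step 3 would instead adjust the gate and the experts jointly by gradient-based learning.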
1.2.6 Support Vector Machines (SVMs)
A support vector machine [14] has the basic format depicted in Figure 1.20, where ϕ_k(x) is a nonlinear transformation of the input feature vector x into a new feature vector in a high-dimensional space, ϕ(x) = [ϕ_1(x) ϕ_2(x) ... ϕ_p(x)]. The output y is computed as:
y(x) = Σ_{k=1}^{p} w_k ϕ_k(x) + b = ϕ(x)w^T + b
where w = [w_1 w_2 ... w_p] is the 1 × p weight vector, and b is the bias term. The dimension of ϕ(x) (= p) is usually much larger than that of the original feature vector (= m). It has been argued that mapping a low-dimensional feature vector into a higher-dimensional feature space will likely make the resulting feature vectors linearly separable. In other words, using ϕ(x) as a feature vector is likely to result in better pattern classification results.
Figure 1.20 An SVM neural network structure.
Given a set of training vectors {x(i); 1 ≤ i ≤ N}, one can solve for the weight vector w as:
w = Σ_{i=1}^{N} γ_i ϕ(x(i)) = γΦ
where Φ = [ϕ(x(1)) ϕ(x(2)) ... ϕ(x(N))]^T is an N × p matrix, and γ is a 1 × N vector. Substituting
w into y(x) yields:
y(x) = ϕ(x)w^T + b = Σ_{i=1}^{N} γ_i ϕ(x)ϕ^T(x(i)) + b = Σ_{i=1}^{N} γ_i K(x, x(i)) + b
where the kernel K(x, x(i)) is a scalar-valued function of the testing sample x and a training sample x(i). For N ≪ p, one may choose to use γ and K(x, x(i)) to evaluate y(x) instead of using w and ϕ(x) explicitly. For this purpose, one must estimate γ and b and identify a set of support vectors {x(i); 1 ≤ i ≤ N} that may be a subset of the entire training set of data samples.
Commonly used kernel functions are summarized in Table 1.4.
TABLE 1.4 List of Commonly Used Kernel Functions for Support Vector Machines (SVMs)

Type of SVM                   K(x, y)                          Comments
Polynomial learning machine   (x^T y + 1)^p                    p: selected a priori
Radial basis function         exp(−‖x − y‖^2 / (2σ^2))         σ^2: selected a priori
Two-layer perceptron          tanh(β_0 x^T y + β_1)            Only some β_0 and β_1 values are feasible
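The kernels of Table 1.4 and the kernel form of y(x) above are straightforward to express directly. The sketch below assumes the coefficients γ, the bias b, and the support vectors are already known (for example, from a trained SVM) and only illustrates the evaluation step; the numerical values are illustrative.

```python
import numpy as np

def poly_kernel(x, y, p=3):
    """Polynomial learning machine: (x'y + 1)^p, with p chosen a priori."""
    return (x @ y + 1.0) ** p

def rbf_kernel(x, y, sigma2=1.0):
    """Radial basis function: exp(-||x - y||^2 / (2*sigma^2)), with sigma^2 chosen a priori."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma2))

def mlp_kernel(x, y, beta0=1.0, beta1=-1.0):
    """Two-layer perceptron: tanh(beta0 * x'y + beta1); only some (beta0, beta1) pairs are feasible."""
    return np.tanh(beta0 * (x @ y) + beta1)

def svm_output(x, support_vectors, gamma, b, kernel=rbf_kernel):
    """y(x) = sum_i gamma_i K(x, x(i)) + b, evaluated over the support vectors."""
    return sum(g * kernel(x, sv) for g, sv in zip(gamma, support_vectors)) + b

# Illustrative values only: two support vectors in R^2 with assumed coefficients.
svs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
gamma = [0.7, -0.7]
print(svm_output(np.array([0.8, 0.1]), svs, gamma, b=0.05))
```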
Figure 1.21 A linearly separable pattern classification example. ρ is the distance from each class to the decision boundary.
To identify the support vectors from a set of training data samples, consider the linearly separable pattern classification example shown in Figure 1.21. According to Cortes and Vapnik [15], the empirical risk is minimized in a linearly separable two-class pattern classification problem, as shown in Figure 1.21, if the decision boundary is located such that the minimum distance from each training sample of each class to the decision boundary is maximized. In other words, the parameter ρ in Figure 1.21 should be maximized subject to the constraints that all "o" class samples lie on one side of the decision boundary and all "x" class samples lie on the other side. This can be formulated as a nonlinear constrained quadratic optimization problem. Using the Karush–Kuhn–Tucker conditions, it can be shown that not all training samples contribute to the determination of the decision boundary. In fact, as shown in Figure 1.21, only those training samples that are closest to the decision boundary (marked with color in the figure) contribute to the solution of w and b. These training samples are then identified as the support vectors.
There are many public domain implementations of SVMs. They include a support vector machine MATLAB toolbox (S.R.Gunn@ecs.soton.ac.uk), a C implementation, SVM_light (http://ais.gmd.de/~thorsten/svm_light/), and a recent release of BSVM (http://www.csie.ntu.edu.tw/~cjlin/). Figure 1.22 shows an example of using an SVM toolbox to solve a linearly separable problem with a radial basis kernel. The three support vectors are labeled with white dots, and the decision boundary and the gap are also illustrated.
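A comparable experiment can also be run today with other open-source libraries. The sketch below uses scikit-learn (not one of the packages listed above) on synthetic two-class data, so the particular support vectors of Figure 1.22 should not be expected to reproduce exactly.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two linearly separable clouds of points (illustrative data).
X = np.vstack([rng.normal([-2, -2], 0.5, (20, 2)),
               rng.normal([2, 2], 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="rbf", C=10.0, gamma=0.5).fit(X, y)
print("number of support vectors:", len(clf.support_vectors_))
print("predicted label for [0, 0.5]:", clf.predict([[0.0, 0.5]])[0])
```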
1.3 Neural Network Solutions to Signal Processing Problems
1.3.1 Digital Signal Processing
In the most general sense, a signal is a physical quantity that is a function of one or more independent variables such as time or spatial coordinates. A signal can be naturally occurring or artificially synthesized. It can be the temperature variations in a building, a stock price quote, the faint radiation from a distant galaxy, or the brain waves from a human body.
How do we use the signals obtained from various measurements? Simply put, a signal carries information. Based on building temperature readings, we may turn the building's heater on or off. Based on a stock price quote, we may buy or sell stocks. The faint radiation from a distant galaxy may reveal the secrets of the universe. Brain waves from a human body may be used to communicate with and control external devices. In short, the purpose of signal processing is to exploit the inherent information
Figure 1.22 Illustration of a support vector machine classification result.
carried by the signal. More specifically, by processing a signal, we can manipulate the information by injecting new information into the signal or by extracting inherent information from it. There are many ways to process signals. One may filter, transform, transmit, estimate, detect, recognize, synthesize, record, or reproduce a signal.
Perhaps the most comprehensive definition of signal processing is the Field of Interest statement of the IEEE (Institute of Electrical and Electronics Engineers) Signal Processing Society, which states that signal processing concerns
theory and application of filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals by digital or analog devices or techniques The term “signal” includes audio, video, speech, image, communications, geophysical, sonar, radar, medical, musical, and other signals.
If a signal is a function of time only, it is a one-dimensional signal. If the time variable is continuous, the corresponding signal is a continuous time signal. Most real world signals are continuous time signals. A continuous time signal can be sampled at regular time intervals to yield a discrete time signal. A discrete time signal can be described by a sequence of numbers. To process a discrete time signal using digital computers, the value of each discrete time sample may also be quantized to finite precision to fit into the internal word length of the computer.
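As a small illustration of these last two steps, the sketch below samples a continuous-time sinusoid at regular intervals and then quantizes each sample to a fixed number of bits; the signal, sampling rate, and word length are arbitrary choices.

```python
import numpy as np

f0, fs, bits = 5.0, 100.0, 8             # signal frequency (Hz), sampling rate (Hz), word length
n = np.arange(64)                         # sample index
x = np.sin(2 * np.pi * f0 * n / fs)       # discrete-time signal: samples of a continuous sinusoid

# Uniform quantization to 'bits' bits over the range [-1, 1].
levels = 2 ** bits
xq = np.round((x + 1.0) / 2.0 * (levels - 1)) / (levels - 1) * 2.0 - 1.0

print("max quantization error:", np.max(np.abs(x - xq)))
```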
1.3.1.1 A Taxonomy of Digital Signal Processing (DSP) Algorithms
A DSP algorithm describes how to process a given signal. Depending on their assumptions about the underlying signal and their mathematical formulations, DSP algorithms can be characterized along a number of different dimensions:
Deterministic vs. statistical signal processing — In a statistical DSP algorithm, it is assumed that the underlying signal is generated from a probabilistic model. No such model is assumed in a deterministic DSP algorithm. Almost all the neural network application examples we encountered concern statistical signal processing applications.
Linear vs. nonlinear signal processing — A linear signal processing algorithm is a linear system (linear operator) operating on the incoming signal. If a particular signal is a weighted sum of two different signals, then the output of this signal after applying a linear operator will also be the weighted sum of the outputs of those two different signals. This superposition property is unique to linear signal processing algorithms; a small numerical check of this property is sketched after this list of dimensions. Neural network applications to signal processing are mostly for nonlinear signal processing algorithms.
Data-adaptive vs. data-independent formulation — A data-independent signal processing algorithm has fixed parameters that do not depend on the specific data samples to be processed. On the other hand, a data-adaptive algorithm adjusts its parameters based on the signal presented to it. Thus, data-adaptive algorithms need a training phase to acquire specific parameter values. Most neural network based signal processing algorithms are data adaptive.
Memoryless vs. dynamic system — The output of a signal processing algorithm may depend on the present input signal as well as on past signal values. Usually, the past signal is summarized as a state vector. Such a system is called a dynamic system and has memory (remembering the past). A memoryless system's output depends only on the present input. While linear dynamic system theory is well developed, nonlinear dynamic system theory that incorporates neural networks is still an ongoing research area.
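As noted under the linear vs. nonlinear dimension above, a linear operator obeys superposition. The sketch below checks this numerically for a moving-average filter; the filter and weights are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(256), rng.standard_normal(256)
a, b = 0.7, -1.3                       # arbitrary weights
h = np.ones(5) / 5.0                   # moving-average filter (a linear operator)

lin = lambda x: np.convolve(x, h, mode="same")

lhs = lin(a * x1 + b * x2)             # filter the weighted sum of the two signals
rhs = a * lin(x1) + b * lin(x2)        # weighted sum of the individually filtered signals
print("superposition holds:", np.allclose(lhs, rhs))
```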
1.3.1.2 Nonlinear Filtering
There are many reports on using artificial neural networks to perform nonlinear filtering of a signal for the purposes of noise reduction and signal enhancement. However, due to the nonlinear nature of the problem, most applications must be developed for a specific training corpus and are data dependent. Therefore, to apply an ANN to nonlinear filtering, one must be able to collect an extensive set of training samples that covers all possible situations and develop a neural network that adapts to the given training set.
For example [16], an MLP-based neural filter was developed to remove quantum noise from X-ray images while at the same time trying to enhance the edges of the images. The purpose is to replace current high-dosage X-ray film with low-dosage X-ray imaging while improving the quality of the image. In this application, high-dosage X-ray film with high-pass filtered edge enhancement is used as the target. A simulated low-dosage X-ray image, derived from the original high-dosage X-ray image, is used as the input to the MLP. The resulting SNR improvement on a testing data set is used to gauge the effectiveness of this approach.
1.3.1.3 Linear Transformations
A linear transformation transforms a block (vector) of a signal into a different vector space where special properties may be exploited. For example, a discrete Fourier transform transforms a time domain signal into frequencies in the frequency domain. A discrete wavelet transform transforms a signal to and from a scale-space representation. A very important application of linear transformation is transform-based signal compression. The original signal is first transformed, using a linear transformation such as the fast Fourier transform, the discrete cosine transform, or the discrete wavelet transform, into the frequency domain. The purpose is to compact the energy of the original signal into a few large frequency coefficients. By encoding these very few large frequency coefficients, the original signal can be compressed with a high compression ratio.
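A toy version of this idea, assuming an FFT as the transform and retaining only the largest-magnitude coefficients, is sketched below; real codecs use block transforms such as the DCT together with quantization and entropy coding.

```python
import numpy as np

rng = np.random.default_rng(0)
n = np.arange(256)
# A signal whose energy is concentrated in a few frequencies, plus a little noise.
x = np.sin(2 * np.pi * 3 * n / 256) + 0.5 * np.sin(2 * np.pi * 17 * n / 256)
x += 0.05 * rng.standard_normal(n.size)

X = np.fft.rfft(x)                        # transform to the frequency domain
keep = 8                                  # keep only the 8 largest coefficients
idx = np.argsort(np.abs(X))[::-1][:keep]
Xc = np.zeros_like(X)
Xc[idx] = X[idx]                          # the "compressed" representation

x_rec = np.fft.irfft(Xc, n=x.size)        # reconstruct from the few retained coefficients
err = np.linalg.norm(x - x_rec) / np.linalg.norm(x)
print(f"kept {keep}/{X.size} coefficients, relative error {err:.3f}")
```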
Another popular data-dependent linear transform is called principal component analysis (PCA) or, sometimes, the Karhunen–Loeve expansion (KL expansion). The main difference between PCA and other types of linear transforms is that the transformation depends on the inherent structure of the data. Hence, PCA can achieve optimal performance in terms of energy compaction. The generalized Hebbian learning neural network structure can be regarded as an online approximation of PCA and hence can be applied to tasks that would require PCA.
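For reference, a batch PCA computed directly from the sample covariance matrix is sketched below on made-up data; the generalized Hebbian algorithm would arrive at the same leading components incrementally, one sample at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 3-D data (illustrative): most energy lies along one direction.
A = np.array([[2.0, 0.3, 0.1], [0.3, 0.5, 0.2], [0.1, 0.2, 0.2]])
X = rng.standard_normal((500, 3)) @ A

Xc = X - X.mean(axis=0)                     # remove the mean
C = Xc.T @ Xc / (len(Xc) - 1)               # sample covariance matrix
w, V = np.linalg.eigh(C)                    # eigen-decomposition (ascending eigenvalues)
order = np.argsort(w)[::-1]
w, V = w[order], V[:, order]

k = 2
Y = Xc @ V[:, :k]                           # project onto the top-k principal components
print("fraction of energy captured:", w[:k].sum() / w.sum())
```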
1.3.1.4 Pattern Classification
Pattern classification is perhaps the most important application of artificial neural networks. In fact, a majority of neural network applications can be categorized as solving complex pattern classification problems. In the area of signal processing, pattern classification has been employed in speech recognition, optical (handwritten) character recognition, bar code recognition, human face recognition, fingerprint recognition, radar/sonar target identification, biomedical signal diagnosis, and numerous other areas.
Given a set of feature vectors {x; x ∈ R^n} of an object of interest, we assume that the (probabilistic) state of nature of each object can be designated with a label ω ∈ Ω, where Ω is the set of all possible labels. We denote the prior probability p(ω) to be the probability that a feature vector is assigned, by the nature of the object, the label ω. We may also define a posterior probability p(ω|x) to be the probability that a feature vector has label ω given the observation of the feature vector x.
A minimum error statistical pattern classifier is one that maps each feature vector x to an element in Ω such that the probability that the mapped label is different from the label assigned by the nature of the object (the probability of misclassification) is minimized. To achieve this minimum error rate, for a given feature vector x, one must
Decide x has label ω_i if p(ω_i|x) > p(ω_j|x) for j ≠ i, ω_i, ω_j ∈ Ω.
In practice, it is very difficult to evaluate the posterior probability in closed form. Instead, one may use an appropriate discriminant function g_i(x) that satisfies
g_i(x) > g_j(x) if p(ω_i|x) > p(ω_j|x) for j ≠ i, ω_i, ω_j ∈ Ω.
Then, minimum error pattern classification can be achieved by
Decide x has label ω_i if g_i(x) > g_j(x) for j ≠ i, ω_i, ω_j ∈ Ω.
The minimum probability of misclassification is also known as the Bayes error, and a minimum error
classifier is also known as a maximum a posteriori probability (MAP) classifier.
In applying the MAP classifier to real world applications, one must find an estimate of the posterior probability p(ω|x) or, equivalently, a discriminant function g(x), based on a set of training data. Thus, a neural network such as the multilayer perceptron can be a good candidate for this purpose. A support vector machine is another neural network structure that directly estimates a discriminant function.
One may apply the Bayes rule to express the posterior probability as:
p(ω|x) = p(x|ω)p(ω)/p(x)
where p(x|ω) is called the likelihood function, p(ω) is the prior probability distribution of the class label ω, and p(x) is the marginal probability distribution of the feature vector x. Since p(x) is independent of ω_i, the MAP decision rule can be expressed as:
Decide x has label ω_i if p(x|ω_i)p(ω_i) > p(x|ω_j)p(ω_j) for j ≠ i, ω_i, ω_j ∈ Ω.
p(ω_i) can be estimated from the training data samples as the percentage of training samples that are labeled ω_i. Thus, only the likelihood function needs to be estimated. One popular model for this purpose is a Gaussian mixture model:
p(x|ω_i) = Σ_{k=1}^{K_i} ν_ki exp(−‖x − m_ki‖^2 / (2σ^2_ki))
The model parameters {(ν_ki, m_ki, σ^2_ki); 1 ≤ k ≤ K_i, 1 ≤ i ≤ C} (C = |Ω|) must be deduced from the training data. Obviously, a radial basis neural network structure will be handy here to model the Gaussian mixture likelihood function.
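The MAP rule with a Gaussian mixture likelihood per class can be written compactly. The sketch below assumes the mixture parameters (ν_ki, m_ki, σ²_ki) and the priors p(ω_i) have already been estimated (for example, by an EM procedure or a radial basis network) and only shows the decision step, using spherical Gaussian components as in the expression above.

```python
import numpy as np

def class_likelihood(x, weights, means, variances):
    """p(x|omega_i) as a weighted sum of spherical Gaussian components."""
    d = x.size
    p = 0.0
    for nu, m, s2 in zip(weights, means, variances):
        norm = (2 * np.pi * s2) ** (-d / 2)
        p += nu * norm * np.exp(-np.sum((x - m) ** 2) / (2 * s2))
    return p

def map_classify(x, mixtures, priors):
    """Decide the label i that maximizes p(x|omega_i) * p(omega_i)."""
    scores = [class_likelihood(x, *mix) * pr for mix, pr in zip(mixtures, priors)]
    return int(np.argmax(scores))

# Illustrative two-class example: one mixture (weights, means, variances) per class.
mixtures = [
    ([0.5, 0.5], [np.array([0.0, 0.0]), np.array([1.0, 1.0])], [0.3, 0.3]),   # class 0
    ([1.0],      [np.array([3.0, 3.0])],                        [0.5]),       # class 1
]
priors = [0.6, 0.4]
print(map_classify(np.array([2.5, 2.8]), mixtures, priors))   # expected label: 1
```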
Since a weighted sum of Gaussian mixture density functions is still a Gaussian mixture density function, one may choose instead to model the marginal distribution p(x) with a Gaussian mixture model. Each individual Gaussian density function in the mixture model is then assigned to a particular class label based on a majority vote of the training samples assigned to that particular Gaussian density function. Additional fine-tuning can be applied to enhance the probability of correct classification. This is the approach implemented in the learning vector quantization (LVQ) neural network. The above discussion is summarized in Table 1.5.
TABLE 1.5 Pattern Classification Methods and Corresponding Neural Network Implementations

Pattern Classification Method                                          Neural Network Implementation
MAP: maximize posterior probability p(ω|x)                             Multilayer perceptron
MAP: maximize discriminant function g(x)                               Support vector machine
ML: maximize product of likelihood function and prior, p(x|ω)p(ω)      Radial basis network, LVQ
1.3.1.5 Detection
Detection can be regarded as a special case of pattern classification in which only two class labels are used: detect or no-detect. The purpose of signal detection is to detect the presence of a known signal in additive noise. It is assumed that the received signal (often a vector) x may consist of the true signal vector s plus an additive statistical noise vector n:
x = s + n
or simply the noise vector:
x = n
Assuming that the probability density function of the noise vector n is known, one may apply a statistical hypothesis testing procedure to determine whether x contains the known signal s. For example, we may calculate the log-likelihood function and compare it to a predefined threshold in order to maximize the probability of detection subject to an upper bound on a prespecified false alarm rate.
One popular assumption is that the noise vector n has a multivariate Gaussian distribution with zero mean and known covariance matrix. In this case, the inner product s^T x is a sufficient statistic, known as the matched filter signal detector.
A single-neuron perceptron can be used to implement the matched filter computation. The signal template s serves as the weight vector, and the observation x is applied as its input. The bias term is the threshold, and the output equals 1 if the presence of the signal is detected. A multilayer perceptron can also be used to implement a nonlinear matched filter if the output activation function is a threshold function. By the same token, a support vector machine is also a plausible neural network structure for realizing a nonlinear matched filter.
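A single-neuron realization of the matched filter is sketched below; the template, noise level, and threshold are illustrative, and in practice the threshold would be set from the desired false alarm rate rather than fixed by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.array([1.0, -1.0, 1.0, 1.0, -1.0])     # known signal template (the weight vector)
threshold = 0.5 * s @ s                       # illustrative threshold (plays the role of the bias)

def detect(x, s, threshold):
    """Single-neuron perceptron: output 1 if the matched filter statistic s'x exceeds the threshold."""
    return int(s @ x - threshold > 0)

noise = lambda: 0.5 * rng.standard_normal(s.size)
x_present = s + noise()    # x = s + n
x_absent = noise()         # x = n

print("signal present ->", detect(x_present, s, threshold))
print("signal absent  ->", detect(x_absent, s, threshold))
```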
1.3.1.6 Time Series Modeling
A time series is a sequence of readings as a function of time. It arises in numerous practical applications, including stock prices, weather readings (e.g., temperature), utility demand, etc. A central issue in time series modeling is to predict future time series outcomes. There are three different ways of predicting a time series {y(t)}:
1. Predicting y(t) based on past observations {y(t − 1), y(t − 2), ...}. That is,
ŷ(t) = E{y(t)|y(t − 1), y(t − 2), ...}
2. Predicting y(t) based on observations of another relevant time series {x(t)}: x(t), x(t − 1), ...:
ŷ(t) = E{y(t)|x(t), x(t − 1), x(t − 2), ...}
3. Predicting y(t) based on both {y(t − k); k = 1, 2, ...} and {x(t − m); m = 0, 1, 2, ...}:
ŷ(t) = E{y(t)|x(t), x(t − 1), x(t − 2), ..., y(t − 1), y(t − 2), ...}
Both {x(t)} and {y(t)} can be vector-valued time series. If the conditional expectation is a linear function, then these formulae lead to three popular linear time series models:
Auto-regressive (AR): y(t) = Σ_{k=1}^{N} a(k) y(t − k) + e(t)
Moving average (MA): y(t) = Σ_{m=0}^{M} b(m) x(t − m)
Auto-regressive moving average (ARMA): y(t) = Σ_{m=0}^{M} b(m) x(t − m) + Σ_{k=1}^{N} a(k) y(t − k) + e(t)
In the AR and ARMA models, e(t) is a zero-mean, uncorrelated innovation process representing a random persistent excitation of the system. Neural network models can be incorporated into these time series models to facilitate nonlinear time series prediction. Specifically, one may use the generalized state vector s as an input to a neural network and obtain the output y(t) from the output of the neural network.
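For reference, the AR coefficients a(k) in the model above can be estimated from an observed series by ordinary least squares. A minimal sketch follows, with an assumed second-order process as the data source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) process: y(t) = 0.6 y(t-1) - 0.3 y(t-2) + e(t).
T, N = 1000, 2
y = np.zeros(T)
for t in range(2, T):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + 0.2 * rng.standard_normal()

# Least-squares estimate of a(1), ..., a(N) from y(t) = sum_k a(k) y(t-k) + e(t).
Y = np.column_stack([y[N - k: T - k] for k in range(1, N + 1)])   # lagged regressors y(t-1), ..., y(t-N)
a_hat = np.linalg.lstsq(Y, y[N:], rcond=None)[0]
print("estimated AR coefficients:", np.round(a_hat, 3))           # close to [0.6, -0.3]
```

Replacing this linear least-squares map with a neural network fed by the same delayed samples yields a nonlinear predictor.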
One such example is the time-delayed neural network (TDNN), which can be described as:
y(n) = ϕ(x(n), x(n − 1), ..., x(n − M))
where ϕ(·) is a nonlinear transformation of its arguments, implemented in the TDNN with a multilayer perceptron.
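A rough illustration of this structure (a tapped delay line feeding a small multilayer perceptron) is sketched below using an off-the-shelf MLP regressor; the series, window length M, and network size are arbitrary choices, the target here is simply one-step-ahead prediction of the same series, and a true TDNN would be trained with backpropagation directly on its own architecture.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.arange(600)
x = np.sin(0.07 * t) + 0.05 * rng.standard_normal(t.size)   # input series x(n)

M = 5  # number of delays
# Build delay-line inputs [x(n), x(n-1), ..., x(n-M)] and one-step-ahead targets y(n) = x(n+1).
X = np.column_stack([x[M - k: -1 - k] for k in range(M + 1)])
y = x[M + 1:]

phi = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0).fit(X, y)
print("one-step prediction MSE:", np.mean((phi.predict(X) - y) ** 2))
```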
1.3.1.7 System Identification
System identification is a modeling problem. Given a black box system, the goal of system identification is to develop a mathematical model to describe the relation between the input and output of the unknown system.
If the system under consideration is memoryless, the implication is that the output of the system is a function of the present input only and bears no relation to past inputs. In this situation, the system identification problem becomes a function approximation problem.
1.3.1.7.1 Function Approximation
Assume a set of training samples {(u(i), y(i))}, where u(i) is the input vector and y(i) is the output vector. The purpose of function approximation is to identify a mapping from u to y, that is, y = ϕ(u), such that the expected sum of squared approximation errors E{|y − ϕ(u)|^2} is minimized.
Neural network structures such as the multilayer perceptron and the radial basis network are both good candidates for realizing the ϕ(u) function.
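A radial basis network with fixed centers reduces this task to a linear least-squares problem. The sketch below approximates an assumed scalar function from noisy samples, with the centers and width chosen by hand rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.linspace(-3, 3, 200)
y = np.sinc(u) + 0.05 * rng.standard_normal(u.size)     # noisy samples of the unknown mapping (illustrative)

centers = np.linspace(-3, 3, 12)                         # fixed RBF centers
width = 0.5

def design(u):
    """Radial basis feature matrix: one Gaussian bump per center."""
    return np.exp(-((u[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

Phi = design(u)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # minimize the sum of |y - phi(u)|^2
y_hat = design(u) @ w
print("approximation MSE:", np.mean((y - y_hat) ** 2))
```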
1.3.1.7.2 Dynamic System Identification
If the system to be identified is a dynamic system, then the present input u(t) alone is not sufficient to determine the output y(t). Instead, y(t) will be a function of both u(t) and a present state vector x(t). The state vector can be regarded as a summary of all the inputs in the past. Unfortunately, for many systems, only inputs and outputs are observable. In this situation, previous outputs within a time window may be used as a generalized state vector.
To derive the mapping from u(t) and x(t) to y(t), one may gather a sufficient amount of training data and then develop a mapping y(t) = ϕ(u(t), x(t)) using, for example, a linear model or a nonlinear model such as an artificial neural network structure. In practice, however, such a training process is conducted using online learning. This is illustrated in Figure 1.23.
Figure 1.23 Illustration of online dynamic system identification. The error e(t) is fed back to the model to update the model parameters θ.
With online learning, the mathematical dynamic model receives the same inputs as the real, unknown system and produces an output ŷ(t) to approximate the true output y(t). The difference between these two quantities is then fed back to update the mathematical model.
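A minimal version of this online loop, with the unknown system taken to be a linear filter and the model parameters θ adjusted by a gradient (LMS-style) update driven by the error e(t), is sketched below; a neural network model would follow the same feedback structure with backpropagation in place of the LMS step.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([0.5, -0.2, 0.1])         # unknown system: y(t) = true_theta . [u(t), u(t-1), u(t-2)]
theta = np.zeros(3)                              # model parameters, updated online
mu = 0.05                                        # learning rate

u = rng.standard_normal(2000)                    # common input driving both the system and the model
for t in range(2, len(u)):
    phi = u[t - 2: t + 1][::-1]                  # regressor [u(t), u(t-1), u(t-2)]
    y = true_theta @ phi                         # output of the real (unknown) system
    y_hat = theta @ phi                          # output of the mathematical model
    e = y - y_hat                                # error e(t) fed back to the model
    theta += mu * e * phi                        # LMS update of the model parameters

print("estimated parameters:", np.round(theta, 3))   # should approach true_theta
```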
1.4 Overview of the Handbook
This handbook is organized into three complementary parts: neural network fundamentals, neural network solutions to statistical signal processing problems, and signal processing applications using neural networks. In the first part, in-depth surveys of recent progress in neural network computing paradigms are presented. Part One consists of five chapters:
• Chapter 1: Introduction to Neural Networks for Signal Processing. This chapter has provided an overview of the topics discussed in this handbook so that the reader is better prepared for the in-depth discussions in later chapters.
• Chapter 2: Signal Processing Using the Multilayer Perceptron. In this chapter, Manry, Chandrasekaran, and Hsieh discuss training strategies for the multilayer perceptron and methods to estimate the testing error from the training error. A potential application of the MLP to flight load synthesis is also presented.
• Chapter 3: Radial Basis Functions. In this chapter, Back presents a complete review of the theory, algorithms, and five real world applications of the radial basis network: time series modeling, option pricing in the financial market, phoneme classification, channel equalization, and symbolic signal processing.
• Chapter 4: An Introduction to Kernel-Based Learning Algorithms. In this chapter, Müller, Mika, Rätsch, Tsuda, and Schölkopf introduce three important kernel-based learning algorithms: the support vector machine, kernel Fisher discriminant analysis, and kernel PCA. In addition to clear theoretical derivations, two impressive signal processing applications, optical character recognition and DNA sequencing analysis, are presented.
• Chapter 5: Committee Machines. Tresp gives three convincing arguments in this chapter as to why a committee machine is important: (a) performance enhancement using averaging, bagging, and boosting; (b) modularity with a mixture of expert networks; and (c) computational complexity reduction, as illustrated with the introduction of a Bayesian committee machine.
The second part of this handbook surveys neural network implementations of important signal processing problems. It includes the following chapters:
• Chapter 6: Dynamic Neural Networks and Optimal Signal Processing. In this chapter, Principe casts the problem of optimal signal processing in terms of the more general mathematical problem of function approximation. Then, a general family of nonlinear filter structures, called dynamic neural networks, consisting of a bank of linear filters followed by static nonlinear operators, is presented. Finally, a discussion of generalized delay operators is given.
• Chapter 7: Blind Signal Separation and Blind Deconvolution. In this chapter, Douglas discusses recent progress in blind signal separation and blind deconvolution. Given two or more mixture signals, the purpose of blind separation and deconvolution is to identify the independent components in a statistical mixture of the signals.
• Chapter 8: Neural Networks and Principal Component Analysis. In this chapter, Diamantaras presents a detailed survey of using neural network Hebbian learning to realize principal component analysis (PCA). Also discussed in this chapter is nonlinear principal component analysis as an extension of conventional PCA.
• Chapter 9: Applications of Artificial Neural Networks to Time Series Prediction. In this chapter, Liao, Moody, and Wu provide a technical overview of neural network approaches to time series prediction problems. Three techniques — sensitivity-based input selection and pruning, constructing a committee prediction model using input feature grouping, and smoothing regularization for recurrent neural networks — are reviewed, and applications to financial time series prediction are discussed.
The last part of this handbook examines signal processing applications and systems that use neural network methods. The chapters in this part include:
• Chapter 10: Applications of ANNs to Speech Processing. Katagiri surveys recent work in applying neural network techniques to aid speech processing tasks. Four topics are discussed: (a) the generalized gradient descent learning method, (b) recurrent neural networks, (c) support vector machines, and (d) signal separation techniques. Instead of just introducing these techniques, the focus is on how to apply them to enhance the performance of current speech processing systems.
• Chapter 11: Learning and Adaptive Characterization of Visual Content in Image Retrieval Systems. In this chapter, Muneesawang, Wong, Lay, and Guan discuss the application of a radial basis network to adaptively characterize the similarity of image content to support content-based image retrieval in modern multimedia signal processing systems.
• Chapter 12: Applications of Neural Networks to Biomedical Image Processing. In this chapter, Adali, Wang, and Li summarize recent progress in applying neural networks to biomedical image processing. Two specific areas, image analysis and computer-assisted diagnosis, are discussed in great detail.
... application of artificial neural networksIn fact, a majority of neural network applications can be categorized as solving complex pattern classification problems In the area of signal processing, ...
1.4 Overview of the Handbook< /b>
This handbook is organized into three complementary parts: neural network fundamentals, neural network solutions to statistical signal processing problems,... problems, and signal processing applications using neural networks In the first part, in-depth surveys of recent progress of neural network computing paradigms are presented Part One consists of five