Volume 2007, Article ID 79821, 23 pages
doi:10.1155/2007/79821
Research Article
Application of the Evidence Procedure to
the Estimation of Wireless Channels
Dmitriy Shutin,1 Gernot Kubin,1 and Bernard H. Fleury2,3
1 Signal Processing and Speech Communication Laboratory, Graz University of Technology, 8010 Graz, Austria
2 Institute of Electronic Systems, Aalborg University, Fredrik Bajers Vej 7A, 9220 Aalborg, Denmark
3 Forschungszentrum Telekommunikation Wien (ftw.), Donau City Strasse 1, 1220 Wien, Austria
Received 5 November 2006; Accepted 8 March 2007
Recommended by Sven Nordholm
We address the application of the Bayesian evidence procedure to the estimation of wireless channels. The proposed scheme is based on relevance vector machines (RVM) originally proposed by M. Tipping. RVMs allow the estimation of the channel parameters as well as the assessment of the number of multipath components constituting the channel within the Bayesian framework, by locally maximizing the evidence integral. We show that, in the case of channel sounding using pulse-compression techniques, it is possible to cast the channel model as a general linear model, thus allowing RVM methods to be applied. We extend the original RVM algorithm to the multiple-observation/multiple-sensor scenario by proposing a new graphical model to represent multipath components. Through the analysis of the evidence procedure we develop a thresholding algorithm that is used in estimating the number of components. We also discuss the relationship of the evidence procedure to the standard minimum description length (MDL) criterion. We show that the maximum of the evidence corresponds to the minimum of the MDL criterion. The applicability of the proposed scheme is demonstrated with synthetic as well as real-world channel measurements, and a performance increase over the conventional MDL criterion applied to maximum-likelihood estimates of the channel parameters is observed.

Copyright © 2007 Dmitriy Shutin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 INTRODUCTION
Deep understanding of wireless channels is an essential prerequisite to satisfy the ever-growing demand for fast information access over wireless systems. A wireless channel contains explicitly or implicitly all the information about the propagation environment. To ensure reliable communication, the transceiver should be constantly aware of the channel state. In order to make this task feasible, accurate channel models, which reproduce the channel behavior in a realistic manner, are required. However, efficient joint estimation of the channel parameters, for example, the number of multipath components (model order), their relative delays, Doppler frequencies, directions of the impinging wavefronts, and polarizations, is a particularly difficult task. It often leads to analytically intractable and computationally very expensive optimization procedures. The problem is often relaxed by assuming that the number of multipath components is fixed, which simplifies optimization in many cases [1, 2]. However, both underspecifying and overspecifying the model order leads to significant performance degradation: residual intersymbol interference impairs the performance of the decoder in the former case, while additive noise is injected in the channel equalizer in the latter: the excessive components amount only to the random fluctuations of the background noise. To amend this situation, empirical methods like cross-validation can be employed (see, e.g., [3]). Cross-validation selects the optimal model by measuring its performance over a validation data set and selecting the one that performs best. In case of practical multipath channels, such data sets are often unavailable due to the time variability of the channel impulse responses. Alternatively, one can employ model selection schemes in the spirit of Ockham's razor principle: simple models (in terms of the number of parameters involved) are preferred over more complex ones. Examples are the Akaike information criterion (AIC) and the minimum description length (MDL) [4, 5]. In this paper, we show how the Ockham principle can be effectively used to perform estimation of the channel parameters coupled with estimating the model order, that is, the number of wavefronts.
Consider a certain class of parametric models (hypotheses) H_i defined as the collection of prior distributions p(w_i | H_i) for the model parameters w_i. Given the measurement data Z and a family of conditional distributions p(Z | w_i, H_i), our goal is to infer the hypothesis H and the corresponding parameters w that maximize the posterior

(w, H) = arg max_{w_i, H_i} p(w_i, H_i | Z). (1)
The key to solving (1) lies in inferring the corresponding parameters w_i and H_i from the data Z, which is often a nontrivial task. As far as the Bayesian methodology is concerned, there are two ways this inference problem can be solved [6, Section 5]. In the joint estimation method, the posterior p(w_i, H_i | Z) is maximized jointly with respect to the quantities of interest w_i and H_i. This often leads to computationally intractable optimization algorithms. Alternatively, one can rewrite the posterior p(w_i, H_i | Z) as

p(w_i, H_i | Z) = p(w_i | Z, H_i) p(H_i | Z) (2)

and maximize each term on the right-hand side sequentially from right to left. This approach is known as the marginal estimation method. Marginal estimation methods (MEM) are well exemplified by expectation-maximization (EM) algorithms and are used in many different signal processing applications (see [2, 3, 7]). MEMs are usually easier to compute; however, they are prone to land in a local rather than a global optimum. We recognize the first factor on the right-hand side of (2) as a parameter posterior, while the other one is a posterior for different model hypotheses. It is the maximization of p(H_i | Z) that guides our model selection decision.
Then, the data analysis consists of two steps [8, Chapter 28]. In the first stage we fit each model to the data by computing the parameter posterior

p(w_i | Z, H_i) = p(Z | w_i, H_i) p(w_i | H_i) / p(Z | H_i), (3)

and in the second stage we rank the hypotheses according to their posterior

p(H_i | Z) ∝ p(Z | H_i) p(H_i) ≡ Evidence × Hypothesis Prior. (4)
In the second stage, p(H_i) measures our subjective prior over different hypotheses before the data is observed. In many cases it is reasonable to assign equal probabilities to different hypotheses, thus reducing the hypothesis selection to selecting the model with the highest evidence p(Z | H_i). The evidence can be expressed as the following integral:

p(Z | H_i) = ∫ p(Z | w_i, H_i) p(w_i | H_i) dw_i. (5)

Maximizing (5) with respect to the model hypotheses is known as evidence maximization or the evidence procedure (EP) [13, 14].
Equations (3), (4), and (5) form the theoretical framework for our joint model and parameter estimation. The estimation algorithm is based on relevance vector machines. Relevance vector machines (RVM), originally proposed by Tipping [15], are an example of the marginal estimation method that, for a set of hypotheses H_i, iteratively approximates (1) by alternating between the model selection, that is, maximizing (5) with respect to H_i, and inferring the corresponding model parameters from the maximization of (3). RVMs have been initially proposed to find sparse solutions to general linear problems. However, they can be quite effectively adapted to the estimation of the impulse response of wireless channels, thus resulting in an effective channel parameter estimation and model selection scheme within the Bayesian framework.

The material presented in the paper is organized as follows: Section 2 introduces the signal model of the wireless channel and the used notation; Section 3 explains the framework of the EP in the context of wireless channels. In Section 4 we explain how model selection is implemented within the presented framework and discuss the relationship between the EP and the MDL criterion for model selection. Finally, Section 5 presents some application results illustrating the performance of the RVM-based estimator in synthetic as well as in actual wireless environments.
2 CHANNEL ESTIMATION USING PULSE-COMPRESSION TECHNIQUE
Channel estimation usually consists of two steps: (1) sending a specific sounding sequence s(t) through the channel and observing the response y(t) at the other end, and (2) estimating the channel parameters from the matched-filtered received signal z(t) (Figure 1). It is common to represent the multipath channel response as the sum of delayed and weighted Dirac impulses, with each impulse representing one individual multipath component (see, e.g., [16, Section 5]). Such special structure of the channel impulse response implies that the filtered signal z(t) should have a sparse structure. Unfortunately, this sparse structure is often obscured by additive noise and temporal dispersion due to the finite bandwidth of the transmitter and receiver hardware. This
Figure 2: Sounding sequence s(t).
motivates the application of algorithms capable of recovering this sparse structure from the measurement data.
Let us consider the equivalent baseband channel sounding scheme shown in Figure 1. The sounding signal s(t) (Figure 2) consists of periodically repeated burst waveforms u(t); each burst has duration T_u ≤ T_f and is formed as u(t) = Σ_{m=0}^{M−1} b_m p(t − mT_p). The sequence b_0 ⋯ b_{M−1} is the known sounding sequence consisting of M chips, and p(t) is the shaping pulse of duration T_p, MT_p = T_u. Furthermore, we assume that the receiver (Rx) is equipped with a planar antenna array consisting of P sensors located at positions s_1, …, s_P ∈ R² with respect to an arbitrary reference point. Let us now assume that the maximum absolute Doppler frequency of the impinging waves is much smaller than the inverse of a single burst duration 1/T_u. This low Doppler frequency assumption is equivalent to assuming that, within a single observation window equivalent to the period of the sounding sequence, we can safely neglect the influence of the Doppler shifts.
The received signal vector y(t) ∈ C^{P×1} for a single burst is then given as

y(t) = Σ_{l=1}^{L} a_l e^{j2πν_l t} c(φ_l) u(t − τ_l) + η(t). (6)

Here, a_l, τ_l, and ν_l are, respectively, the complex gain, the delay, and the Doppler shift of the lth multipath component. The P-dimensional complex vector c(φ_l) is the steering vector of the array. Provided the coupling between the elements can be neglected, its components are given as c_p(φ_l) = f_p(φ_l) exp(j2πλ^{−1}⟨e(φ_l), s_p⟩), with λ, e(φ_l), and f_p(φ_l) denoting the wavelength, the unit vector in R² pointing in the direction of the incoming wavefront determined by the azimuth φ_l, and the complex electric field pattern of the pth sensor, respectively. The additive term η(t) ∈ C^{P×1} is a vector-valued complex white Gaussian noise process, that is, the components of η(t) are independent complex Gaussian processes with double-sided spectral density N_0.
The receiver front-end consists of a matched filter (MF) matched to the transmitted sequence u(t). Under the low Doppler frequency assumption the term e^{j2πν_l t} stays time-invariant within a single burst duration, that is, equal to a complex constant that can be incorporated in the complex gain a_l. The signal z(t) at the output of the MF is then given as

z(t) = Σ_{l=1}^{L} a_l c(φ_l) R_uu(t − τ_l) + ξ(t), (7)

where R_uu(t) is the autocorrelation function of the burst waveform u(t) and ξ(t) = ∫ η(t′) u*(t′ − t) dt′ is a spatially white P-dimensional vector with each element being a zero-mean wide-sense stationary (WSS) Gaussian noise process with autocorrelation function

E[ξ_p(t + τ) ξ_p(t)*] = N_0 R_uu(τ). (8)

Thus, z(t) is a superposition of scaled and delayed kernel functions R_uu(t − τ_l), weighted across sensors as given by the components of c(φ_l) and observed in the presence of the colored noise ξ(t).
In practice, however, the output of the MF is sampled with the sampling period T_s ≤ T_p, resulting in P N-tuples of the MF output, where N is the number of MF output samples. By collecting the output of each sensor into a vector, we can rewrite (7) in a vector form:

z_p = K w_p + ξ_p, p = 1, …, P, (9)

where we have defined

z_p = [z_p(0), z_p(T_s), …, z_p((N − 1)T_s)]^T, w_p = [w_{p,1}, …, w_{p,L}]^T, with w_{p,l} = a_l c_p(φ_l). (10)

The additive noise vectors ξ_p, p = 1, …, P, possess the following properties that will be exploited later:

E[ξ_p] = 0, E[ξ_m ξ_k^H] = 0, for m ≠ k, (11)

E[ξ_p ξ_p^H] = Σ = N_0 Λ = β^{−1} Λ, (12)

where [Λ]_{m,n} = R_uu((m − n)T_s). The matrix K, also called the design matrix, accumulates the shifted and sampled versions of the kernel function R_uu(t). It is constructed as K = [r_1, …, r_L], with r_l = [R_uu(−τ_l), R_uu(T_s − τ_l), …, R_uu((N − 1)T_s − τ_l)]^T.
In general, the channel estimation problem is posed as follows: given the measured sampled signals z_p, p = 1, …, P, determine the order L of the model and estimate optimally (with respect to some quality criterion) all multipath parameters a_l, τ_l, and φ_l, for l = 1, …, L. In this contribution, we restrict ourselves to the estimation of the model order L along with the vector w_p, rather than of the constituting parameters τ_l, φ_l, and a_l. We will also quantize, although arbitrarily fine,2 the search space for the multipath delays τ_l. Thus, we

2 There is actually a limit beyond which it makes no sense to make the search grid finer, since it will not decrease the variance of the estimates, which is lower-bounded by the Cramér-Rao bound [2].
do not try to estimate the path delays with infinite resolution, but rather fix the delay values to be located on a grid with a given mesh determining the quantization error. The size of the delay search space L_0 and the resulting quantized delays T = {T_1, …, T_{L_0}} form the initial model hypothesis H_0, which manifests itself in the L_0 columns of the design matrix K. This allows us to formulate the channel estimation problem as a standard linear problem to which the RVM algorithm can be applied; a minimal construction of K is sketched below.
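For illustration only, the following minimal NumPy sketch assembles the design matrix K of (9) from a sampled kernel and a quantized delay grid. The triangular autocorrelation used here and all parameter values are hypothetical placeholders, not the sounding waveform of the measurements.

    import numpy as np

    def build_design_matrix(Ruu, N, Ts, delays):
        # K = [r_1, ..., r_L0], r_l = [Ruu(-T_l), Ruu(Ts - T_l), ..., Ruu((N-1)Ts - T_l)]^T
        t = np.arange(N) * Ts                  # sampling instants of the MF output
        return np.stack([Ruu(t - Tl) for Tl in delays], axis=1)

    # Illustrative use with a hypothetical triangular (chip-level) autocorrelation:
    Tp = 1e-8                                  # chip duration (placeholder)
    Ruu = lambda t: np.maximum(0.0, 1.0 - np.abs(t) / Tp)
    K = build_design_matrix(Ruu, N=256, Ts=Tp, delays=np.arange(64) * Tp)   # N x L0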
As can be seen, our idea lies in finding the closest approximation of the continuous-time model (7) with the discrete-time equivalent (9). By incorporating the model selection in the analysis, we also strive to find the most compact representation (in terms of the number of components), while preserving good approximation quality. Thus, our goal is to estimate the channel parameters w_p as well as to determine how many multipath components L ≤ L_0 are present in the measured impulse response. The application of the RVM framework to solve this problem follows in the next section.
3 EVIDENCE MAXIMIZATION, RELEVANCE VECTOR
MACHINES, AND WIRELESS CHANNELS
We begin our analysis following the steps outlined in Section 1. In order to ease the algorithm description we first assume that P = 1, that is, only a single sensor is used. Extensions to the case P > 1 are carried out later in Section 3.2. To simplify the notation we also drop the subscript index p in what follows.
From (9) it follows that the observation vector z is a linear combination of the vectors from the column-space of K, weighted according to the parameters w and embedded in the correlated noise ξ. In order to correctly assess the order of the model, it is imperative to take the noise process into account. It follows from (12) that the covariance matrix of the noise is proportional to the unknown spectral height N_0, which should therefore be estimated from the data. Thus, the model hypotheses H_i should include the term N_0. In the following analysis we assume that β = N_0^{−1} is Gamma-distributed [15], with the corresponding probability density function (pdf) given as

p(β | κ, υ) = (κ^υ / Γ(υ)) β^{υ−1} exp(−κβ), (13)
with parameters κ and υ predefined so that (13) accurately reflects our a priori information about N_0. In the absence of any a priori knowledge one can make use of a noninformative (i.e., flat in the logarithmic domain) prior by fixing the parameters to small values κ = υ = 10^{−4} [15]. Furthermore, to steer the model selection mechanism, we introduce an extra parameter (hyperparameter) α_l, l = 1, …, L_0, for each column in K. This parameter measures the contribution or relevance of the corresponding weight w_l in explaining the data z from the likelihood p(z | w_i, H_i). This is achieved by specifying the prior p(w | α) for the model weights:

p(w | α) = Π_{l=1}^{L_0} (α_l / π) exp(−α_l |w_l|²). (14)
High values of α_l will render the contribution of the corresponding column in the matrix K "irrelevant," since the weight w_l is likely to have a very small value (hence they are termed relevance hyperparameters). This will enable us to prune the model by setting the corresponding weight w_l to zero, thus effectively removing the corresponding column from the matrix and the corresponding delay T_l from the delay search space T. We also see that α_l^{−1} is nothing other than the prior variance of the model weight w_l. Also note that the prior (14) implicitly assumes statistical independence of the multipath contributions.
To complete the Bayesian framework, we also specify the prior over the hyperparameters. Similarly to the noise contribution, we assume the hyperparameters α_l to be Gamma-distributed with the corresponding pdf

p(α_l | ζ, ε) = (ζ^ε / Γ(ε)) α_l^{ε−1} exp(−ζ α_l), (15)

where, in the absence of any prior knowledge, the parameters ζ and ε are again fixed to small values so that (15) approaches a noninformative prior.

Now, let us define the hypothesis H_i more formally. Let P(S) be the power set consisting of all possible subsets of basis vector indices S = {1, …, L_0}, and i → P(i) the indexing of P(S) such that P(0) = S. Then for each index value i the hypothesis H_i is the set H_i = {β; α_j, j ∈ P(i)}. Clearly, the initial hypothesis H_0 = {β; α_j, j ∈ S} includes all possible potential basis functions.
Now we are ready to outline the learning algorithm that estimates the model parameters w, β, and hyperparameters α from the measurement data z. The joint posterior of all unknowns factors as

p(w, α, β | z) = p(w | z, α, β) p(α, β | z). (16)

The explicit dependence on the hypothesis index i has been dropped to simplify the notation. We recognize that the first term p(w | z, α, β) in (16) is the weight posterior and the other one p(α, β | z) is the hypothesis posterior. From this point we can start with the Bayesian two-step analysis as has been indicated before.
Assuming the parameters α and β are known, estimation of the model parameters consists of finding values w that maximize p(w | z, α, β). Using Bayes' rule we can rewrite this posterior as

p(w | z, α, β) ∝ p(z | w, α, β) p(w | α, β). (17)

Consider the Bayesian graphical model [17] in Figure 3. This graph captures the relationship between the different variables involved in (16). It is a useful tool to represent the dependencies among the variables involved in the analysis.
It immediately follows from the structure of the graph in Figure 3 that p(z | w, α, β) = p(z | w, β) and p(w | α, β) = p(w | α), that is, z and α are conditionally independent given w and β, and w and β are conditionally independent given α. Thus, (17) is equivalent to

p(w | z, α, β) ∝ p(z | w, β) p(w | α), (18)

where the second factor on the right-hand side is given in (14). The first term is the likelihood of w and β given the data. From (9) it follows that

p(z | w, β) = (β^N / (π^N det Λ)) exp(−(z − Kw)^H βΛ^{−1} (z − Kw)). (19)
Since both right-hand factors in (18) are Gaussian densities, p(w | z, α, β) is also a Gaussian density, with the covariance matrix Φ and mean μ given as

Φ = (A + βK^H Λ^{−1} K)^{−1}, (20)

μ = βΦK^H Λ^{−1} z. (21)
The matrix A = diag(α) is a diagonal matrix that contains the evidence parameters α_l on its main diagonal. Clearly, μ is a maximum a posteriori (MAP) estimate of the parameter vector w under the hypothesis H_i, with Φ being the covariance matrix of the resulting estimates. This completes the model fitting step.
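As an illustration, a minimal NumPy sketch of the model-fitting step (20)-(21); it assumes the design matrix K, the noise correlation matrix Λ (Lam), the measurement z, and current values of α and β are given, and is a direct transcription of the formulas rather than an optimized implementation.

    import numpy as np

    def weight_posterior(K, Lam, alpha, beta, z):
        # Posterior covariance (20) and MAP mean (21) of the weights w
        KH_Linv = K.conj().T @ np.linalg.inv(Lam)                  # K^H Lambda^{-1}
        Phi = np.linalg.inv(np.diag(alpha) + beta * KH_Linv @ K)   # (20)
        mu = beta * Phi @ KH_Linv @ z                              # (21)
        return Phi, mu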
Our next step is to find the parameters α and β that maximize the hypothesis posterior p(α, β | z) in (16). This density function can be represented as p(α, β | z) ∝ p(z | α, β) p(α, β), where p(z | α, β) is the evidence term and p(α, β) is the prior over the hypothesis parameters. As mentioned earlier, it is quite reasonable to choose noninformative priors, since we would like to give all possible hypotheses H_i an equal chance of being valid. This can be achieved by letting the parameters of the hyperpriors (13) and (15) tend to zero. It can be concluded (see derivations in the appendix) that the maximum of the evidence p(z | α, β) coincides with the maximum of p(z | α, β)p(α, β) when ζ = ε = κ = υ = 0, which effectively results in the noninformative hyperpriors for α and β.
This formulation of prior distributions is related to automatic relevance determination (ARD) [14, 18]. As a consequence of this assumption, the maximization of the model posterior is equivalent to the maximization of the evidence, which is known as the evidence procedure [13].
The evidence term p(z | α, β) can be expressed as

p(z | α, β) = ∫ p(z | w, β) p(w | α) dw = (1 / (π^N det(β^{−1}Λ + KA^{−1}K^H))) exp(−z^H (β^{−1}Λ + KA^{−1}K^H)^{−1} z). (22)

Maximization of (22) with respect to the hyperparameters α and β is a type-II maximum likelihood method [19]. To ease the optimization, several terms in (22) can be expressed as a function of the weight posterior parameters μ and Φ as given by (20) and (21). Then, by taking the derivatives of the logarithm of (22) with respect to α and β and by setting them to zero, we obtain its maximizing values as (see also the appendix)

α_l = 1 / (|μ_l|² + Φ_{ll}), (23)

β = N / ((z − Kμ)^H Λ^{−1} (z − Kμ) + tr(ΦK^H Λ^{−1} K)). (24)

In (23), μ_l and Φ_{ll} denote the lth element of, respectively, the vector μ and the main diagonal of the matrix Φ. Unlike the maximizing values obtained in the original RVM paper [15, equation (18)], (24) is derived for the extended, more general case of colored additive noise ξ with the corresponding covariance matrix β^{−1}Λ arising due to the MF processing at the receiver. Clearly, if the noise is assumed to be white, expressions (23) and (24) coincide with those derived in [15]. Also note that α and β are dependent, as can be seen from (23) and (24).
Thus, for a particular hypothesis H_i the learning algorithm proceeds by repeated application of (20) and (21), alternated with the update of the corresponding evidence parameters α_i and β from (23) and (24), as depicted in Figure 4, until some suitable convergence criterion has been satisfied. Provided a good initialization of α_i^{[0]} and β^{[0]} is chosen,3 the scheme in Figure 4 converges after j iterations to the stationary point of the system of coupled equations (20), (21), (23), and (24). Then, the maximization (1) is performed by selecting the hypothesis that results in the highest posterior (2). In practice, however, we will observe that during the reestimation some of the hyperparameters α_l diverge or, in fact, become numerically indistinguishable from infinity given the computer accuracy.4 The divergence of some of the hyperparameters enables us to approximate (1) by performing an

3 Later in Section 5 we consider several rules for initializing the hyperparameters.

4 In the finite sample size case, however, this will only happen in the high SNR regime. Otherwise, α_l will take large but still finite values. In Section 4.1 we elaborate more on the conditions that lead to convergence/divergence of this learning scheme.
Figure 4: Iterative learning of the parameters; the superscript [j] denotes the iteration index.
on-line model selection: starting from the initial hypothesis H_0, we prune the hyperparameters that become larger than a certain threshold as the iterations proceed by setting them to infinity. In turn, this sets the corresponding coefficient w_l to zero, thus "switching off" the lth column in the kernel matrix K and removing the delay T_l from the search space T. This effectively implements the model selection by creating smaller hypotheses H_i ⊂ H_0 (with fewer basis functions) without performing an exhaustive search over all the possibilities. The choice of the threshold will be discussed in Section 4. A sketch of the resulting iteration with pruning is given below.
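The following schematic sketch summarizes the single-channel procedure of Figure 4, using the EM-type updates (23)-(24) reproduced above; the threshold alpha_th, the initialization, and the convergence test are illustrative choices, not prescriptions from the original algorithm.

    import numpy as np

    def rvm_channel(K, Lam, z, alpha_th=1e6, max_iter=200, tol=1e-6):
        # Single-channel evidence procedure with on-line basis pruning (sketch)
        N, L0 = K.shape
        keep = np.arange(L0)              # surviving columns of K (current hypothesis)
        alpha = np.ones(L0)               # initial hyperparameters alpha^[0]
        beta = 1.0 / np.var(z)            # crude initial noise precision beta^[0]
        Lam_inv = np.linalg.inv(Lam)
        for _ in range(max_iter):
            Kk = K[:, keep]
            KH_Linv = Kk.conj().T @ Lam_inv
            Phi = np.linalg.inv(np.diag(alpha[keep]) + beta * KH_Linv @ Kk)  # (20)
            mu = beta * Phi @ KH_Linv @ z                                    # (21)
            alpha_new = 1.0 / (np.abs(mu) ** 2 + np.real(np.diag(Phi)))      # (23)
            r = z - Kk @ mu
            beta = N / np.real(r.conj() @ Lam_inv @ r
                               + np.trace(Phi @ KH_Linv @ Kk))               # (24)
            converged = np.allclose(alpha_new, alpha[keep], rtol=tol)
            alpha[keep] = alpha_new
            keep = keep[alpha[keep] < alpha_th]   # prune "irrelevant" components
            if converged or keep.size == 0:
                break
        return keep, alpha, beta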
3.2 Extensions to multiple channel observations
In this subsection we extend the above analysis to multiple channel observations or multiple antenna systems. When detecting multipath components, any additional channel measurement (either in time, by observing several periods of the sounding sequence u(t), or in space, by using a multiple-sensor antenna) can be used to increase the detection quality. Of course, it is important to make sure that the multipath components are time-invariant within the observation interval. The basic idea of how to incorporate several channel observations is quite simple: in the original formulation each hyperparameter α_l was used to control a single weight w_l and thus the single component. Having several channel observations, a single hyperparameter α_l now controls the weights representing the contribution of the same physical multipath component, but present in the different channel observations.
Usage of a single parameter in this case expresses the channel coherence property in the Bayesian framework. The corresponding graphical model that illustrates this idea for a single hyperparameter α_l is depicted in Figure 5. It is interesting to note that similar ideas, though in a totally different context, were adapted to train neural networks by allowing a single hyperparameter to control a group of weights [18]. Note that it is also possible to introduce an individual hyperparameter α_{p,l} for each weight w_{p,l}, but this eventually decouples the problem into P separate one-dimensional problems and, as the result, any dependency between the consecutive channels is ignored.
Now, let us return to (9). It can be seen that the weights w_p capture the structure induced by multiple antennas. However, for the moment we ignore this structure and treat the components of w_p as a wide-sense stationary (WSS) process over the individual channels, p = 1, …, P. We will also allow each sensor to have a different MF. This might not necessarily be the case for wireless channel sounding, but thus a more general situation can be considered. Different matched filters result in different design matrices K_p, and thus different noise covariance matrices Σ_p, p = 1, …, P. We will however require that the variance of the input noise remains the same and equals N_0 = β^{−1} for all channels, so that Σ_p = N_0 Λ_p, and the noise components are statistically independent among the channels. Then, by defining

z = [z_1^T, …, z_P^T]^T, ξ = [ξ_1^T, …, ξ_P^T]^T, w = [w_1^T, …, w_P^T]^T, K = diag(K_1, …, K_P), (25)

we can rewrite (9) compactly as

z = Kw + ξ. (26)

The sharing of a single hyperparameter α_l among the weights w_{1,l}, …, w_{P,l} representing the lth multipath component is reflected in the structure of the matrix A. This will have a corresponding effect on the hyperparameter reestimation algorithm. From the structural equivalence of (9) and (26) we can easily infer that (20) and (21) are modified as follows:

Φ_p = (A + βK_p^H Λ_p^{−1} K_p)^{−1}, (27)

μ_p = βΦ_p K_p^H Λ_p^{−1} z_p, p = 1, …, P. (28)
The expressions for the hyperparameter updates become a bit more complicated, but are still straightforward to compute. It is shown in the appendix that

α_l = P / Σ_{p=1}^{P} (|μ_{p,l}|² + Φ_{p,ll}), (29)

β = NP / Σ_{p=1}^{P} ((z_p − K_p μ_p)^H Λ_p^{−1} (z_p − K_p μ_p) + tr(Φ_p K_p^H Λ_p^{−1} K_p)), (30)
where μ_{p,l} is the lth element of the MAP estimate of the parameter vector w_p given by (28), and Φ_{p,ll} is the lth element on the main diagonal of Φ_p from (27). Comparing the latter expressions with those developed for the single-channel case, we observe that (29) and (30) use multiple channels to improve the estimates of the noise spectral height and the channel weight hyperparameters. They also offer more insight into the physical meaning of the hyperparameters α. On the one hand, the hyperparameters are used to regularize the matrix inversion (27), needed to obtain the MAP estimates of the parameters w_{p,l} and their corresponding variances. On the other hand, they act as the inverse of the second noncentral moments of the coefficients w_{p,l}, as can be seen from (29). A compact sketch of one multichannel iteration is given below.
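A minimal sketch of one multichannel iteration, assuming for simplicity that all P channels share the same design matrix K and noise matrix Λ (the general case uses per-channel K_p and Λ_p); Z holds the observations z_p as columns. All variable names are placeholders mirroring the symbols in the text.

    import numpy as np

    def multichannel_update(K, Lam, Z, alpha, beta):
        # Per-channel posteriors (27)-(28) and shared hyperparameter updates (29)-(30)
        N, P = Z.shape
        Lam_inv = np.linalg.inv(Lam)
        KH_Linv = K.conj().T @ Lam_inv
        Phi = np.linalg.inv(np.diag(alpha) + beta * KH_Linv @ K)  # (27), same for all p here
        Mu = beta * Phi @ KH_Linv @ Z                             # (28), column p is mu_p
        # (29): a single alpha_l is shared by the weights w_{1,l}, ..., w_{P,l}
        alpha_new = P / (np.sum(np.abs(Mu) ** 2, axis=1) + P * np.real(np.diag(Phi)))
        # (30): noise precision pooled over all P channels
        R = Z - K @ Mu
        beta_new = N * P / np.real(np.einsum('np,nm,mp->', R.conj(), Lam_inv, R)
                                   + P * np.trace(Phi @ KH_Linv @ K))
        return Phi, Mu, alpha_new, beta_new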
4 MODEL SELECTION AND BASIS PRUNING
The ability to select the best model to represent the measured data is an important feature of the proposed scheme, and thus it is paramount to consider in more detail how the model selection is effectively achieved. In Section 3.1 we have briefly mentioned that during the learning phase many of the hyperparameters α_l tend to large values, meaning that the corresponding weights w_l will cluster around zero according to the prior (14). This will allow us to set these coefficients to zero, thus effectively pruning the corresponding basis function from the design matrix. However, the question how large a hyperparameter has to grow in order to prune its corresponding basis function has not yet been discussed. In the original RVM paper [15], the author suggests using a threshold α_th to prune the model. The empirical evidence collected by the author suggests setting the threshold to "a sufficiently large number" (e.g., α_th = 10^{12}). However, our theoretical analysis presented in the following section will show that such high thresholds are only meaningful in very high SNR regimes, or if the number of channel observations P is sufficiently large. In more general, and often more realistic, scenarios such high thresholds are absolutely impractical. Thus, there is a need to study the model selection problem in the context of the presented approach more rigorously.
Below, we present two methods for implementing model selection within the proposed algorithm. The first method relies on the statistical properties of the hyperparameters α_l when the update equations (27), (28), (29), and (30) converge to a stationary point. The second method exploits the relationship that we will establish between the proposed scheme and the minimum description length principle [4, 8, 20, 21], thus linking the EP to this classical model selection approach.
4.1 Statistical analysis of the hyperparameters
in the stationary point
The decision to keep or to prune a basis function from the design matrix is based purely on the value of the corresponding hyperparameter α_l. In the following we analyze the convergence properties of the iterative learning scheme depicted in Figure 4 using expressions (27), (28), (29), and (30), and the resulting distribution of the hyperparameters once convergence is achieved.

We start our analysis of the evidence parameters α_l by making some simplifications to make the derivations tractable.

(ii) The same MF is used to process each of the P sensor output signals, that is, K_p = K and Σ_p = Σ = β^{−1}Λ, p = 1, …, P.
(iii) The noise covariance matrix Σ is known, and B = Σ^{−1}.
(iv) We assume the presence of a single multipath component, that is, L = 1, with known delay τ. Thus, the design matrix is given as K = [r(τ)], where r(τ) = [R_uu(−τ), R_uu(T_s − τ), …, R_uu((N − 1)T_s − τ)]^T is the associated basis function.
(v) The hyperparameter associated with this component is denoted as α.
Our goal is to consider the steady-state solution α_∞ for the hyperparameter α in this simplified scenario.5 In this case (27) and (28) simplify to

Φ_p = (α + r^H B r)^{−1}, (31)

μ_p = (α + r^H B r)^{−1} r^H B z_p, p = 1, …, P, (32)

where we write r = r(τ) for brevity.

5 Recall that α^{−1} is the prior variance of the corresponding parameter w. This constrains α to be nonnegative.
A fixed-point analysis of (32) reveals that (29) converges to the stationary point

α_∞ = (r^H B r)² / ((1/P) Σ_{p=1}^{P} |r^H B z_p|² − r^H B r) (33)

if and only if the denominator of (33) is positive:

(1/P) Σ_{p=1}^{P} |r^H B z_p|² − r^H B r > 0. (34)

Otherwise, the iterative learning scheme depicted in Figure 4 diverges, that is, α_∞ = ∞. This can be inferred by interpreting (29) as a nonlinear dynamic system that, at iteration j, maps α^{[j−1]} into the updated value α^{[j]}. The nonlinear mapping is given by the right-hand side of (29), where the quantities Φ_p and μ_p depend on the values of the hyperparameters at iteration j − 1. In Figure 6 we show several iterations of this mapping that illustrate how the solution trajectories evolve. If condition (34) is satisfied, the sequence of solutions {α^{[j]}} converges to a stationary point (Figure 6(a)) given by (33). Otherwise, {α^{[j]}} diverges (Figure 6(b)). Thus, (33) is a stationary point only provided the condition (34) is satisfied:

α_∞ = (r^H B r)² / ((1/P) Σ_{p=1}^{P} |r^H B z_p|² − r^H B r) if (34) holds, and α_∞ = ∞ otherwise. (35)

Practically, this means that for a given measurement z_p and known noise matrix B, we can immediately decide whether a given basis function r(τ) should be included in the basis by simply checking if (34) is satisfied or not. A minimal sketch of this test is given below.
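A minimal sketch of the inclusion test (34); r is the candidate basis vector r(τ), B the known inverse noise covariance, and Z the P channel observations stacked as columns (all assumed given).

    import numpy as np

    def passes_test_34(r, B, Z):
        # Keep r(tau) iff (1/P) sum_p |r^H B z_p|^2  >  r^H B r
        q2 = np.mean(np.abs(r.conj() @ B @ Z) ** 2)
        s = np.real(r.conj() @ B @ r)
        return q2 > s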
A similar analysis is performed in [22], where the behavior of the likelihood function with respect to a single parameter is studied. The obtained convergence results coincide with ours when P = 1. Expression (34) is, however, more general and accounts for multiple channel observations and colored noise. In [22] the authors also suggest that testing (34) for a given basis function r(τ) is sufficient to find a sparse representation and that no further pruning is necessary. In other words, each basis function in the design matrix K is subject to the test (34) and, if the test fails, that is, (34) does not hold for the basis function under test, the basis function is pruned. In the case of wireless channels, however, we have experimentally observed that even in simulated high-SNR scenarios such pruning results in a significantly overestimated number of multipath components. Moreover, it can be inferred from (34) that, as the SNR increases, the number of functions pruned with this approach decreases, resulting in less and less sparse representations. This motivates us to perform a more detailed analysis of (35).
Let us slightly modify the assumptions we made earlier. We now assume that the multipath delay τ is unknown. The design matrix is constructed similarly, but this time K = [r_l], where

r_l = [R_uu(−T_l), R_uu(T_s − T_l), …, R_uu((N − 1)T_s − T_l)]^T (36)

is the basis function associated with the delay T_l ∈ T used in our discrete-time model. Under these assumptions the input signal z_p is nothing else but the basis function r(τ) scaled
Figure 6: Evolution of the two representative solution trajectories for two cases: (a) {α^{[j]}} converges, (b) {α^{[j]}} diverges.
and embedded in the additive complex zero-mean Gaussian noise with covariance matrix Σ, that is,

z_p = w_p r(τ) + ξ_p, p = 1, …, P. (37)

Let us further assume that w_p ∈ C, p = 1, …, P, are unknown but fixed complex scaling factors. In further derivations we assume, unless explicitly stated otherwise, that the condition (34) is satisfied for the basis r_l. By plugging (37) into (33) and rearranging the result with respect to α_∞^{−1}, we obtain

α_∞^{−1} = (1/P) Σ_{p=1}^{P} |r_l^H B z_p|² / (r_l^H B r_l)² − 1/(r_l^H B r_l). (38)
Now, we consider two scenarios. In the first scenario τ = T_l ∈ T, that is, the discrete-time model matches the observed signal. Although unrealistic, this allows us to study the properties of α_∞^{−1} more closely. In the second scenario, we study what happens if the discrete-time model does not match the measured signal perfectly. This case helps us to define how the model selection rules have to be adjusted to consider possible misalignment of the path component delays in the model.

In the first scenario, (37) becomes

z_p = w_p r_l + ξ_p, p = 1, …, P, (39)

where the only random quantity is the additive noise term ξ_p.
This allows us to study the statistical properties of the finite-SNR stationary point (38). On the other hand, in the absence of noise, that is, in the infinite SNR case, the corresponding hyperparameter α_∞^{−1} reduces to the average power of the multipath component.6 Substituting (39) into (38) shows that α_∞^{−1} consists of the sum of two contributions, α_∞^{−1} = α_n^{−1} + α_s^{−1}, with

α_n^{−1} = r_l^H B (Σ_{p=1}^{P} ξ_p ξ_p^H) B r_l / (P (r_l^H B r_l)²) − 1/(r_l^H B r_l), (40)

α_s^{−1} = (1/P) Σ_{p=1}^{P} (|w_p|² + 2 Re{w_p ξ_p^H B r_l} / (r_l^H B r_l)). (41)

6 Actually, the second term in the resulting expression vanishes in a perfectly noise-free case, and then α_∞^{−1} = (1/P) Σ_p |w_p|².
We first consider α_s^{−1}. The first term on the right-hand side of (41) is a deterministic quantity that equals the average power of the multipath component. The second one, on the other hand, is random. The product Re{w_p ξ_p^H B r_l} in (41) is recognized as the cross-correlation between the additive noise term and the basis function r_l. It is Gaussian distributed with expectation and variance given as

E[Re{w_p ξ_p^H B r_l}] = 0, (42)

V[Re{w_p ξ_p^H B r_l}] = |w_p|² r_l^H B r_l / 2, (43)

so that this random term averages out as the number of channel observations P grows.
Now, let us consider the term α_n^{−1}. In (40) the only random element is Σ_{p=1}^{P} ξ_p ξ_p^H. This random matrix is known to have a complex Wishart distribution [23, 24] with the scale matrix Σ and P degrees of freedom. Let us denote

c_p = ξ_p^H B r_l / (r_l^H B r_l), σ_c² = E[|c_p|²] = 1/(r_l^H B r_l), (44)

x = (1/P) Σ_{p=1}^{P} |c_p|², (45)

so that α_n^{−1} = x − σ_c². The quantity x is Gamma-distributed with the pdf

p_x(x) = (P^P x^{P−1} / (Γ(P) σ_c^{2P})) e^{−Px/σ_c²}. (46)

The mean and the variance of x are easily computed to be

E[x] = σ_c², V[x] = σ_c⁴/P. (47)
The pdf of α_n^{−1} = x − σ_c² then follows as

p(α_n^{−1}) = (P^P (α_n^{−1} + σ_c²)^{P−1} / (Γ(P) σ_c^{2P})) e^{−P(α_n^{−1} + σ_c²)/σ_c²}, (48)

which is equivalent to (46), but shifted so as to correspond to a zero-mean distribution. However, it is known that only positive values of α_n^{−1} occur in practice. The probability mass of the negative part of (48) equals the probability that the condition (34) is not satisfied, in which case the resulting α_∞ eventually diverges to infinity and the component is pruned. Taking this into account, the pdf of the positive values of α_n^{−1} is obtained by truncating (48) to α_n^{−1} ≥ 0 and renormalizing:

p(α_n^{−1} | α_n^{−1} ≥ 0) = p(α_n^{−1}) / ∫_0^∞ p(a) da, α_n^{−1} ≥ 0. (49)
A closer look at (49) shows that as P increases the variance of the Gamma distribution decreases, with α_n^{−1} concentrating at zero. In the limiting case as P → ∞, (49) converges to a Dirac delta function localized at zero, that is, α_n = ∞. This allows natural pruning of the corresponding basis function. This situation is equivalent to averaging out the noise as the number of channel observations grows. Practically, however, P remains finite and (49) retains a certain finite variance.
The pruning problem can now be approached from the perspective of classical detection theory. To prune a basis function, we have to decide if the corresponding value of α^{−1} has been generated by the noise distribution (49), that is, the null hypothesis, or by the pdf of α_s^{−1} + α_n^{−1}, that is, the alternative hypothesis. The problem might be somewhat relaxed by taking the assumption that α_s^{−1} and α_n^{−1} are statistically independent. However, proving the plausibility of this assumption is difficult. Even if we were successful in finding the analytical expression for the pdf of the alternative hypothesis, such a model selection approach is hampered by our inability to evaluate (43), since the gains w_p are not known a priori. However, we can still use (49) to select a threshold.
Recall that the presented algorithm allows us to learn (estimate) the noise spectral height N_0 = β^{−1} from the measurements. Assuming that we know β and, as a consequence, the whole matrix B, then, for any basis function r_l in the design matrix K and the corresponding hyperparameter α_l, we can decide with a priori specified probability ρ whether α_l is generated by the distribution (49). Indeed, let α_th^{−1} be the ρ-quantile of (49) such that the probability P(α^{−1} ≤ α_th^{−1}) = ρ. Since (49) is known exactly, we can easily compute α_th^{−1} and prune all the basis functions for which α_l^{−1} ≤ α_th^{−1}; a sketch of this computation is given below.
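Under the forms of (46)-(49) reproduced above, the threshold is a quantile of a shifted, truncated Gamma distribution, which a standard statistics library evaluates directly; a sketch (the probability ρ = 0.99 is an arbitrary example value):

    import numpy as np
    from scipy.stats import gamma

    def pruning_threshold(r, B, P, rho=0.99):
        # rho-quantile alpha_th^{-1} of the truncated noise pdf (49)
        sigma_c2 = 1.0 / np.real(r.conj() @ B @ r)   # sigma_c^2 = 1/(r^H B r)
        x_dist = gamma(a=P, scale=sigma_c2 / P)      # (46): mean sigma_c^2, variance sigma_c^4/P
        F0 = x_dist.cdf(sigma_c2)                    # mass of the negative part of (48)
        # Invert the cdf of x - sigma_c^2 truncated to nonnegative values, as in (49)
        return x_dist.ppf(F0 + rho * (1.0 - F0)) - sigma_c2

    # Components with alpha_l^{-1} <= pruning_threshold(r_l, B, P) are pruned.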
The analysis performed above relies on the knowledge that the true multipath delay τ belongs to T. Unfortunately, this is often unrealistic, and the model mismatch τ ∉ T must be considered. To be able to study how the model mismatch influences the value of the hyperparameters, we have to make a few more assumptions. Let us for simplicity select the model delay T_l to be a multiple of the chip period T_p. We will also need to assume a certain shape of the correlation function R_uu(t). It is convenient to assume that the main lobe of R_uu(t) can be approximated by a raised cosine function with period 2T_p. This approximation makes sense if the sounding pulse p(t) defined in Section 2 is a square root raised cosine pulse. Clearly, this approximation can also be applied for other shapes of the main lobe, but the analysis of the quality of such an approximation remains outside the scope of this paper.
Just as in the previous case, we can split the expression (38) into the multipath component contribution α_s^{−1}, which now reads

α_s^{−1} = γ(τ)² (1/P) Σ_{p=1}^{P} |w_p|² + γ(τ) (2/P) Σ_{p=1}^{P} Re{w_p ξ_p^H B r_l} / (r_l^H B r_l), (52)

with the normalized correlation γ(τ) = r_l^H B r(τ) / (r_l^H B r_l), and the same noise contribution α_n^{−1} defined in (40). It can be seen that γ(τ) makes (52) differ from (41), and as such it is the key to the analysis of the model mismatch. Note that this function is bounded as |γ(τ)| ≤ 1, with equality following only if τ = T_l. Note also that in our case for |τ − T_l| < T_p the correlation γ(τ) is strictly positive.
Due to the properties of the sounding sequence u(t), the magnitude of R_uu(t) for |t| > T_p is sufficiently small and in our analysis of model mismatch can safely be assumed to be zero. Furthermore, if r_l is chosen to coincide with a multiple of the sampling period, T_l = lT_s, then it follows from (12) that the product r_l^H B = r_l^H Σ^{−1} = β e_l^H is a vector with all elements being zero except the lth element, which is equal to β. As a result, γ(τ) takes a form identical to that of the correlation function R_uu(t), normalized by R_uu(0). For |τ − T_l| ≥ T_p, γ(τ) can be assumed to be zero, and it makes sense to analyze (52) only when |τ − T_l| < T_p. In Figure 7 we plot the correlation functions R_uu(t) and γ(τ) for this case.
Since the true value of τ is unknown, we assume this parameter to be random, uniformly distributed in the interval [T_l − T_p, T_l + T_p]. This in turn induces corresponding distributions for the random variables γ(τ) and γ(τ)², which enter, respectively, the second and first terms on the right-hand side of (52).

It can be shown that in this case γ(τ) ∼ B(0.5, 0.5), where B(0.5, 0.5) is a Beta distribution [25] with both distribution parameters equal to 1/2. The corresponding pdf p_γ(x) is given in this case as

p_γ(x) = (1/B(0.5, 0.5)) x^{−1/2} (1 − x)^{−1/2}, (54)

where B(·, ·) is a Beta function [26] with B(0.5, 0.5) = π.
It is also straightforward to compute the pdf of the term γ(τ)² via the change of variables p_{γ²}(x) = p_γ(√x)/(2√x). The corresponding empirical and theoretical pdf's of γ(τ) and γ(τ)² are compared in Figure 8; a short simulation sketch is given below.
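The arcsine law for γ(τ) is easy to verify by simulation under the raised-cosine main-lobe approximation; a sketch in the spirit of Figure 8, with arbitrary sample size and binning:

    import numpy as np

    Tp = 1.0                                     # chip period (placeholder units)
    tau = np.random.uniform(-Tp, Tp, size=5000)  # tau - T_l ~ U[-Tp, Tp]
    g = 0.5 * (1.0 + np.cos(np.pi * tau / Tp))   # gamma(tau) under the approximation

    # Empirical histogram versus the Beta(0.5, 0.5) pdf (54)
    hist, edges = np.histogram(g, bins=50, range=(0.0, 1.0), density=True)
    x = 0.5 * (edges[:-1] + edges[1:])           # bin centers, strictly inside (0, 1)
    pdf = 1.0 / (np.pi * np.sqrt(x * (1.0 - x))) # p_gamma(x) from (54)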
Now we have to find out how this information can be utilized to design an appropriate threshold. In the case of a perfectly matched model the threshold is selected based on the noise distribution (49). In the case of a model mismatch, the term (52) measures the amount of the interference resulting from the model imperfection.
Indeed, if |τ − T_l| ≥ T_p, then the resulting γ(τ) = 0, and thus α_s^{−1} = 0. The corresponding evidence parameter α_∞^{−1} is then equal to the noise contribution α_n^{−1} only and will be pruned using the method we described for the matched model case.
Figure 8: Comparison between the empirical and theoretical pdf's of (a) γ(τ) and (b) γ(τ)² for the cosine approximation case. To compute the histogram, N = 5000 samples were used.
If, however, |τ − T_l| < T_p, then a certain fraction of α_s^{−1} will be added to the noise contribution α_n^{−1}, thus causing the interference. In order to be able to take this interference into account and adjust the threshold accordingly, we propose the following approach.
The amount of interference added is measured by the magnitude of α_s^{−1} in (52). It consists of two terms: the first one is the multipath power, scaled by the factor γ(τ)².