Stochastic Approximation and Its Applications
by
Han-Fu Chen
Institute of Systems Science,
Academy of Mathematics and System Science,
Chinese Academy of Sciences,
Beijing, P.R. China
KLUWER ACADEMIC PUBLISHERS
NEW YORK, BOSTON, DORDRECHT, LONDON, MOSCOW
Print ISBN: 1-4020-0806-6
©2003 Kluwer Academic Publishers
New York, Boston, Dordrecht, London, Moscow
Print ©2002 Kluwer Academic Publishers
All rights reserved
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher
Created in the United States of America
Visit Kluwer Online at: http://kluweronline.com
and Kluwer's eBookstore at: http://ebooks.kluweronline.com
Truncated RM Algorithm and TS Method
Weak Convergence Method
Notes and References
2 STOCHASTIC APPROXIMATION ALGORITHMS WITH EXPANDING TRUNCATIONS
General Convergence Theorems by TS Method
Convergence Under State-Independent Conditions
Necessity of Noise Condition
Non-Additive Noise
Connection Between Trajectory Convergence and Property
of Limit Points
Robustness of Stochastic Approximation Algorithms
Dynamic Stochastic Approximation
Notes and References
3 ASYMPTOTIC PROPERTIES OF STOCHASTIC APPROXIMATION ALGORITHMS
Convergence Rate: Nondegenerate Case
Convergence Rate: Degenerate Case
Asymptotic Normality
3.5 Asymptotic Efficiency
Notes and References
4 OPTIMIZATION BY STOCHASTIC APPROXIMATION
Asymptotic Behavior of Global Optimization Algorithm
Application to Model Reduction
Notes and References
5 APPLICATION TO SIGNAL PROCESSING
Recursive Blind Identification
Principal Component Analysis
Recursive Blind Identification by PCA
Constrained Adaptive Filtering
Adaptive Filtering by Sign Algorithms
Asynchronous Stochastic Approximation
Notes and References
6 APPLICATION TO SYSTEMS AND CONTROL
Application to Identification and Adaptive Control
Application to Adaptive Stabilization
Application to Pole Assignment for Systems with Unknown Coefficients
Application to Adaptive Regulation
Notes and References
Convergence Theorems for Martingale
Convergence Theorems for MDS I
Borel-Cantelli-Lévy Lemma
Convergence Criteria for Adapted Sequences
B.5 Convergence Theorems for MDS II
B.6 Weighted Sum of MDS
References
Index
Estimating unknown parameters on the basis of observation data containing information about the parameters is ubiquitous in diverse areas of both theory and application. For example, in system identification the unknown system coefficients are estimated on the basis of input-output data of the control system; in adaptive control the adaptive control gain should be defined based on observation data in such a way that the gain asymptotically tends to the optimal one; in blind channel identification the channel coefficients are estimated using the output data obtained at the receiver; in signal processing the optimal weighting matrix is estimated on the basis of observations; in pattern classification the parameters specifying the partition hyperplane are found by learning; and more examples may be added to this list.

All these parameter estimation problems can be transformed into a root-seeking problem for an unknown function. To see this, let y_k denote the observation at time k, i.e., the information available about the unknown parameters at time k. It can be assumed that the parameter under estimation, denoted by x^0, is a root of some unknown function f(·).
This is not a restriction, because such a function can always be constructed; for example, f(x) = x^0 - x may serve as such a function. Let x_k be the estimate for x^0 at time k. Then the available information at time k can formally be written as

y_k = f(x_k) + ε_k,

where ε_k denotes the observation error. Therefore, by considering y_k as an observation on f at x_k with observation error ε_k, the problem has been reduced to seeking the root of f on the basis of {y_k}.
It is clear that for each problem the specification of f is of crucial importance. The parameter estimation problem can be solved only if f is appropriately selected so that the observation error meets the requirements figured in the convergence theorems.
If f and its gradient could be observed without error at any desired values, then numerical methods such as the Newton-Raphson method, among others, could be applied to solve the problem. However, this kind of method cannot be used here because, in addition to the obvious problems concerning the existence and availability of the gradient, the observations are corrupted by errors which may contain not only a purely random component but also the structural error caused by inadequacy of the selected f.
Aiming at solving the stated problem, Robbins and Monro proposed the recursive algorithm

x_{k+1} = x_k + a_k y_k

to approximate the sought-for root, where a_k is the step size and y_k the noisy observation. This algorithm is now called the Robbins-Monro (RM) algorithm. Following this pioneering work on stochastic approximation, there has been a large amount of applications to practical problems and research on theoretical issues.
At the beginning, the probabilistic method was the main tool in convergence analysis for stochastic approximation algorithms, and rather restrictive conditions were imposed on both the function and the noise. For example, it was required that the growth rate of f be not faster than linear as its argument tends to infinity and that the observation noise be a martingale difference sequence [78]. Though the linear growth rate condition is restrictive, as shown by simulation it can hardly be simply removed without violating convergence of RM algorithms.
To weaken the noise conditions guaranteeing convergence of the algorithm, the ODE (ordinary differential equation) method was introduced in [72, 73] and further developed in [65]. Since the conditions on noise required by the ODE method may be satisfied by a large class of noises, including both random and structural errors, the ODE method has been widely applied for convergence analysis in different areas. However, in this approach one has to assume a priori that the sequence of estimates is bounded. It is hard to say that the boundedness assumption is more desirable than a growth rate restriction on f.
The stochastic approximation algorithm with expanding truncations was introduced in [27], and the analysis method was then improved in [14]. In fact, this is an RM algorithm truncated at expanding bounds, and for its convergence the growth rate restriction on f is not required. The convergence analysis method for the proposed algorithm is called the trajectory-subsequence (TS) method, because the analysis is carried out along trajectories where the noise condition is satisfied, and, in contrast to the ODE method, the noise condition need not be verified on the whole sequence of estimates but only along convergent subsequences. This makes a great difference when dealing with state-dependent noise, because a convergent subsequence is always bounded, while the boundedness of the whole sequence is not guaranteed before its convergence is established. As shown in Chapters 4, 5, and 6, for most parameter estimation problems, after transforming them to a root-seeking problem the structural errors are unavoidable, and they are state-dependent.

The expanding truncation technique equipped with the TS method appears to be a powerful tool for dealing with various parameter estimation problems: it not only has succeeded in essentially weakening the conditions for convergence of the general stochastic approximation algorithm but has also made it possible to apply stochastic approximation successfully in diverse areas. However, there is a lack of a reference that systematically describes the theoretical part of the method and concretely shows how to apply the method to problems coming from different areas. To fill this gap is the purpose of the book.
The book summarizes results on the topic mostly distributed over journal papers and partly contained in unpublished material. The book is written in a systematic way: it starts with a general introduction to stochastic approximation, then describes the basic method used in the book, proves the general convergence theorems, and demonstrates various applications of the general theory.
In Chapter 1 the problem of stochastic approximation is stated, and the basic methods for convergence analysis, such as the probabilistic method, the ODE method, the TS method, and the weak convergence method, are introduced.
Chapter 2 presents the theoretical foundation of the algorithm with expanding truncations: the basic convergence theorems are proved by the TS method; various types of noises are discussed; the necessity of the imposed noise condition is shown; the connection between stability of the equilibrium and convergence of the algorithm is discussed; the robustness of stochastic approximation algorithms is considered when the commonly used conditions deviate from exact satisfaction; and moving root tracking is also investigated. The basic convergence theorems are presented in Section 2.2, and their proof is elementary and purely deterministic.
Chapter 3 describes asymptotic properties of the algorithms: convergence rates for both cases, whether or not the gradient of f is degenerate, asymptotic normality, and asymptotic efficiency by the averaging method.
Starting from Chapter 4, the general theory developed so far is applied to different fields. Chapter 4 deals with optimization by stochastic approximation methods. Convergence and convergence rates of the Kiefer-Wolfowitz (KW) algorithm with expanding truncations and randomized differences are established. A global optimization method consisting of a combination of the KW algorithm with search methods is defined, and its a.s. convergence as well as its asymptotic behavior are established. Finally, the global optimization method is applied to solving the model reduction problem.
In Chapter 5 the general theory is applied to problems arising from signal processing. Applying the stochastic approximation method to blind channel identification leads to a recursive algorithm estimating the channel coefficients and continuously improving the estimates while receiving new signals, in contrast to the existing “block” algorithms. Applying the TS method to principal component analysis results in improved conditions for convergence. Stochastic approximation algorithms with expanding truncations and the TS method are also applied to adaptive filters with and without constraints. As a result, the conditions required for convergence have been considerably improved in comparison with the existing results. Finally, the expanding truncation technique and the TS method are applied to asynchronous stochastic approximation.
In the last chapter, the general theory is applied to problems arising from systems and control. The parameter ideal for operation is identified for stochastic systems by using the methods developed in this book. Then the obtained results are applied to the adaptive quadratic control problem. Adaptive regulation for a nonlinear nonparametric system and learning pole assignment are also solved by the stochastic approximation method.
The book is self-contained in the sense that there are only a few points relying on knowledge for which we refer to other sources, and these points can be ignored when reading the main body of the book. The basic mathematical tools used in the book are calculus and linear algebra, on the basis of which one will have no difficulty reading the fundamental convergence Theorems 2.2.1 and 2.2.2 and their applications described in the subsequent chapters. To understand the other material, concepts from probability theory, especially the convergence theorems for martingale difference sequences, are needed. The necessary probability concepts are given in Appendix A. Some facts from probability that are used at a few specific points are listed in Appendix A without proof, because omitting the corresponding parts still leaves the rest of the book readable. However, the proof of the convergence theorems for martingales and martingale difference sequences is provided in detail in Appendix B.
The book is written for students, engineers, and researchers working in the areas of systems and control, communication and signal processing, optimization and operations research, and mathematical statistics.
HAN-FU CHEN
The support of the National Key Project of China and the National Natural Science Foundation of China is gratefully acknowledged. The author would like to express his gratitude to Dr. Haitao Fang for his helpful suggestions and useful discussions. The author would also like to thank Ms. Jinling Chang for her skilled typing, and to thank my wife Shujun Wang for her constant support.
ROBBINS-MONRO ALGORITHM
Optimization is ubiquitous in various research and application fields. Quite often an optimization problem can be reduced to finding zeros (roots) of an unknown function f(·), which can be observed, but whose observations may be corrupted by errors. This is the topic of stochastic approximation (SA). The error source may be observation noise, but it may also come from structural inaccuracy of the observed function: for example, one wants to find zeros of f(·) but actually observes functions f_k(·) which are different from f(·). Let us denote by y_k the observation at time k and by ε_k the observation noise:

y_k = f(x_k) + ε_k, with ε_k = e_k + (f_k(x_k) − f(x_k)),

where e_k is the measurement error. Here f_k(x_k) − f(x_k) is the additional error caused by the structural inaccuracy. It is worth noting that the structural error normally depends on x_k, and it is hard to require it to have a certain probabilistic property such as independence, stationarity, or the martingale property. We call this kind of noise state-dependent noise.
The basic recursive algorithm for finding roots of an unknown function on the basis of noisy observations is the Robbins-Monro (RM) algorithm, which is characterized by its simplicity in computation. This chapter serves as an introduction to SA, describing various methods for analyzing convergence of the RM algorithm.
In Section 1.1 the motivation of the RM algorithm is explained, and its limitation is pointed out by an example. In Section 1.2 the classical approach to analyzing convergence of the RM algorithm is presented, which is based on probabilistic assumptions on the observation noise. To relax the restrictions made on the noise, a convergence analysis method connecting convergence of the RM algorithm with stability of an ordinary differential equation (ODE) was introduced in the nineteen-seventies. The ODE method is demonstrated in Section 1.3. In Section 1.4 the convergence analysis is carried out along a sample path by considering convergent subsequences; we therefore call this method the trajectory-subsequence (TS) method, which is the basic tool used in the subsequent chapters.

In this book our main concern is the path-wise convergence of the algorithm. However, there is another approach to convergence analysis called the weak convergence method, which is briefly introduced in Section 1.5. Notes and references are given in the last section.
This chapter introduces the main methods used in the literature for convergence analysis, but restricted to the single-root case. Extension to more general cases in various respects is given in later chapters.
Many theoretical and practical problems in diverse areas can be reduced to finding zeros of a function. To see this, it suffices to notice that solving many problems finally consists in optimizing some function, i.e., finding its minimum (or maximum). If the function is differentiable, then the optimization problem reduces to finding the roots of its derivative f(·).
In the case where the function or its derivatives can be observed without errors, there are many numerical methods for solving the problem, for example, the gradient method, by which the estimate x_k for the root of f is recursively generated by the following algorithm:

x_{k+1} = x_k − a_k f(x_k),   (1.1.1)

where f denotes the derivative of the function being minimized. This kind of problem belongs to the topics of optimization theory, which considers general cases where the function may be nonconvex, nonsmooth, and subject to constraints.
In contrast to optimization theory, SA is devoted to finding zeros of an unknown function which can be observed, but whose observations are corrupted by errors.
Since f is not exactly known, and its derivative may not even exist, algorithms like (1.1.1) are no longer applicable. Consider the following simple example. Let f be a linear function. If the derivative of f is available, i.e., if we know its slope, and if f can be observed precisely, then (1.1.1) can be applied directly.
This means that the gradient algorithm leads to the zero of f.
Let us consider the case where f is observed with errors:

y_k = f(x_k) + ε_k,

where y_k denotes the observation at time k, ε_k the corresponding observation error, and x_k the estimate for the root of f at time k. It is natural to ask how x_k will behave if the exact value of f(x_k) in (1.1.2) is replaced by its error-corrupted observation y_k, i.e., if x_k is recursively derived according to the following algorithm:

x_{k+1} = x_k − a_k y_k.   (1.1.5)

In our example f is linear, and (1.1.5) turns into a linear difference equation.
Similar to (1.1.3), the solution of this difference equation shows that x_k converges to the root of f if the weighted average of the observation errors tends to zero as k → ∞. This means that replacement of the exact gradient by a sequence of error-corrupted observations still works, provided the observation errors can be averaged out. It is worth noting that in lieu of (1.1.5) we have to take the positive sign before a_k y_k, i.e., to consider

x_{k+1} = x_k + a_k y_k   (1.1.7)

if f is decreasing, or, more generally, if f decreases as its argument increases.
This simple example demonstrates the basic features of the algorithms (1.1.5) and (1.1.7): 1) the algorithm may converge to a root of f; 2) the limit of the algorithm, if it exists, should not depend on the initial value; 3) the convergence rate is determined by how fast the observation errors are averaged out.
From (1.1.6) it is seen that for linear functions the convergence rate is determined by the rate at which the averaged observation errors vanish. In the case where the errors form a sequence of independent and identically distributed random variables with zero mean and bounded variance, the averaged error (1/n) Σ_{k=1}^{n} ε_k is of order O(((log log n)/n)^{1/2}) by the law of the iterated logarithm. This means that the convergence rate for algorithms (1.1.5) or (1.1.7) with error-corrupted observations cannot be faster than O(((log log n)/n)^{1/2}).
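The behavior of this scalar example is easy to reproduce numerically. The sketch below is illustrative only: the particular decreasing linear function, the uniform noise, and the step sizes a_k = 1/k are choices made here, not taken from the text. It runs the recursion x_{k+1} = x_k + a_k y_k on noisy observations of f:

```python
import random

def rm_linear(x0, root, n_iter, seed=0):
    """Run x_{k+1} = x_k + (1/k) * y_k, where y_k = f(x_k) + e_k,
    f(x) = -(x - root) is a decreasing linear function whose unique
    zero is `root`, and e_k is i.i.d. zero-mean uniform noise."""
    rng = random.Random(seed)
    x = x0
    for k in range(1, n_iter + 1):
        y = -(x - root) + rng.uniform(-1.0, 1.0)  # noisy observation
        x = x + (1.0 / k) * y                     # decreasing gain 1/k
    return x

# Starting far from the root, the iterates settle near it:
# the 1/k gains average the observation noise out.
estimate = rm_linear(x0=10.0, root=2.0, n_iter=20000)
```

With gains a_k = 1/k the estimation error for this linear f behaves like the average of the first k noise terms, so it vanishes at roughly the rate discussed above.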
We have just shown how to find the root of an unknown linear function based on noisy observations. We now formulate the general problem.
Let f(·) be an unknown function with unknown root x^0. Assume f can be observed at each point, but with noise:

y_k = f(x_k) + ε_k,   (1.2.1)

where y_k is the observation at time k, ε_k is the observation noise, and x_k is the estimate for x^0 at time k.

Stochastic approximation algorithms recursively generate x_k to approximate x^0 on the basis of the past observations. In their pioneering work in this area, Robbins and Monro proposed the algorithm

x_{k+1} = x_k + a_k y_k   (1.2.2)

to estimate x^0, where the step size a_k > 0 is decreasing and satisfies a_k → 0 and Σ_{k=1}^{∞} a_k = ∞.
We explain the meaning of the conditions required on the step size. The condition Σ_k a_k = ∞ aims at reducing the effect of the observation noises. To see this, consider the case where x_k is close to x^0 and f(x_k) is close to zero.

(Throughout the book, ||x|| always means the Euclidean norm of a vector x, and for a matrix A, ||A|| denotes the square root of the maximum eigenvalue of the matrix A^T A, where A^T means the transpose of the matrix A.)

Even in the Gaussian noise case, a_k ||ε_k|| may be large if a_k has a positive lower bound. Therefore, in order to have the desired consistency, i.e., x_k → x^0, it is necessary to use decreasing gains such that a_k → 0. On the other hand, consistency can neither be achieved if a_k decreases too fast as k → ∞. To see this, let Σ_k a_k < ∞. Then, even in the noise-free case ε_k ≡ 0, from (1.2.2) we have

||x_{k+1} − x_1|| ≤ Σ_{i=1}^{k} a_i ||f(x_i)|| ≤ c Σ_{i=1}^{∞} a_i < ∞

if f is a bounded function. Therefore, in this case x_k will never converge to x^0 if the initial value x_1 is so far from the true root that ||x_1 − x^0|| > c Σ_{i=1}^{∞} a_i.
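Both step-size requirements can be seen in a noise-free simulation. In this sketch (the linear function and the particular gain sequences are illustrative assumptions), summable gains leave a distant initial value stranded, while the non-summable harmonic gains drive it to the root:

```python
def rm_path(x0, root, gains):
    """Noise-free recursion x_{k+1} = x_k + a_k * f(x_k) with
    f(x) = -(x - root), a decreasing linear function."""
    x = x0
    for a in gains:
        x = x + a * (-(x - root))
    return x

n = 10000
# Summable gains: sum of 1/(k+1)^2 is finite, so the total movement
# is bounded and the iterate cannot cover the distance to the root.
x_fast = rm_path(100.0, 0.0, [1.0 / (k + 1) ** 2 for k in range(1, n + 1)])
# Non-summable gains: sum of 1/(k+1) diverges, and the iterate converges.
x_slow = rm_path(100.0, 0.0, [1.0 / (k + 1) for k in range(1, n + 1)])
```

Here x_fast stalls near 50 (half the initial distance is never covered, since the gain products converge to 1/2), while x_slow ends within about 0.01 of the root.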
The algorithm (1.2.2) is now called the Robbins-Monro (RM) algorithm.
The classical approach to convergence analysis of SA algorithms is based on probabilistic analysis of trajectories. We now present a typical convergence theorem obtained by this approach. Related concepts and results from probability theory are given in Appendices A and B.

In fact, we will use the martingale convergence theorem to prove the path-wise convergence of x_k, i.e., to show x_k → x^0 a.s. For this, the following set of conditions will be used.
A1.2.1 The step size a_k is such that a_k > 0, Σ_{k=1}^{∞} a_k = ∞, and Σ_{k=1}^{∞} a_k² < ∞.
A1.2.2 There exists a twice continuously differentiable Lyapunov function v(·) satisfying the following conditions:
i) its second derivative is bounded;
ii) v(x^0) = 0, v(x) > 0 for x ≠ x^0, and v(x) → ∞ as ||x|| → ∞;
iii) for any ε > 0 there is a c(ε) > 0 such that

sup_{||x − x^0|| ≥ ε} f^T(x) v_x(x) ≤ −c(ε) < 0,

where v_x denotes the gradient of v.
A1.2.3 The observation noise {ε_k} is a martingale difference sequence:

E[ε_{k+1} | F_k] = 0 a.s.,

where {F_k} is a family of nondecreasing σ-algebras such that x_k is F_k-measurable.
A1.2.4 The function f and the conditional second moment of the observation noise have the following upper bound:

||f(x)||² + E[||ε_{k+1}||² | F_k] ≤ c(1 + v(x)),

where c is a positive constant.
Prior to formulating the theorem we need some auxiliary results. Let {(z_k, F_k)} be an adapted sequence, i.e., z_k is F_k-measurable for each k. Define the first exit time of {z_k} from a Borel set B:

τ = min{k : z_k ∉ B}.

It is clear that {τ = k} ∈ F_k, i.e., τ is a Markov time.
Lemma 1.2.1 Assume τ is a Markov time and {(z_k, F_k)} is a nonnegative supermartingale, i.e.,

E[z_{k+1} | F_k] ≤ z_k a.s.

Then the stopped sequence {(z_{min(k,τ)}, F_k)} is also a nonnegative supermartingale.

The proof is given in Appendix B, Lemma B-2-1.
The following lemma concerning convergence of an adapted sequence will be used in the proof of convergence of the RM algorithm, but the lemma is also of interest by itself.
Noticing that both series converge a.s. as k → ∞, we conclude that their sum is also convergent a.s. as k → ∞. Consequently, from (1.2.5) it follows that the sequence under consideration converges a.s. as k → ∞.

For proving ii), note that the stopped process is measurable and nondecreasing; taking conditional expectation leads to a supermartingale inequality. Again, by the convergence theorem for nonnegative supermartingales, the sequence converges a.s. as k → ∞. Since by the same theorem the comparison sequence also converges a.s. as k → ∞, it directly follows that the assertion holds a.s.
Theorem 1.2.1 Assume Conditions A1.2.1–A1.2.4 hold. Then for any initial value, x_k given by the RM algorithm (1.2.2) converges to the root x^0 of f a.s. as k → ∞.
Proof. Let v be the Lyapunov function given in A1.2.2. Expanding v(x_{k+1}) into a Taylor series, we obtain (1.2.6), where v_x and v_{xx} denote the gradient and Hessian of v, respectively, the intermediate point is a vector with components located in-between the corresponding components of x_k and x_{k+1}, and c denotes the constant bounding the Hessian. Taking the conditional expectation in (1.2.6), by (1.2.4) we derive (1.2.7).
Since Σ_k a_k² < ∞ by A1.2.1, we have (1.2.8). Denoting the resulting process appropriately and noticing f^T(x) v_x(x) ≤ 0 by A1.2.2 iii), from (1.2.7) and (1.2.8) it follows that (1.2.9). Therefore, the process is a nonnegative supermartingale and converges a.s. by the convergence theorem for nonnegative supermartingales.
For any ε > 0 denote the set of points at distance at least ε from x^0. Let τ be the first exit time of x_k from the complement of this set after a given instant, where the superscript c denotes the complement; this means that τ is the first time the estimate leaves the ε-neighborhood of x^0 after that instant. Since the drift term is nonpositive, from (1.2.9) it follows that the stopped process remains a supermartingale for any k.
Then by (1.2.2), this implies a lower bound on the accumulated step sizes. By Lemma 1.2.2 ii), the above inequality implies that τ must be finite a.s.; otherwise we would have a convergent series of step sizes, a contradiction to A1.2.1. Therefore, after τ, with the possible exception of a set of probability zero, the trajectory of x_k stays in the ε-neighborhood of x^0. However, we have shown that v(x_k) converges a.s.; therefore v(x_k) → 0 a.s., and by A1.2.2 ii) we then conclude that x_k → x^0 a.s.
Remark 1.2.1 If Condition A1.2.2 iii) holds with the opposite inequality sign, then the sign before a_k y_k in the algorithm (1.2.2) should be changed accordingly.
We now explain the conditions required in Theorem 1.2.1. As noted in Section 1.1, the step size should satisfy Σ_k a_k = ∞, together with the condition Σ_k a_k² < ∞ of A1.2.1. In many cases one can take ||x − x^0||² to serve as v(x). Then from (1.2.4) it follows that the growth rate of ||f(x)|| as ||x|| → ∞ should not be faster than linear. This is a major restriction in applying Theorem 1.2.1. However, if we a priori assume that {x_k} generated by the algorithm (1.2.2) is bounded, then {f(x_k)} is bounded provided f is locally bounded, and then the linear growth condition is not a restriction on f.
As mentioned in Section 1.2, the classical probabilistic approach to analyzing SA algorithms requires rather restrictive conditions on the observation noise. In the nineteen-seventies the so-called ordinary differential equation (ODE) method was proposed for analyzing convergence of SA algorithms. We explain the idea of the method. The estimates x_k generated by the RM algorithm are interpolated to a continuous function, with the interpolating length equal to the step size used in the algorithm. The tail part of the interpolating function is shown to satisfy an ordinary differential equation dx/dt = f(x). The sought-for root x^0 is the equilibrium of the ODE. By stability of this equation, or by assuming existence of a Lyapunov function, it is proved that the interpolating function tends to x^0; from this, it can be deduced that x_k → x^0.
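The idea can be illustrated numerically. In the sketch below everything is an illustrative assumption: f(x) = -x with root x^0 = 0, uniform noise, and gains a_k = 1/k. The iterates are placed at the interpolation times t_k = a_1 + ... + a_k; on the tail, the interpolated trajectory stays close to the solution x(t) = x(t_0) e^{-(t - t_0)} of the ODE dx/dt = -x:

```python
import math
import random

def rm_iterates(x0, n, seed=1):
    """RM recursion x_{k+1} = x_k + a_k * (f(x_k) + e_k) for f(x) = -x,
    a_k = 1/k, and i.i.d. zero-mean noise e_k; returns the iterates
    together with the interpolation times t_k = a_1 + ... + a_k."""
    rng = random.Random(seed)
    xs, ts = [x0], [0.0]
    for k in range(1, n + 1):
        a = 1.0 / k
        xs.append(xs[-1] + a * (-xs[-1] + rng.uniform(-0.5, 0.5)))
        ts.append(ts[-1] + a)
    return xs, ts

xs, ts = rm_iterates(5.0, 50000)

# Compare the tail of the interpolated iterates with the ODE solution
# started from the same tail point: over one unit of ODE time the two
# stay together, because the remaining noise is averaged out.
k0 = 20000
errs = [abs(xs[k] - xs[k0] * math.exp(-(ts[k] - ts[k0])))
        for k in range(k0, len(xs)) if ts[k] - ts[k0] <= 1.0]
max_err = max(errs)
```

Note that one unit of ODE time covers more and more iterations as k grows, because the gains 1/k shrink; this is exactly the time-scale change behind the ODE method.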
For demonstrating the ODE method we need two facts from analysis, which are formulated below as propositions.

Proposition 1.3.1 (Arzelà–Ascoli) Let {g_k(·)} be a family of equicontinuous and uniformly bounded functions, where by equicontinuity we mean that for any ε > 0 and any t there exists δ > 0, independent of k, such that |g_k(t) − g_k(s)| < ε whenever |t − s| < δ. Then there are a continuous function g(·) and a subsequence of functions {g_{k_j}(·)} which converges to g(·) uniformly on any finite interval, i.e., g_{k_j}(t) → g(t) uniformly with respect to t belonging to any finite interval.
Proposition 1.3.2 For the ODE

dx/dt = f(x)   (1.3.1)

with equilibrium x^0, if there exists a continuously differentiable function v(·) whose derivative along the solutions of (1.3.1) is negative whenever x ≠ x^0, then the solution to (1.3.1), starting from any initial value, tends to x^0 as t → ∞, i.e., x^0 is the globally asymptotically stable equilibrium of (1.3.1).
Let us introduce the following conditions.

A1.3.1 a_k > 0, a_k → 0, and Σ_k a_k = ∞.

A1.3.2 There exists a twice continuously differentiable Lyapunov function v(·) such that f^T(x) v_x(x) < 0 whenever x ≠ x^0.
In order to describe conditions on the noise, we introduce an integer-valued function m(k, T) for any integer k and any T > 0:

m(k, T) = max{m : a_k + a_{k+1} + ··· + a_m ≤ T}.   (1.3.2)

Noticing that a_k tends to zero, for any fixed T > 0, m(k, T) − k diverges to infinity as k → ∞. In fact, m(k, T) counts the number of iterations starting from time k as long as the sum of step sizes does not exceed T. The integer-valued function m(k, T) will be used throughout the book. The following conditions will be used:
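The function m(k, T) can be computed directly from its definition. In this sketch the harmonic step sizes are an illustrative choice, and the sequence is 0-indexed so that gains[i] plays the role of a_{i+1}:

```python
def m_of(k, T, gains):
    """Largest index m >= k with gains[k] + ... + gains[m] <= T;
    returns k - 1 if already the single term gains[k] exceeds T."""
    s, m = 0.0, k - 1
    for i in range(k, len(gains)):
        if s + gains[i] > T:
            break
        s += gains[i]
        m = i
    return m

gains = [1.0 / k for k in range(1, 10001)]  # a_k = 1/k

# With a_1 = 1, a_2 = 1/2, a_3 = 1/3, a_4 = 1/4 the partial sums are
# 1, 1.5, 1.833..., 2.083..., so m(0, 2.0) is index 2.
first = m_of(0, 2.0, gains)
# Because the gains tend to zero, m(k, T) - k grows without bound:
# ever more iterations fit into a window of ODE time T.
span = m_of(1000, 1.0, gains) - 1000
```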
A1.3.3 The observation noise satisfies, for any T > 0,

lim_{k→∞} Σ_{i=k}^{m(k,T)} a_i ε_i = 0.   (1.3.3)

A1.3.4 f(·) is continuous.
Theorem 1.3.1 Assume that A1.3.1, A1.3.2, and A1.3.4 hold. If for a fixed sample path A1.3.3 holds and {x_k} generated by the RM algorithm (1.2.2) is bounded, then for this sample path x_k tends to x^0 as k → ∞.
Proof. Set t_1 = 0 and t_k = Σ_{i=1}^{k−1} a_i for k ≥ 2. Define the linear interpolating function x^0(·) with x^0(t_k) = x_k and linear in between. It is clear that x^0(·) is continuous. Further, define the corresponding linear interpolating function of the accumulated noise, which is defined by (1.3.4) with x_k replaced by the noise sums. Since we will deal with the tail part of x^0(·), we define the shifted functions x^k(t) = x^0(t_k + t). Thus, we derive a family of continuous functions {x^k(·)}.
Let us also define the piecewise-constant interpolating function of the iterates. Summing up both sides of (1.2.2) then yields an integral form of the recursion. From this it follows that the noise contribution tends to zero as k → ∞ by A1.3.3. For any t and s in a finite interval we can then bound the increment |x^k(t) − x^k(s)|. By the boundedness of {x_k} and (1.3.11) we see that the family {x^k(·)} is equicontinuous.
By Proposition 1.3.1 we can select from {x^k(·)} a convergent subsequence which tends to a continuous function x̄(·). Consider the difference quotient derived by using (1.3.11). By (1.3.9) it is clear that the remainder term vanishes for large k. Then from (1.3.12) we obtain (1.3.13). Letting k tend to infinity in (1.3.13), by continuity of f and uniform convergence of the subsequence, we conclude that the last term in (1.3.13) converges to zero, and the limit x̄(·) satisfies the ODE.
By A1.3.2 and Proposition 1.3.2 we see that x̄(t) → x^0 as t → ∞.

We now prove that x_k → x^0. Assume the converse: there is a subsequence staying outside some neighborhood of x^0. By (1.3.4) the corresponding shifted interpolating functions form a family that is uniformly bounded and equicontinuous. Hence we can select a convergent subsequence, denoted still in the same way. Its limit satisfies the ODE (1.3.14) and coincides with x̄(·) by the uniqueness of the solution to (1.3.14). By the uniform convergence we obtain a limit relation which, combined with (1.3.15), yields a contradictory inequality for k large enough. This completes the proof of the theorem.
We now compare the conditions used in Theorem 1.3.1 with those in Theorem 1.2.1. Conditions A1.3.1 and A1.3.2 are slightly weaker than A1.2.1 and A1.2.2, but they are almost the same. The noise condition A1.3.3 is significantly weaker than those used in Theorem 1.2.1, because under the conditions of Theorem 1.2.1 the series Σ_k a_k ε_k converges a.s., which certainly implies A1.3.3.
As a matter of fact, Condition A1.3.3 may be satisfied by noise sequences much more general than martingale difference sequences.

Example 1.3.1 Let {ε_k} be a deterministic sequence with ε_k → 0. Then {ε_k} satisfies A1.3.3. This is because

|| Σ_{i=k}^{m(k,T)} a_i ε_i || ≤ T sup_{i≥k} ||ε_i|| → 0 as k → ∞.
Example 1.3.2 Let {ε_k} be an MA (moving average) process, i.e.,

ε_k = c_0 w_k + c_1 w_{k−1} + ··· + c_q w_{k−q},

where {w_k} is a martingale difference sequence with bounded conditional second moments. Then, under condition A1.2.1, Σ_k a_k ε_k < ∞ a.s., and hence Σ_{i=k}^{m(k,T)} a_i ε_i → 0 a.s. Consequently, A1.3.3 is satisfied for almost all sample paths.

Condition A1.3.4 requires continuity of f, which is not required in A1.2.4. At first glance, unlike A1.2.4, Condition A1.3.4 does not impose any growth rate condition on f; but Theorem 1.3.1 a priori requires the boundedness of {x_k}, which is an implicit requirement on the growth rate of f.
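Example 1.3.2 can be checked in simulation. In the sketch below the MA coefficients, the uniform driving noise, and the gains a_k = 1/k are illustrative assumptions; the weighted partial sums of a_k ε_k visibly settle to a limit, which is the a.s. convergence behind A1.3.3:

```python
import random

def ma_noise(n, coefs, seed=2):
    """MA process eps_k = c_0 w_k + c_1 w_{k-1} + ... + c_q w_{k-q},
    driven by i.i.d. zero-mean w_k (a martingale difference sequence)."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(n + len(coefs))]
    return [sum(c * w[k + j] for j, c in enumerate(coefs))
            for k in range(n)]

n = 200000
eps = ma_noise(n, [1.0, 0.6, 0.3])

partial, sums = 0.0, []
for k, e in enumerate(eps, start=1):
    partial += e / k            # a_k = 1/k
    sums.append(partial)

# The weighted series sum a_k * eps_k converges a.s.: over the second
# half of the run the partial sums barely move any more.
drift = max(abs(s - sums[-1]) for s in sums[n // 2:])
```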
The ODE method is widely used in convergence analysis for algorithms arising from various application areas, because it requires of the noise no probabilistic property, which would be difficult to verify. Concerning the weakness of the ODE method, we have mentioned that it a priori assumes that {x_k} is bounded. This condition is difficult to verify in the general case. The other point to be mentioned is that Condition A1.3.3 is also difficult to verify in the case where the noise depends on the past estimates, which often occurs when it contains structural errors of the observed function. This is because A1.3.3 may be verifiable if {x_k} is convergent, but the noise may behave badly depending upon the behavior of {x_k}. So we are somehow in a cyclic situation: with A1.3.3 we can prove convergence of {x_k}; on the other hand, with convergent {x_k} we can verify A1.3.3. This difficulty is overcome by the trajectory-subsequence (TS) method, to be introduced in the next section and used in the subsequent chapters.
In Section 1.2 we considered the root-seeking problem where the sought-for root may be any point of the whole space. If a region to which the root belongs is known, then we may use the truncated algorithm, and the growth rate restriction on f can be removed.

Let us assume that ||x^0|| < b, where the constant b is known. In lieu of (1.2.2) we now consider the following truncated RM algorithm:
x_{k+1} = (x_k + a_k y_k) 1{||x_k + a_k y_k|| ≤ b} + x* 1{||x_k + a_k y_k|| > b},   (1.4.1)

where the observation y_k is given by (1.2.1), x* is a given point with ||x*|| < b, and 1{·} denotes the indicator function. The constant b used in (1.4.1) will be specified later on.

The algorithm (1.4.1) coincides with the RM algorithm while the update evolves in the sphere {x : ||x|| ≤ b}, but if the update exits the sphere, then the algorithm is pulled back to the fixed point x*.
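A minimal numerical sketch of this fixed-bound truncation follows; the linear f, the noise, the bound, and the pull-back point are all illustrative assumptions, with the true root taken inside the truncation sphere:

```python
import random

def truncated_rm(x0, x_fixed, bound, root, n_iter, seed=3):
    """RM step x_k + (1/k) * y_k for f(x) = -(x - root) with noisy
    observations y_k; whenever the update would leave the sphere
    |x| <= bound, the iterate is pulled back to the fixed point
    x_fixed inside the sphere."""
    rng = random.Random(seed)
    x, resets = x0, 0
    for k in range(1, n_iter + 1):
        y = -(x - root) + rng.uniform(-1.0, 1.0)
        cand = x + (1.0 / k) * y
        if abs(cand) <= bound:
            x = cand
        else:
            x, resets = x_fixed, resets + 1
    return x, resets

# With |root| = 2 < bound = 5, truncations can happen at most finitely
# often, after which the recursion runs as a plain RM algorithm.
est, resets = truncated_rm(x0=4.9, x_fixed=0.0, bound=5.0,
                           root=2.0, n_iter=20000)
```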
We will use the following set of conditions:

A1.4.1 The step size satisfies a_k > 0, a_k → 0, and Σ_k a_k = ∞.

A1.4.2 There exists a continuously differentiable Lyapunov function v(·) (not necessarily nonnegative) such that, for the constant b used in (1.4.1), the required decrease condition along f holds within the truncation region.

A1.4.3 For any convergent subsequence {x_{n_j}} of {x_k},

lim_{T→0} limsup_{j→∞} (1/T) || Σ_{i=n_j}^{m(n_j,T)} a_i ε_i || = 0,   (1.4.2)

where m(k, T) is given by (1.3.2).

A1.4.4 f(·) is measurable and locally bounded.
We first compare these conditions with A1.3.1–A1.3.4. We note that A1.4.1 is the same as A1.3.1, while A1.4.2 is weaker than A1.2.2. The difference between A1.3.3 and A1.4.3 consists in that Condition (1.4.2) is required to be verified only along convergent subsequences, while (1.3.3) in A1.3.3 has to be verified along the whole sequence. It will be seen that in many problems A1.4.3 can be verified while A1.3.3 is difficult to verify.

Comparing A1.4.4 with A1.3.4, we find that the conditions on f have now been weakened. The growth rate restriction used in Theorem 1.2.1 and the boundedness assumption on {x_k} imposed in Theorem 1.3.1 have been removed in the following theorem.
Theorem 1.4.1 Assume Conditions A1.4.1, A1.4.2, and A1.4.4 hold and the constant b in A1.4.2 is available; set the truncation bound in (1.4.1) accordingly. If for some sample path A1.4.3 holds, then x_k given by (1.4.1) converges to x^0 for this sample path.
We first prove that truncations in (1.4.1) may happen at most a finite number of times. Assume the converse: there are infinitely many truncations occurring in (1.4.1). Since the pulled-back values return to the fixed point, by A1.4.2 there is an interval which the Lyapunov function values must cross infinitely often. Since the corresponding subsequence of estimates is bounded, we may extract a convergent subsequence from it; let us denote the extracted convergent subsequence in the same way. Since its limit is located in the open sphere {x : ||x|| < b}, there is an η > 0 such that the subsequence stays within the sphere of radius b − η for all sufficiently large indices. Since f is locally bounded by A1.4.4, using (1.4.2) and the boundedness of the subsequence, we find that the increments of the algorithm over a small stretch of ODE time are small if T is small enough and the index is large enough.
This, together with (1.4.5), implies that the norm of the iterates following the subsequence cannot reach the truncation bound. In other words, the algorithm (1.4.1) evolves as the untruncated RM algorithm (1.4.7) for small T and large indices.

By the mean value theorem there exists a vector with components located in-between the corresponding components of the consecutive estimates such that (1.4.8) holds. Notice that by (1.4.2) the left-hand side of (1.4.6) is small for all sufficiently large indices, since the subsequence is bounded. From this it follows that i) the increments are small for T small enough and indices large enough, and ii) the last term in (1.4.8) is of higher order, since a_k → 0 as k → ∞. From (1.4.7) and (1.4.8) it then follows that (1.4.9) holds.
Since the interval does not contain the origin, there is a constant such that the corresponding bound holds for sufficiently small gains and all large enough indices. Then, by A1.4.2 and taking (1.4.4) into account, from (1.4.10) we find that the required inequality holds for large indices. However, we have shown the contrary. The obtained contradiction shows that the number of truncations in (1.4.1) can only be finite. We have proved that, starting from some large index, the algorithm (1.4.1) develops as an RM algorithm
and is bounded
We are now in a position to show that the sequence converges. Assume it were not true. Then there would exist an interval not containing the origin which the sequence would cross infinitely many times. Again, without loss of generality, by the same argument as that used above we would arrive at (1.4.9) and (1.4.10) for large indices and obtain a contradiction. Thus, the sequence tends to a finite limit.
It remains to show that the limit is the sought-for root. Assume the converse: there is a subsequence staying away from the root. Then there is a constant such that the distance remains bounded away from zero for all sufficiently large indices along this subsequence. We still have (1.4.8), (1.4.9), and (1.4.10) for some interval. Letting the index tend to infinity in (1.4.10), by convergence of the sequence we arrive at a contradictory inequality:
The obvious weakness of Theorem 1.4.1 is the assumption on the availability of the upper bound used to set the truncation region. This limitation will be removed later on.
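As a concrete illustration, the fixed-bound truncated RM scheme of this section can be sketched in code. Everything specific below is an assumption for illustration, not the book's exact setup: the regression function f(x) = 1 - x with root x* = 1, step sizes a_k = 1/(k+1), Gaussian observation noise, and a reset to the initial point whenever the update would leave the ball of radius M assumed to contain the root.

```python
import numpy as np

def truncated_rm(f, x0, M, n_steps=5000, noise_std=0.1, seed=0):
    """Robbins-Monro iteration with a fixed truncation bound M (sketch).

    Noisy observations y_k = f(x_k) + e_k are used with step sizes
    a_k = 1/(k+1); whenever the update would leave {|x| <= M}, the
    estimate is reset to the initial point (one possible truncation rule).
    """
    rng = np.random.default_rng(seed)
    x = x0
    truncations = 0
    for k in range(n_steps):
        y = f(x) + noise_std * rng.standard_normal()  # noisy observation
        cand = x + y / (k + 1)                        # RM update, a_k = 1/(k+1)
        if abs(cand) <= M:
            x = cand
        else:
            x = x0          # truncation: reset inside the bound
            truncations += 1
    return x, truncations

# Seek the root x* = 1 of f(x) = 1 - x; the bound M = 10 contains it.
x_final, n_trunc = truncated_rm(lambda x: 1.0 - x, x0=0.0, M=10.0)
```

With the root well inside the bound, no truncation is ever triggered and the iteration behaves as the untruncated RM algorithm, in line with the proof above.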
Up to now we have worked with decreasing gains, which are necessary for path-wise convergence when the observations are corrupted by noise. However, in some applications people prefer using a constant gain: where, in contrast to (1.2.2), a small constant stands in place of the decreasing step sizes, and the analysis considers the limit as this constant tends to zero.
Define the piece-wise constant interpolating function as
Then it belongs to the space of real functions on the half-line that are right continuous and have left-hand limits, endowed with the Skorohod topology. Convergence to a continuous function in the Skorohod topology is equivalent to uniform convergence on any bounded interval.
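The piece-wise constant interpolation just described can be sketched as follows; the names and the clamping of t beyond the last available iterate are our implementation choices.

```python
import numpy as np

def interpolate_pc(iterates, eps):
    """Return the piece-wise constant interpolation t -> x(t) with
    x(t) = x_k for t in [k*eps, (k+1)*eps): the discrete iterates are
    spread over continuous time with spacing equal to the gain eps."""
    iterates = np.asarray(iterates, dtype=float)

    def x_eps(t):
        k = min(int(t / eps), len(iterates) - 1)  # index of the active piece
        return float(iterates[k])

    return x_eps

x = interpolate_pc([0.0, 0.5, 0.75], eps=0.1)
# Right continuous with left-hand limits: x(0.05) = 0.0, x(0.1) = 0.5.
```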
Let the probability measures determined by the corresponding stochastic processes be considered on this space, with the measurable structure induced by the Skorohod topology.
If for any bounded continuous function defined on the space the integrals with respect to the measures converge to the integral with respect to the limit measure, then we say that the sequence of measures weakly converges. If for any positive number there is a compact measurable set whose complement has measure less than that number uniformly over the family, then the family is called tight. Further, the family is called relatively compact if each of its subsequences contains a weakly convergent subsequence.
In the weak convergence analysis an important role is played by Prohorov's Theorem, which says that on a complete and separable metric space tightness is equivalent to relative compactness. The weak convergence method establishes the weak limit of the interpolated processes as the gain tends to zero, and then the convergence of the estimates to the root in probability.
Theorem 1.5.1 Assume the following conditions:
A1.5.1 is a.s. bounded;
Further, if the equilibrium is asymptotically stable for (1.5.3), then for any initial condition the distance between the iterates and the equilibrium converges to zero in probability as the gain tends to zero and time tends to infinity.
Instead of a proof, we only outline its basic idea. First, it is shown that we can extract a subsequence of the interpolated processes weakly converging to a limit. For notational simplicity, denote the subsequence still by the same symbol. By the Skorohod representation, we may assume convergence with probability one. For this we need only, if necessary, change the probability space and take processes on this new space having the same distributions as the original ones.
Then it is proved that the compensated process is a martingale. Since, as can be shown, the limit is Lipschitz continuous, it follows that it satisfies the limit equation. Since the family is relatively compact and the limit does not depend on the extracted subsequence, the whole family weakly converges as the gain tends to zero, and the limit satisfies (1.5.3). The last assertion then follows by asymptotic stability.
Remark 1.5.1 The boundedness assumption may be removed. For this a smooth function with suitable truncation properties is introduced, and the corresponding truncated algorithm is considered in lieu of (1.5.1). The truncated iterates are then interpolated to a piece-wise constant process that is tight and weakly convergent as the gain tends to zero, and the limit satisfies the corresponding truncated equation. Finally, by showing that the relevant upper limits are appropriately dominated, it is proved that the original interpolation itself is tight and weakly converges to a limit satisfying (1.5.3).
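The content of Theorem 1.5.1, namely that for a small constant gain the interpolated iterates stay close in probability to the solution of the limit ODE (1.5.3), can be illustrated numerically. Everything concrete below is an assumption for illustration: the linear regression function giving the ODE xdot = -x, the gain values, and the noise level.

```python
import numpy as np

def constant_gain_sa(f, x0, eps, horizon, noise_std=0.5, seed=0):
    """Run x_{k+1} = x_k + eps * (f(x_k) + e_k) up to ODE time `horizon`,
    i.e. for int(horizon/eps) steps, viewing iterate k at time t_k = k*eps."""
    rng = np.random.default_rng(seed)
    n = int(horizon / eps)
    x = np.empty(n + 1)
    x[0] = x0
    for k in range(n):
        x[k + 1] = x[k] + eps * (f(x[k]) + noise_std * rng.standard_normal())
    return x

f = lambda x: -x            # limit ODE xdot = -x with solution x0 * exp(-t)
x0, horizon = 2.0, 3.0
for eps in (0.1, 0.001):
    traj = constant_gain_sa(f, x0, eps, horizon)
    ode = x0 * np.exp(-eps * np.arange(len(traj)))  # exact ODE solution at t_k
    print(eps, np.mean((traj - ode) ** 2))          # mean squared deviation
```

The mean squared deviation from the ODE trajectory shrinks as the gain decreases, reflecting the weak convergence statement.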
The stochastic approximation algorithm was first proposed by Robbins and Monro in [82], where the mean square convergence of the algorithm was established under the independence assumption on the observation noise. Later, the noise assumption was extended from independent sequences to martingale difference sequences (e.g., [7, 40, 53]).
The probabilistic approach to convergence analysis is well summarized in [78]. The ODE approach was proposed in [65, 72], and then it was widely used [4, 85]; for a detailed presentation of the ODE method we refer to [65, 68]. The proof of the Arzelà-Ascoli Theorem can be found in ([37], p. 266). Section 1.4 is an introduction to the method described in detail in the coming chapters. For stability and Lyapunov functions we refer to [69]. The weak convergence method was developed by Kushner [64, 68]. The Skorohod topology and Prohorov's theorem can be found in [6, 41]. For the probability concepts briefly presented in Appendix A we refer to [30, 32, 70, 76, 84], but the proof of the convergence theorem for martingale difference sequences, which is frequently used throughout the book, is given in Appendix B.
2 STOCHASTIC APPROXIMATION ALGORITHMS WITH EXPANDING TRUNCATIONS

In Chapter 1 the RM algorithm, the basic algorithm used in stochastic approximation (SA), was introduced, and four different methods for analyzing its convergence were presented. However, the conditions imposed for convergence are rather strong.

Comparing the theorems derived by the various methods in Chapter 1, we find that the TS method introduced in Section 1.4 requires the weakest condition on noise. The trouble is that the sought-for root has to be inside the truncation region. This motivates us to consider SA algorithms with expanding truncations, with the purpose that the truncation region will eventually cover the sought-for root, whose location is unknown. This is described in Section 2.1.

General convergence theorems for the SA algorithm with expanding truncations are given in Section 2.2. The key point of the proof is to show that the number of truncations is finite. If this is done, then the estimate sequence is bounded and the algorithm turns into the conventional RM algorithm after a finite number of steps. This is realized by using the TS method. It is worth noting that the fundamental convergence theorems given in this section are proved by a completely elementary method, which is deterministic and requires no more than calculus. In Section 2.3 state-independent conditions on the noise are given that guarantee convergence of the algorithm when the noise itself is state-dependent. In Section 2.4 the conditions on the noise are discussed; it appears that the noise condition in the general convergence theorems is, in a certain sense, necessary. In Section 2.5 the convergence theorem is given for the case where the observation noise is non-additive.

In the multi-root case, up to Section 2.6 we have only established that the distance between the estimate and the root set tends to zero. But by no means does this imply convergence of the estimate itself. This is briefly discussed in Section 2.4, and is considered in Section 2.6 in connection with properties of the equilibria, where conditions are given to guarantee trajectory convergence. It is also considered whether the limit of the estimate is a stable or an unstable equilibrium. In Section 2.7 it is shown that a small distortion of the conditions causes only a small estimation error in the limit, while Section 2.8 of this chapter considers the case where the sought-for root is moving during the estimation process; convergence theorems are derived with the help of the general convergence theorem given in Section 2.2. Notes and references are given in the last section.

2.1 Motivation

In Chapter 1 we presented four types of convergence theorems, using different analysis methods, for SA algorithms. However, none of these theorems is completely satisfactory in applications. Theorem 1.2.1 is proved by the classical probabilistic method, which requires restrictive conditions on the noise. As mentioned before, the noise may contain a component caused by the structural inaccuracy of the function, and it is hard to assume this kind of noise to be mutually independent or to be a martingale difference sequence. The growth rate restriction imposed on the function is not only severe, but in a certain sense also unavoidable. To see this, let us consider the following example:

It is clear that Conditions A1.2.1, A1.2.2, and A1.2.3 are satisfied, where for A1.2.2 one may take a suitable function. The only condition that is not satisfied is (1.2.4), since the left-hand side grows faster than the right-hand side of (1.2.4), which is a second order polynomial. A simple calculation shows that the sequence given by the RM algorithm rapidly diverges:

From this one might conclude that the growth rate restriction would be necessary. However, if we take a sufficiently small initial value, then the sequence given by the RM algorithm converges. Reducing the initial value is, in a certain sense, equivalent to using the step sizes not from the first one but from some later one. The difficulty consists in that it is unclear from which step we should
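The divergence phenomenon discussed in this section is easy to reproduce numerically. The cubic regression function f(x) = -x**3 and the step sizes a_k = 1/k below are assumptions chosen to match the description (the iteration is taken noise-free for clarity), not necessarily the book's exact example.

```python
def rm_iterates(x1, n=6):
    """RM iteration x_{k+1} = x_k + a_k * f(x_k), with a_k = 1/k and an
    (assumed) cubic regression function f(x) = -x**3, run noise-free."""
    xs = [x1]
    for k in range(1, n):
        x = xs[-1]
        xs.append(x + (-x ** 3) / k)
    return xs

big = rm_iterates(2.0)    # large initial value: the iterates explode
small = rm_iterates(0.1)  # same algorithm, small initial value: decreases to 0
```

For the large initial value the cubic term overwhelms the step size and the magnitudes grow super-exponentially after only a few steps, while the small initial value gives a monotonically decreasing positive sequence, matching the dichotomy described above.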