CHAPTER 9
Limit theorems
9.1 The early limit theorems
The term ‘limit theorems’ refers to several theorems in probability theory under the generic names ‘law of large numbers’ (LLN) and ‘central limit theorem’ (CLT). These limit theorems constitute one of the most important and elegant chapters of probability theory and play a crucial role in statistical inference. The origins of these theorems go back to the seventeenth-century result proved by James Bernoulli.
Bernoulli's theorem
Let $S_n$ be the number of occurrences of an event $A$ in $n$ independent trials of a random experiment $\mathscr{E}$, and let $p = P(A)$ be the probability of occurrence of $A$ in each of the trials. Then for any $\varepsilon > 0$

$$\lim_{n\to\infty} \Pr\left(\left|\frac{S_n}{n} - p\right| < \varepsilon\right) = 1, \qquad (9.1)$$

i.e. the probability of the event $|(S_n/n) - p| < \varepsilon$ approaches one as the number of trials goes to infinity.
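To see the theorem ‘in action’, the following simulation (a minimal sketch in Python with numpy, illustrative only and not part of the original text; the values of $p$ and $\varepsilon$ are arbitrary choices) estimates the probability in (9.1) for a growing number of trials:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
p = 0.3      # the probability P(A) of the event A (an arbitrary choice)
eps = 0.05   # the epsilon of (9.1)

for n in [10, 100, 1_000, 10_000]:
    reps = 5_000                                 # replications of the n-trial experiment
    S_n = rng.binomial(n, p, size=reps)          # S_n = number of occurrences of A
    prob = np.mean(np.abs(S_n / n - p) < eps)    # estimate of Pr(|S_n/n - p| < eps)
    print(f"n = {n:6d}:  Pr(|S_n/n - p| < {eps}) = {prob:.4f}")
```

As $n$ grows the estimated probability climbs towards one, exactly as (9.1) asserts.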
Shortly after the publication of Bernoulli’s result, De Moivre and Laplace, in their attempt to provide an easier way to calculate binomial probabilities, proved that when $[(S_n/n) - p]$ is multiplied by a factor equal to the inverse of its standard error the resulting quantity has a distribution which approaches the normal as $n \to \infty$, i.e.

$$\lim_{n\to\infty} \Pr\left(\frac{(S_n/n) - p}{\sqrt{p(1-p)/n}} \le z\right) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp\{-\tfrac{1}{2}u^2\}\, du. \qquad (9.2)$$
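A short simulation can also illustrate (9.2): standardise the relative frequency and compare its empirical distribution function with the normal one. This is a hedged sketch (Python/numpy; the sample sizes and grid of $z$-values are my own choices, not from the text):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(seed=0)
p, n, reps = 0.3, 2_000, 100_000

S_n = rng.binomial(n, p, size=reps)
Z = (S_n / n - p) / sqrt(p * (1 - p) / n)          # the standardised quantity in (9.2)

Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard normal DF
for z in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"z = {z:+.1f}:  empirical {np.mean(Z <= z):.4f}   Phi(z) {Phi(z):.4f}")
```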
These two results gave rise to a voluminous literature related to the various ramifications and extensions of the Bernoulli and De Moivre–Laplace theorems, known today as ‘the’ LLN and ‘the’ CLT respectively. The purpose of this chapter is to consider some of the extensions of the Bernoulli and De Moivre–Laplace results. In the discussion which follows emphasis is placed on the intuitive understanding of the conclusions as well as the crucial assumptions underlying the various limit theorems. The discussion is semi-historical in a conscious attempt to motivate the various extensions and the weakening of the underlying assumptions giving rise to the results.
The main conditions underlying the Bernoulli and De Moivre–Laplace results are the following:
(LT1) $S_n = \sum_{i=1}^{n} X_i$, that is, $S_n$ is defined as the sum of $n$ random variables (r.v.’s);
(LT2) $X_i = 1$ if $A$ occurs and $X_i = 0$ otherwise, $i = 1, 2, \ldots, n$, i.e. the $X_i$’s are Bernoulli r.v.’s and hence $S_n$ is a binomially distributed r.v.;
(LT3) $X_1, X_2, \ldots, X_n$ are independent r.v.’s;
(LT4) $f(x_1) = f(x_2) = \cdots = f(x_n)$, i.e. $X_1, X_2, \ldots, X_n$ are identically distributed, with $\Pr(X_i = 1) = p$, $\Pr(X_i = 0) = 1 - p$ for $i = 1, 2, \ldots, n$;
(LT5) $E(S_n/n) = p$, i.e. we consider the event defined by the difference between a r.v. and its expected value.
The main difference between the Bernoulli and De Moivre–Laplace theorems lies in their notion of convergence, the former referring to the convergence of the probability associated with the sequence of events $|(S_n/n) - p| < \varepsilon$, and the latter to the convergence of the probability associated with a very specific sequence of events, that is, events of the form $(Z_n \le z)$ which define the distribution function $F(z)$. In order to discriminate between them we call the former ‘convergence in probability’ and the latter ‘convergence in distribution’.
Definition 1
A sequence of r.v.’s $\{Y_n, n \ge 1\}$ is said to converge in probability to a r.v. (or constant) $Y$ if for every $\varepsilon > 0$

$$\lim_{n\to\infty} \Pr(|Y_n - Y| < \varepsilon) = 1.$$

We denote this by $Y_n \xrightarrow{P} Y$.
Definition 2
A sequence of r.v.’s $\{Y_n, n \ge 1\}$ with distribution functions $\{F_n(y), n \ge 1\}$ is said to converge in distribution to a r.v. $Y$ with distribution function $F(y)$ if

$$\lim_{n\to\infty} F_n(y) = F(y)$$

at all points of continuity of $F(y)$; denoted by $Y_n \xrightarrow{D} Y$.
It should be emphasised that neither of the above types of convergence tells us anything about any convergence of the sequence $\{Y_n\}$ to $Y$ in the sense used in mathematical analysis, such as: for each $\varepsilon > 0$ and $s \in S$, there exists an $N = N(\varepsilon, s)$ such that

$$|Y_n(s) - Y(s)| < \varepsilon \quad \text{for all } n \ge N(\varepsilon, s).$$
Both convergence types refer only to convergence of probabilities or functions associated with probabilities. On the other hand, the definition of a r.v. has nothing to do with probabilities, and the above convergence of $Y_n$ to $Y$ on $S$ is convergence of real-valued functions defined on $S$. The type of stochastic convergence which comes closer to the above mathematical convergence is known as ‘almost sure’ convergence.
Definition 3
A sequence of r.v.’s $\{Y_n, n \ge 1\}$ converges to $Y$ (a r.v. or a constant) almost surely (or with probability one) if

$$\Pr\left(\lim_{n\to\infty} Y_n = Y\right) = 1; \quad \text{denoted by } Y_n \xrightarrow{a.s.} Y, \qquad (9.6)$$

or, equivalently, if for any $\varepsilon > 0$

$$\lim_{n\to\infty} \Pr\left(\sup_{m \ge n} |Y_m - Y| > \varepsilon\right) = 0.$$
This is a much stronger mode of convergence than either convergence in probability or convergence in distribution. For a more extensive discussion of these modes of convergence and their interrelationships see Chapter 10.
The limit theorems associated with almost sure convergence are appropriately called ‘strong laws of large numbers’ (SLLN). The term is used to emphasise the distinction with the ‘weak law of large numbers’ (WLLN) associated with convergence in probability.
In the next section the law of large numbers is used as an example of the developments the various limit theorems have undergone since Bernoulli. For this reason the discussion is intentionally rather long, in an attempt to motivate a deeper understanding of the crucial assumptions giving rise to all the limit theorems considered in the sequel.
9.2 The law of large numbers
(1) The weak law of large numbers (WLLN)
Early in the nineteenth century Poisson realised that the condition LT4, asserting identical distributions for $X_1, \ldots, X_n$, was not necessary for the result to go through.
Poisson's theorem
Let $\{X_n, n \ge 1\}$ be a sequence of independent Bernoulli r.v.’s with $\Pr(X_i = 1) = p_i$ and $\Pr(X_i = 0) = 1 - p_i$, $i = 1, 2, \ldots, n$; then, for any $\varepsilon > 0$,

$$\lim_{n\to\infty} \Pr\left(\left|\frac{S_n}{n} - \frac{1}{n}\sum_{i=1}^{n} p_i\right| < \varepsilon\right) = 1.$$
The important breakthrough in relation to the WLLN was made by Chebyshev, who realised that not only LT4 but also LT2 was unnecessary for the result to follow. That is, the fact that $X_1, \ldots, X_n$ were Bernoulli r.v.’s was not contributing to the result in any essential way. What was crucially important was the fact that we considered the summation of $n$ r.v.’s to form $S_n = \sum_{i=1}^{n} X_i$ and compared it with its mean.
Chebyshev’s theorem
Let $\{X_n, n \ge 1\}$ be a sequence of independent r.v.’s such that $E(X_i) = \mu_i$, $\operatorname{Var}(X_i) = \sigma_i^2 < c < \infty$, $i = 1, 2, \ldots, n$; then for any $\varepsilon > 0$,

$$\lim_{n\to\infty} \Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{n}\sum_{i=1}^{n}\mu_i\right| < \varepsilon\right) = 1.$$
In order to see how these conditions ensure the result, let us prove Chebyshev’s theorem.

Proof: Since the $X_i$’s are independent,

$$\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\sigma_i^2 < \frac{c}{n}.$$

Using Chebyshev’s inequality for $(1/n)\sum_{i=1}^{n} X_i$ we get

$$\Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{n}\sum_{i=1}^{n}\mu_i\right| \ge \varepsilon\right) \le \frac{1}{\varepsilon^2 n^2}\sum_{i=1}^{n}\sigma_i^2,$$

and the result follows since

$$\lim_{n\to\infty}\frac{1}{\varepsilon^2 n^2}\sum_{i=1}^{n}\sigma_i^2 \le \lim_{n\to\infty}\frac{c}{\varepsilon^2 n} = 0.$$
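The proof rests on just two facts: the variance of the sample mean decays like $1/n$, and Chebyshev’s inequality turns this into a probability bound. The sketch below (Python/numpy; the normal distribution and the particular $\mu_i$, $\sigma_i$ are illustrative assumptions, not from the text) checks both numerically for independent but non-identically distributed r.v.’s:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
c, eps = 4.0, 0.05   # c bounds the variances, as in Chebyshev's theorem

for n in [100, 1_000, 10_000]:
    reps = 2_000
    mu = np.linspace(0.0, 1.0, n)                  # heterogeneous means mu_i
    sigma = rng.uniform(0.5, np.sqrt(c), n)        # sigma_i^2 < c for every i
    X = rng.normal(mu, sigma, size=(reps, n))      # independent, non-identical
    dev = np.abs(X.mean(axis=1) - mu.mean())       # |(1/n) sum X_i - (1/n) sum mu_i|
    print(f"n = {n:6d}:  Pr(dev >= {eps}) = {np.mean(dev >= eps):.4f}   "
          f"bound c/(n*eps^2) = {c / (n * eps**2):.2f}")
```

The Chebyshev bound is crude (it can exceed one for small $n$), but both it and the empirical probability shrink towards zero as $n$ grows.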
Markov, a student of Chebyshev’s, noticed that in the proof of Chebyshev’s theorem the fact that $X_1, X_2, \ldots, X_n$ are independent played only a minor role, in enabling us to deduce that $\operatorname{Var}(S_n/n) = (1/n^2)\sum_{i=1}^{n}\sigma_i^2$. The above proof goes through provided that $(1/n^2)\operatorname{Var}(\sum_{i=1}^{n} X_i) \to 0$ as $n \to \infty$. Since

$$\operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n}\operatorname{Var}(X_i) + \sum_{i \ne j}\sum \operatorname{Cov}(X_i, X_j),\dagger \qquad (9.10)$$

we need to assume that $\operatorname{Var}(\sum_{i=1}^{n} X_i)$ is of smaller order of magnitude (see Chapter 10) than $n^2$ for the result to follow. Hence LT3 is not a crucial condition.
Markov’s theorem
Let $\{X_n, n \ge 1\}$ be a sequence of r.v.’s such that

$$\lim_{n\to\infty}\frac{1}{n^2}\operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = 0;$$

then, for any $\varepsilon > 0$,

$$\lim_{n\to\infty}\Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{n}\sum_{i=1}^{n} E(X_i)\right| < \varepsilon\right) = 1.$$
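Markov’s condition admits dependent sequences, so long as the covariances do not accumulate too fast. As an illustration (a sketch, not from the original text; the AR(1) model and its parameters are my own choices), a stationary autoregressive process has geometrically decaying covariances, so $\operatorname{Var}(S_n) = O(n) = o(n^2)$ and the WLLN still applies:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
phi = 0.8   # AR(1) coefficient; Cov(X_t, X_s) decays like phi^|t-s|

def ar1_sample_means(n, reps):
    """Sample means of `reps` independent stationary AR(1) paths with E(X_t) = 0."""
    e = rng.normal(0.0, 1.0, size=(reps, n))
    x = np.empty((reps, n))
    x[:, 0] = e[:, 0] / np.sqrt(1.0 - phi**2)   # draw the stationary start
    for t in range(1, n):
        x[:, t] = phi * x[:, t - 1] + e[:, t]
    return x.mean(axis=1)

for n in [100, 1_000, 10_000]:
    m = ar1_sample_means(n, reps=500)
    print(f"n = {n:6d}:  Pr(|sample mean| >= 0.1) = {np.mean(np.abs(m) >= 0.1):.3f}")
```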
Khinchin, a student of Markov’s, realised that, in the case of independent and identically distributed (IID) r.v.’s, Markov’s condition was not a necessary condition. In fact, in the IID case no restriction on the nature of the variances is needed.
Khinchin’s theorem
Let $\{X_n, n \ge 1\}$ be a sequence of IID r.v.’s; then the existence of $E(X_i) = \mu$ for all $i$ is sufficient to imply that for any $\varepsilon > 0$

$$\lim_{n\to\infty}\Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| < \varepsilon\right) = 1.$$
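Note what Khinchin’s theorem does not require: a finite variance. A quick check (a hedged sketch; the Pareto distribution with tail index 1.5 is my illustrative choice) uses IID r.v.’s whose mean exists but whose variance is infinite; the sample mean still settles down, albeit slowly:

```python
import numpy as np

rng = np.random.default_rng(seed=3)
alpha = 1.5                           # E(X) = alpha/(alpha - 1) = 3 exists,
true_mean = alpha / (alpha - 1.0)     # but Var(X) is infinite for alpha <= 2

for n in [10**3, 10**5, 10**7]:
    X = rng.pareto(alpha, size=n) + 1.0    # classical Pareto on [1, infinity)
    print(f"n = {n:8d}:  sample mean = {X.mean():.4f}   (E(X) = {true_mean:.4f})")
```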
Kolmogorov (1926) settled the issue by providing both necessary as well as sufficient conditions for the WLLN.
Kolmogorov’s theorem 1
The sequence of r.v.’s $\{X_n, n \ge 1\}$ obeys the WLLN if and only if

$$\lim_{n\to\infty} E\left\{\frac{Z_n^2}{1+Z_n^2}\right\} = 0, \quad \text{where } Z_n = \frac{1}{n}\left[S_n - E(S_n)\right].$$
† There are $n^2$ terms in (9.10), and thus if all of them are bounded, $\operatorname{Var}(\sum_{i=1}^{n} X_i)$ can be at most of the same order as $n^2$.
(2) The strong law of large numbers (SLLN)
The first result relating to the almost sure convergence of $S_n/n$ in the case of Bernoulli distributed r.v.’s was proved by Borel in 1909.
Borel’s theorem
Let $\{X_n\}$ be a sequence of IID Bernoulli r.v.’s with $\Pr(X_i = 1) = p$ and $\Pr(X_i = 0) = 1 - p$ for all $i$; then

$$\Pr\left(\lim_{n\to\infty}\frac{S_n}{n} = p\right) = 1, \quad \text{i.e. } \frac{S_n}{n} \xrightarrow{a.s.} p.$$
In other words, the event defined by $\{s: \lim_{n\to\infty} [S_n(s)]/n = p,\ s \in S\}$ has probability one, $S$ being the sample space. An equivalent way to express this is

$$\lim_{n\to\infty}\Pr\left(\max_{m \ge n}\left|\frac{S_m}{m} - p\right| < \varepsilon\right) = 1.$$

This brings out the relationship between the SLLN and the WLLN, since the former refers to the simultaneous realisation of the inequalities $|S_m/m - p| < \varepsilon$ for all $m \ge n$, and

$$\left|\frac{S_n}{n} - p\right| \le \max_{m \ge n}\left|\frac{S_m}{m} - p\right|.$$

This implies that ‘$\xrightarrow{a.s.}$’ entails ‘$\xrightarrow{P}$’.
Kolmogorov, by replacing the Markov condition

$$\lim_{n\to\infty}\frac{1}{n^2}\operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = 0 \qquad (9.18)$$

for the WLLN in the case of independent r.v.’s with the stronger condition

$$\sum_{k=1}^{\infty}\frac{\operatorname{Var}(X_k)}{k^2} < \infty, \qquad (9.19)$$

proved the first SLLN for a general sequence of independent r.v.’s.
Kolmogorov’s theorem 2
Let $\{X_n, n \ge 1\}$ be a sequence of independent r.v.’s such that $E(X_i)$ and $\operatorname{Var}(X_i)$ exist for all $i = 1, 2, \ldots$; then, if they satisfy condition (9.19), we can deduce that

$$\Pr\left(\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\left[X_i - E(X_i)\right] = 0\right) = 1.$$
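Condition (9.19) allows the individual variances to grow, provided they are summable against $k^2$. The single-path simulation below (a sketch, not from the original text; the choice $\operatorname{Var}(X_k) = \sqrt{k}$ is mine, made so that $\sum_k \operatorname{Var}(X_k)/k^2 < \infty$) traces one realisation of the centred average, which should settle at zero:

```python
import numpy as np

rng = np.random.default_rng(seed=4)
N = 200_000
k = np.arange(1, N + 1)
sigma = k**0.25                  # Var(X_k) = sqrt(k); sum Var(X_k)/k^2 converges
X = rng.normal(0.0, sigma)       # independent, E(X_k) = 0, growing variances

partial_means = np.cumsum(X) / k      # one path of (1/n) * sum_{i<=n} X_i
for n in [10**3, 10**4, 10**5, N]:
    print(f"n = {n:7d}:  (1/n) S_n = {partial_means[n - 1]:+.5f}")
```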
This SLLN is analogous to Chebyshev’s WLLN, and in the same way we can prove it using an inequality. The inequality used in this context is
Kolmogorov’s inequality: if $X_1, X_2, \ldots, X_n$ are independent r.v.’s such that $\operatorname{Var}(X_i) = \sigma_i^2 < \infty$, $i = 1, 2, \ldots, n$, then for any $\varepsilon > 0$

$$\Pr\left(\max_{1\le k\le n}\left|\sum_{i=1}^{k}\left[X_i - E(X_i)\right]\right| \ge \varepsilon\right) \le \frac{1}{\varepsilon^2}\sum_{i=1}^{n}\sigma_i^2.$$
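The inequality bounds the maximum of all $n$ partial sums at the price of only the total variance, which is what makes almost sure statements possible. The simulation below (an illustrative sketch; the distributions and constants are my own choices) compares the two sides:

```python
import numpy as np

rng = np.random.default_rng(seed=5)
n, reps, eps = 50, 20_000, 15.0
sigma = rng.uniform(0.5, 1.5, n)                  # heterogeneous sigma_i

X = rng.normal(0.0, sigma, size=(reps, n))        # centred independent r.v.'s
max_abs_partial = np.max(np.abs(np.cumsum(X, axis=1)), axis=1)

lhs = np.mean(max_abs_partial >= eps)             # Pr(max_k |S_k| >= eps)
rhs = np.sum(sigma**2) / eps**2                   # (1/eps^2) * sum sigma_i^2
print(f"empirical probability {lhs:.4f}  <=  Kolmogorov bound {rhs:.4f}")
```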
Kolmogorov went on to prove that in the case where $\{X_n, n \ge 1\}$ is a sequence of IID r.v.’s such that $E|X_i| < \infty$, the truncated r.v.’s $X_k^* = X_k \cdot \mathbf{1}(|X_k| \le k)$ satisfy

$$\sum_{k=1}^{\infty}\frac{\operatorname{Var}(X_k^*)}{k^2} < \infty,$$

which implies that for such a sequence the existence of the expectation is a necessary as well as sufficient condition for the SLLN.
Having argued that some of the conditions of the Bernoulli theorem did not contribute (in any essential way) to the result, the question that arises naturally is: ‘what are the important elements giving rise to the “law of large numbers” (SLLN, WLLN)?’ The Markov condition (9.18) for the WLLN and Kolmogorov’s condition (9.19) for the SLLN hold the key to the answer of this question. It is clear from these two conditions that the most important ingredient is the restriction on the variance of the partial sums $S_n$; that is, we need $\operatorname{Var}(S_n)$ to increase at most as quickly as $n$. More formally, we need $\operatorname{Var}(S_n)$ to be at most of order $n$, and we write $\operatorname{Var}(S_n) = O(n)$. In order to see this let us consider some of the cases discussed above.
In the IID case, if $\operatorname{Var}(X_i) = \sigma^2$ for all $i$, then $\operatorname{Var}(S_n) = n\sigma^2 = O(n)$. In the case of independent r.v.’s with $\operatorname{Var}(X_i) = \sigma_i^2 < c < \infty$, $i = 1, 2, \ldots$, then

$$\operatorname{Var}(S_n) = \sum_{i=1}^{n}\sigma_i^2 = O(n). \qquad (9.22)$$
Moreover, the Markov condition can be written as $\operatorname{Var}(S_n) = o(n^2)$, where small ‘o’ reads ‘of smaller order than’; this achieves the same effect, since $\operatorname{Var}(S_n) = O(n) \Rightarrow \operatorname{Var}(S_n) = o(n^2)$ (see Chapter 10). The Kolmogorov condition is a more restrictive form of the Markov condition, requiring the variance of the partial sums to be uniformly of at most order $n$. This being the case, it becomes obvious that the conditions LT3 and LT4, assuming independent and identically distributed r.v.’s, are not fundamental ingredients. Indeed, if we drop the identically distributed condition altogether and weaken independence to martingale orthogonality, the above limit theorems go through with minor modifications. We say that a sequence of r.v.’s $\{X_n, n \ge 1\}$ is martingale orthogonal if $E(X_n/\sigma(X_{n-1}, \ldots, X_1)) = 0$, $n > 1$. It should come as no surprise to learn that both important tools in proving the WLLN and SLLN, the Chebyshev and Kolmogorov inequalities, hold true for orthogonal r.v.’s. This enables us to prove the WLLN and SLLN under much weaker conditions than the ones discussed above. The most useful of these results are the ones related to martingales, because they can be seen as direct extensions of the ‘independent’ case and the results are general enough to cover most types of dependencies we are interested in.
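A martingale orthogonal sequence can be strongly dependent and yet still average out to zero. The sketch below (Python/numpy; the ARCH-type recursion is my illustrative choice, not the book’s) builds $X_n = \varepsilon_n\sqrt{a_0 + a_1 X_{n-1}^2}$, for which $E(X_n/\sigma(X_{n-1}, \ldots, X_1)) = 0$ even though the $X_n$’s are dependent, and tracks the sample mean:

```python
import numpy as np

rng = np.random.default_rng(seed=6)

def arch1_mean(n, a0=0.2, a1=0.5):
    """Sample mean of a dependent martingale-orthogonal (ARCH-type) sequence."""
    eps = rng.normal(0.0, 1.0, n)
    x = np.empty(n)
    x[0] = eps[0] * np.sqrt(a0 / (1.0 - a1))      # start near stationarity
    for t in range(1, n):
        x[t] = eps[t] * np.sqrt(a0 + a1 * x[t - 1] ** 2)   # E(x_t | past) = 0
    return x.mean()

for n in [10**3, 10**4, 10**5]:
    print(f"n = {n:6d}:  sample mean = {arch1_mean(n):+.5f}")
```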
(3) The law of large numbers for martingales
Let $\{S_n, \mathscr{F}_n, n \in \mathbb{N}\}$ be a martingale such that $E(S_n) = 0$ for all $n$, and define $Y_n = S_n - S_{n-1}$, $n \ge 1$ ($S_0 = 0$). As discussed in Section 8.4, if $S_n$ defines a martingale with respect to $\mathscr{F}_n$, then by construction $Y_n$ defines an orthogonal process, and thus, assuming a bounded variance for $Y_n$, the above limit theorems can go through with minor modifications.
WLLN for martingales
Let $\{X_n, n \ge 1\}$ be a sequence of r.v.’s defined with respect to the increasing sequence of $\sigma$-fields $\{\mathscr{F}_n, n \ge 1\}$ such that $E(|X_n|) < \infty$ and $\Pr(|X_n| > x) \le c\Pr(|X| > x)$ for $x > 0$ and $n \ge 1$, $c$ a constant (i.e. all the $X_n$’s are bounded by some r.v. $X$). Then

$$\frac{1}{n} S_n \xrightarrow{P} 0,$$

where $S_n = \sum_{i=1}^{n} Y_i$, with $Y_i = X_i - E(X_i/\mathscr{F}_{i-1})$, is a martingale with respect to $\mathscr{F}_n$, $n \ge 1$. An equivalent way to state the WLLN is

$$\lim_{n\to\infty}\Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left[X_i - E(X_i/\mathscr{F}_{i-1})\right]\right| \ge \varepsilon\right) = 0 \quad \text{for any } \varepsilon > 0.$$
SLLN for martingales
For the martingale $\{X_n, \mathscr{F}_n, n \ge 1\}$ satisfying the assumptions of the WLLN, if the sequences $\{X_n, n \ge 1\}$ and $\{E(X_n/\mathscr{F}_{n-1}), n \ge 1\}$ are stationary, then

$$\frac{1}{n}\sum_{i=1}^{n}\left[X_i - E(X_i/\mathscr{F}_{i-1})\right] \xrightarrow{a.s.} 0. \qquad (9.26)$$
This result shows clearly how the assumption of stationarity of $\{X_n, n \ge 1\}$ and $\{E(X_n/\mathscr{F}_{n-1}), n \ge 1\}$ (see Chapter 8) can strengthen the WLLN result to that of the SLLN.
The above discussion suggests that the most important ingredients of the Bernoulli theorem are that:
(i) we consider the probabilistic behaviour of centred r.v.’s of the form $Z_n = S_n - np = \sum_{i=1}^{n}[X_i - E(X_i)]$;
(ii) $\operatorname{Var}(S_n) = O(n)$; and
(iii) for $Y_i = X_i - E(X_i)$, the sequence $\{Y_n, n \ge 1\}$ is a martingale difference, i.e. $E(Y_n/\sigma(Y_{n-1}, \ldots, Y_1)) = 0$, $n \ge 1$.
This suggests that martingales provide a very convenient framework for these limit theorems because by definition they are r.v.’s with respect to an increasing sequence of $\sigma$-fields, and under some general conditions they converge to some r.v. as $n \to \infty$; the latter is of great importance when convergence to a non-degenerate r.v. is needed. Moreover, for any martingale sequence $\{X_n, \mathscr{F}_n, n \ge 1\}$ the martingale differences sequence $\{Y_n, n \ge 1\}$ defines a martingale orthogonal sequence of r.v.’s which can help us ensure (ii) above.
Remark: the SLLN is sometimes credited with providing a mathematical foundation for the frequency approach to probability. This is, however, erroneous, because the definition is rendered circular given that we need a notion of probability to define the SLLN in the first place.
9.3 The central limit theorem
As with the WLLN and SLLN, it was realised that LT2 was not contributing in any essential way to the De Moivre–Laplace theorem, and the literature considered sequences of r.v.’s with restrictions on the first few moments. Let $\{X_n, n \ge 1\}$ be a sequence of r.v.’s and $S_n = \sum_{i=1}^{n} X_i$; the CLT considers the limiting behaviour of

$$Y_n = \frac{S_n - E(S_n)}{\sqrt{\operatorname{Var}(S_n)}},$$

which is a normalised version of $S_n - E(S_n)$, the subject matter of the WLLN and SLLN.
Lindeberg–Levy theorem
Let $\{X_n, n \ge 1\}$ be a sequence of IID r.v.’s such that $E(X_i) = \mu$ and $\operatorname{Var}(X_i) = \sigma^2 < \infty$ for all $i$. Then, for $F_n(y)$ the DF of $Y_n$,

$$\lim_{n\to\infty} F_n(y) = \lim_{n\to\infty}\Pr(Y_n \le y) = \int_{-\infty}^{y}\frac{1}{\sqrt{2\pi}}\exp\{-\tfrac{1}{2}u^2\}\, du. \qquad (9.28)$$
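To visualise the Lindeberg–Levy theorem numerically, one can standardise sums of a markedly non-normal distribution and watch the empirical DF approach the normal DF. A hedged sketch (exponential r.v.’s and the evaluation grid are my own choices, not from the text):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(seed=7)
reps, mu, sigma = 50_000, 1.0, 1.0        # exponential(1): mean 1, variance 1
Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))

for n in [5, 50, 500]:
    X = rng.exponential(1.0, size=(reps, n))
    Y = (X.sum(axis=1) - n * mu) / (sigma * sqrt(n))   # the Y_n behind (9.28)
    grid = np.linspace(-3.0, 3.0, 61)
    gap = max(abs(np.mean(Y <= z) - Phi(z)) for z in grid)
    print(f"n = {n:4d}:  max |F_n(y) - Phi(y)| over the grid = {gap:.4f}")
```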
Liapunov’s theorem
Let $\{X_n, n \ge 1\}$ be a sequence of independent r.v.’s with

$$E(X_i) = \mu_i, \quad \operatorname{Var}(X_i) = \sigma_i^2 < \infty, \quad E(|X_i|^{2+\delta}) < \infty, \quad \delta > 0.$$

Define $c_n = \left(\sum_{i=1}^{n}\sigma_i^2\right)^{1/2}$. If

$$\lim_{n\to\infty}\frac{1}{c_n^{2+\delta}}\sum_{i=1}^{n} E\left(|X_i - \mu_i|^{2+\delta}\right) = 0, \qquad (9.29)$$

then

$$\lim_{n\to\infty} F_n(y) = \int_{-\infty}^{y}\frac{1}{\sqrt{2\pi}}\exp\{-\tfrac{1}{2}u^2\}\, du. \qquad (9.30)$$
Liapunov’s theorem is rather restrictive because it requires the existence of moments higher than the second. A more satisfactory result, providing both necessary and sufficient conditions, is the next theorem; Lindeberg in 1923 established the ‘if’ part and Feller in 1935 the ‘only if’ part.
Lindeberg–Feller theorem
Let $\{X_n, n \ge 1\}$ be a sequence of independent r.v.’s with distribution functions $\{F_n(x), n \ge 1\}$ such that
(i) $E(X_i) = \mu_i$;
(ii) $\operatorname{Var}(X_i) = \sigma_i^2 < \infty$, $i = 1, 2, \ldots \qquad (9.31)$

Then the relations

(a) $\displaystyle\lim_{n\to\infty}\max_{1\le i\le n}\frac{\sigma_i}{c_n} = 0$, where $c_n = \left(\sum_{i=1}^{n}\sigma_i^2\right)^{1/2}$; $\qquad (9.32)$

(b) $\displaystyle\lim_{n\to\infty} F_n(y) = \int_{-\infty}^{y}\frac{1}{\sqrt{2\pi}}\exp\{-\tfrac{1}{2}u^2\}\, du \qquad (9.33)$

hold true, if and only if,

$$\lim_{n\to\infty}\frac{1}{c_n^2}\sum_{i=1}^{n}\int_{|x-\mu_i|>\varepsilon c_n}(x-\mu_i)^2\, dF_i(x) = 0, \qquad (9.34)$$

i.e.

$$\sum_{i=1}^{n}\int_{|x-\mu_i|>\varepsilon c_n}(x-\mu_i)^2\, dF_i(x) = o(c_n^2) \quad \text{for all } \varepsilon > 0. \qquad (9.35)$$
The necessary and sufficient condition is known as the Lindeberg condition
and provides an intuitive insight into ‘what really gives rise to the result’.
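The Lindeberg condition can be evaluated in closed form for simple distributions, which makes its meaning concrete: no single summand’s tail may contribute non-negligibly to the total variance. The following sketch (my illustrative choice of uniform r.v.’s with spreading supports, not an example from the text) computes the ratio in (9.34) exactly:

```python
import numpy as np

def lindeberg_ratio(n, eps):
    """Exact value of (1/c_n^2) * sum_i integral_{|x|>eps*c_n} x^2 dF_i(x)
    for independent X_i ~ Uniform(-a_i, a_i) with a_i = sqrt(i), mu_i = 0."""
    a = np.sqrt(np.arange(1, n + 1))
    c_n = np.sqrt(np.sum(a**2 / 3.0))        # sigma_i^2 = a_i^2 / 3
    t = eps * c_n
    # For the uniform density 1/(2a): integral_{|x|>t} x^2 dF = (a^3 - t^3)/(3a)
    # whenever t < a, and zero otherwise (the support lies inside the threshold).
    tails = np.where(a > t, (a**3 - t**3) / (3.0 * a), 0.0)
    return np.sum(tails) / c_n**2

for n in [10, 100, 1_000, 10_000]:
    print(f"n = {n:6d}:  Lindeberg ratio (eps = 0.1) = {lindeberg_ratio(n, 0.1):.6f}")
```

Once $\max_i a_i \le \varepsilon c_n$, every truncated integral vanishes and the ratio is exactly zero, so the condition holds and the CLT applies.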