PART II
Probability theory
CHAPTER 3
Probability
‘Why do we need probability theory in analysing observed data?’ In the descriptive study of data considered in the previous chapter it was emphasised that the results cannot be generalised outside the observed data under consideration. Any question relating to the population from which the observed data were drawn cannot be answered within the descriptive statistics framework. In order to be able to do that we need the theoretical framework offered by probability theory. In effect, probability theory develops a mathematical model which provides the logical foundation of statistical inference procedures for analysing observed data.
In developing a mathematical model we must first identify the important features, relations and entities in the real world phenomena, and then devise the concepts and choose the assumptions with which to project a generalised description of these phenomena; an idealised picture of these phenomena. The model, as a consistent mathematical system, has a ‘life of its own’ and can be analysed and studied without direct reference to real world phenomena. Moreover, by definition a model should not be judged as ‘true’ or ‘false’, because we have no means of making such judgments (see Chapter 26). A model can only be judged as a ‘good’ or ‘better’ approximation to the ‘reality’ it purports to explain if it enables us to come to grips with the phenomena in question; that is, whether in studying the model’s behaviour the patterns and results revealed can help us identify and understand the real phenomena within the theory’s intended scope.
The main aim of the present chapter is to construct a theoretical model for probability theory. In Section 3.1 we consider the notion of probability itself as a prelude to the axiomatisation of the concept in Section 3.2. The probability model developed comes in the form of a probability space (S, ℱ, P(·)). In Section 3.3 this is extended to a conditional probability space.
3.1 The notion of probability
The theory of probability had its origins in gambling and games of chance in the mid-seventeenth century, and its early history is associated with the names of Huygens, Pascal, Fermat and Bernoulli. This early development of probability was rather sporadic and without any rigorous mathematical foundations. The first attempts at some mathematical rigour, and a more sophisticated analytical apparatus than just combinatorial reasoning, are credited to Laplace, De Moivre, Gauss and Poisson (see Maistrov (1974)). Laplace proposed what is known today as the classical definition of probability:
Definition 1

If a random experiment can result in N mutually exclusive and equally likely outcomes, and if N_A of these outcomes result in the occurrence of the event A, then the probability of A is defined by

P(A) = N_A / N.    (3.1)
To illustrate the definition let us consider the random experiment of tossing a fair coin twice and observing the face which shows up. The set of all equally likely outcomes is

S = {(HT), (TH), (HH), (TT)}, with N = 4.

Let the event A be ‘observing at least one head (H)’; then

A = {(HT), (TH), (HH)}.
Since N_A = 3, P(A) = 3/4. Applying the classical definition in the above example is rather straightforward, but in general it can be a tedious exercise in combinatorics (see Feller (1968)). Moreover, there are a number of serious shortcomings to this definition of probability, which render it totally inadequate as a foundation for probability theory. The obvious limitations of the classical approach are:
(i) it is applicable to situations where there is only a finite number of
possible outcomes; and
(ii) the ‘equally likely’ condition renders the definition circular.

Some important random experiments, even in gambling games (in response
to which the classical approach was developed), give rise to a set of infinite outcomes. For example, the game played by tossing a coin until it turns up heads gives rise to the infinite set of possible outcomes S = {(H), (TH), (TTH), (TTTH), . . .}; it is conceivable that somebody could flip a coin indefinitely without ever turning up heads! The idea of ‘equally likely’ is
synonymous with ‘equally probable’; thus probability is defined using the idea of probability! Moreover, the definition is applicable to situations where an apparent ‘objective’ symmetry exists, which raises not only the question of circularity but also of how this definition can be applied to the case of a biased coin, or to the probability that next year’s rate of inflation in the UK will be 10%. Where are the ‘equally likely’ outcomes, and which ones result in the occurrence of the event? These objections were well known even by the founders of this approach, and since the 1850s several attempts have been made to resolve the problems related to the ‘equally likely’ presupposition and extend the area of applicability of probability theory.
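For a finite sample space the classical computation is easily mechanised by enumeration. The following is a minimal illustrative sketch of Definition 1 for the two-coin example above (representing outcomes as tuples of faces is just one possible choice):

```python
from itertools import product

# Sample space for tossing a fair coin twice: N equally likely outcomes.
S = list(product("HT", repeat=2))   # [('H','H'), ('H','T'), ('T','H'), ('T','T')]

# Event A: 'at least one head'.
A = [outcome for outcome in S if "H" in outcome]

# Classical definition: P(A) = N_A / N.
N, N_A = len(S), len(A)
print(f"P(A) = {N_A}/{N} = {N_A / N}")   # prints: P(A) = 3/4 = 0.75
```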
The most influential of the approaches suggested in an attempt to tackle
the problems posed by the classical approach are the so-called frequency
and subjective approaches to probability. The frequency approach had its origins in the writings of Poisson, but it was not until the late 1920s that Von Mises put forward a systematic account of the approach. The basic argument of the frequency approach is that probability does not have to be restricted to situations of apparent symmetry (equally likely), since the notion of probability should be interpreted as stemming from the observable stability of empirical frequencies. For example, in the case of a fair coin we say that the probability of A = {H} is 1/2, not because there are two equally likely outcomes, but because repeated series of large numbers of trials demonstrate that the empirical frequency of occurrence of A ‘converges’ to the limit 1/2 as the number of trials goes to infinity. If we denote by n_A the number of occurrences of an event A in n trials, then if

lim_(n→∞) (n_A / n) = P_A,    (3.2)

we say that P(A) = P_A. Fig. 3.1 illustrates this notion for the case of A = {H}
in a typical example of 100 trials. As can be seen, although there are some ‘wild fluctuations’ of the relative frequency for a small number of trials, as these increase the relative frequency tends to ‘settle’ (converge) around 1/2. Despite the fact that the frequency approach seems to be an improvement over the classical approach, giving objective status to the notion of probability by rendering it a property of real world phenomena, there are some obvious objections to it. ‘What is meant by “limit as n goes to infinity”?’ ‘How can we generate infinite sequences of trials?’ ‘What happens to phenomena where repeated trials are not possible?’
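The stability of relative frequencies can be illustrated by simulation; a minimal sketch, with a pseudo-random generator standing in for repeated physical trials:

```python
import random

random.seed(1)   # fixed seed so the illustration is reproducible

n_A = 0
for n in range(1, 101):           # 100 trials, as in Fig. 3.1
    if random.random() < 0.5:     # a 'fair coin': heads with probability 1/2
        n_A += 1
    if n in (10, 50, 100):
        print(f"n = {n:3d}: relative frequency n_A/n = {n_A / n:.3f}")
```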
Fig. 3.1 Observed relative frequency (n_A/n) of an experiment with 100 coin tossings.

The subjective approach to probability gives the notion of probability a subjective status by regarding it as ‘degrees of belief’ on behalf of individuals assessing the uncertainty of a particular situation. The
protagonists of this approach are, inter alia, Ramsey (1926), de Finetti (1937), Savage (1954), Keynes (1921) and Jeffreys (1961); see Barnett (1973) and Leamer (1978) on the differences between the frequency and subjective approaches, as well as the differences among the subjectivists.

Recent statistical controversies are mainly due to the attitudes adopted towards the frequency and subjective definitions of probability. Although these controversies are well beyond the material covered in this book, it is advisable to remember that the two approaches lead to alternative methods of statistical inference. The frequentists will conduct the discussion around what happens ‘in the long-run’ or ‘on average’, and attempt to develop ‘objective’ procedures which perform well according to these criteria. On the other hand, a subjectivist will be concerned with the question of revising prior beliefs in the light of the available information in the form of the observed data, and thus devise methods and techniques to answer such questions (see Barnett (1973)). Although the question of the meaning of probability was high on the agenda of probabilists from the mid-nineteenth century, this did not get in the way of impressive developments in the subject; in particular, the systematic development of mathematical techniques related to what we nowadays call limit theorems (see Chapter 9). These developments were mainly the work of the Russian School (Chebyshev, Markov, Liapounov and Bernstein). By the 1920s there was a wealth of such results and probability began to grow into a systematic body
of knowledge. Although various people attempted a systematisation of probability, it was the work of the Russian mathematician Kolmogorov which proved to be the cornerstone for a systematic approach to
probability theory. Kolmogorov managed to relate the concept of probability to that of a measure in integration theory, and exploited to the full the analogies between set theory and the theory of functions on the one hand and the concept of a random variable on the other. In a monumental monograph in 1933 he proposed an axiomatisation of probability theory, establishing it once and for all as part of mathematics proper. There is no doubt that this monograph proved to be the watershed for the later development of probability theory, which grew enormously in importance and applicability. Probability theory today plays a very important role in many disciplines, including physics, chemistry, biology, sociology and economics.
3.2 The axiomatic approach
The axiomatic approach to probability proceeds from a set of axioms (accepted without questioning as obvious), which are based on many centuries of human experience; the subsequent development is built deductively using formal logical arguments, like any other part of mathematics such as geometry or linear algebra. In mathematics an axiomatic system is required to be complete, non-redundant and consistent. By complete we mean that the set of axioms postulated should enable us to prove every other theorem in the theory in question using the axioms and mathematical logic. The notion of non-redundancy refers to the impossibility of deriving any axiom of the system from the other axioms. Consistency refers to the non-contradictory nature of the axioms.
A probability model is by construction intended to be a description of a chance mechanism giving rise to observed data. The starting point of such a model is provided by the concept of a random experiment describing a simplistic and idealised process giving rise to observed data.
Definition 2

A random experiment, denoted by ℰ, is an experiment which satisfies the following conditions:

(a) all possible distinct outcomes are known a priori;
(b) in any particular trial the outcome is not known a priori; and
(c) it can be repeated under identical conditions.
Although at first sight this might seem very unrealistic, even as a model of a chance mechanism, it will be shown in the following chapters that it can be extended to provide the basis for much more realistic probability and statistical models.
The axiomatic approach to probability theory can be viewed as a formalisation of the concept of a random experiment. In an attempt to
formalise condition (a), ‘all possible distinct outcomes are known a priori’, Kolmogorov devised the set S, which includes ‘all possible distinct outcomes’ and has to be postulated before the experiment is performed.
Definition 3

The sample space, denoted by S, is defined to be the set of all possible outcomes of the experiment ℰ. The elements of S are called elementary events.
Example

Consider the random experiment ℰ of tossing a fair coin twice and observing the faces turning up. The sample space of ℰ is

S = {(HT), (TH), (HH), (TT)},

with (HT), (TH), (HH), (TT) being the elementary events belonging to S.

The second ingredient of ℰ to be formulated relates to (b), and in particular
to the various forms events can take. A moment’s reflection suggests that there is no particular reason why we should be interested in elementary outcomes only. For example, in the coin experiment we might be interested in such events as A_1, ‘at least one H’, and A_2, ‘at most one H’, and these are not elementary events; in particular
A_1 = {(HT), (TH), (HH)}

and

A_2 = {(HT), (TH), (TT)}
are combinations of elementary events. All such outcomes are called events associated with the sample space S, and they are defined by ‘combining’ elementary events. Understanding the concept of an event is crucial for the discussion which follows. Intuitively, an event is any proposition associated with ℰ which may occur or not at each trial. We say that event A_i occurs when any one of the elementary events it comprises occurs. Thus, when a trial is made only one elementary event is observed, but a large number of events may have occurred. For example, if the elementary event (HT) occurs in a particular trial, A_1 and A_2 have occurred as well.
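The notion of an event ‘occurring’ can be made concrete with set membership; a minimal sketch using the coin example (the names A1 and A2 follow the text):

```python
# Elementary events as tuples; events as sets of elementary events.
A1 = {("H","T"), ("T","H"), ("H","H")}   # 'at least one H'
A2 = {("H","T"), ("T","H"), ("T","T")}   # 'at most one H'

trial = ("H","T")   # the elementary event observed in one trial
for name, event in [("A1", A1), ("A2", A2)]:
    print(name, "occurred" if trial in event else "did not occur")
# Both A1 and A2 occur when (HT) is observed.
```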
Given that S is a set whose members are the elementary events, this takes us immediately into the realm of set theory, and events can be formally defined to be subsets of S formed by set theoretic operations (‘∪’, union; ‘∩’, intersection; ‘¯’, complementation) on the elementary events (see Binmore
(1980)). For example,

A_3 = {(HT)} ∪ {(TH)} ∪ {(HH)} = S − {(TT)} ⊂ S,

i.e. ‘two tails’ does not occur, and

A_4 = {(HT)} ∪ {(TH)} ∪ {(TT)} = S − {(HH)} ⊂ S,

i.e. ‘two heads’ does not occur.
Two special events are S itself, called the sure event, and the impossible event ∅, defined to contain no elements of S, i.e. ∅ = { }; the latter is defined for completeness.
A third ingredient of ℰ associated with (b) which Kolmogorov had to formalise was the idea of uncertainty related to the outcome of any particular trial of ℰ. This he formalised in the notion of probabilities attributed to the various events associated with ℰ, such as P(A_1), P(A_2), expressing the ‘likelihood’ of occurrence of these events. Although attributing probabilities to the elementary events presents no particular mathematical problems, doing the same for events in general is not as straightforward. The difficulty arises because if A_1 and A_2 are events, then Ā_1 = S − A_1, Ā_2 = S − A_2, A_1 ∪ A_2, A_1 ∩ A_2, A_1 − A_2, etc., are also events, because the occurrence or non-occurrence of A_1 and A_2 implies the occurrence or not of these events. This implies that for the attribution of probabilities to make sense we have to impose some mathematical structure on the set of all events, say ℱ, which reflects the fact that whichever way we combine these events, the end result is always an event. The temptation at this stage is to define ℱ to be the set of all subsets of S, called the power set; surely this covers all possibilities! In the above example the power set of S takes the form
ℱ = {S, ∅, {(HT)}, {(TH)}, {(HH)}, {(TT)}, {(HT),(TH)}, {(HT),(HH)}, {(HT),(TT)}, {(TH),(HH)}, {(TH),(TT)}, {(HH),(TT)}, {(HT),(TH),(HH)}, {(HT),(TH),(TT)}, {(HT),(HH),(TT)}, {(TH),(HH),(TT)}}.
It can be easily checked that whichever way we combine any events in ℱ we end up with events in ℱ. For example,

{(HH),(TH)} ∪ {(TH),(HT)} = {(HH),(TH),(HT)} ∈ ℱ, etc.
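This closure claim can be checked mechanically for the finite example; a minimal sketch that builds the power set and verifies closure under union, intersection and complementation (the string abbreviations for outcomes are just a convenience):

```python
from itertools import chain, combinations

S = {"HT", "TH", "HH", "TT"}   # outcomes abbreviated to strings

# The power set of S: all 2**4 = 16 subsets (events).
F = [frozenset(c) for c in
     chain.from_iterable(combinations(S, r) for r in range(len(S) + 1))]

# Closure: union, intersection and complement of events are again events.
closed = all(a | b in F and a & b in F and frozenset(S) - a in F
             for a in F for b in F)
print(len(F), closed)   # prints: 16 True
```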
It turns out that in most cases, where the power set does not lead to any inconsistencies in attributing probabilities, we define the set of events ℱ to be the power set of S. But when S is infinite or uncountable (it has as many elements as there are real numbers), or when we are interested in some but not all possible events, inconsistencies can arise. For example, let S = {A_1, A_2, . . .} be such that A_i ∩ A_j = ∅ (i ≠ j), i, j = 1, 2, . . ., ∪_{i=1}^∞ A_i = S and P(A_i) = a > 0 for all i, where P(A) refers to the probability assigned to the event A. Then P(S) = Σ_{i=1}^∞ P(A_i) = Σ_{i=1}^∞ a > 1 (see below), which is an absurd probability, being greater than one; similar inconsistencies arise when S is uncountable. Apart from these inconsistencies, sometimes we are not interested in all the subsets of S. Hence, we need to define ℱ independently of the power set by endowing it with a mathematical structure which ensures that no inconsistencies arise. This is achieved by requiring that ℱ has a special mathematical structure: it is a σ-field related to S.
Definition 4

Let ℱ be a set of subsets of S. ℱ is called a σ-field if:

(i) if A ∈ ℱ, then Ā ∈ ℱ (closure under complementation); and
(ii) if A_i ∈ ℱ, i = 1, 2, . . ., then (∪_{i=1}^∞ A_i) ∈ ℱ (closure under countable unions).

Note that (i) and (ii) taken together imply the following:

(iii) S ∈ ℱ (because A ∪ Ā = S);
(iv) ∅ ∈ ℱ (from (iii), S̄ = ∅ ∈ ℱ); and
(v) if A_i ∈ ℱ, i = 1, 2, . . ., then (∩_{i=1}^∞ A_i) ∈ ℱ.
These suggest that a σ-field is a set of subsets of S which is closed under complementation, countable unions and countable intersections; that is, any of these operations on the elements of ℱ will give rise to an element of ℱ. It can be checked that the power set of S is indeed a σ-field, and so is the set

ℱ_1 = {S, ∅, {(HT)}, {(TH),(HH),(TT)}},

but the set C = {{(HT),(TH)}} is not, because ∅ ∉ C, S ∉ C and the complement {(HH),(TT)} ∉ C.
What we can do, however, in the latter case is to start from C and construct the minimal σ-field generated by its elements. This can be achieved by extending C to include all the events generated by set theoretic operations (unions, intersections, complementations) on the elements of C. The minimal σ-field generated by C is then

σ(C) = {S, ∅, {(HT),(TH)}, {(HH),(TT)}},

and we denote it by ℱ = σ(C).
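For a finite S this construction can be carried out mechanically by repeated closure; a sketch of one possible implementation (the fixed-point loop is an illustrative choice, not a procedure from the text):

```python
def sigma_field(S, C):
    """Minimal sigma-field over a finite S containing the events in C."""
    S = frozenset(S)
    F = {frozenset(), S} | {frozenset(A) for A in C}
    while True:
        # Close under complementation and (finite = countable here) unions;
        # closure under intersections then follows by De Morgan's laws.
        new = {S - A for A in F} | {A | B for A in F for B in F}
        if new <= F:          # nothing new generated: closure reached
            return F
        F |= new

events = sigma_field({"HT", "TH", "HH", "TT"}, [{"HT", "TH"}])
for e in sorted(events, key=len):
    print(set(e) or "{}")
# Prints the four events of sigma(C): {}, {HT,TH}, {HH,TT}, S.
```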
This way of constructing a σ-field can be very useful in cases where the events of interest are fewer than the ones given by the power set in the case of a finite S. For example, if we are interested in events involving one of each of H and T, there is no point in defining the σ-field to be the power set, and σ(C) can do just as well, with fewer events to attribute probabilities to. The usefulness of this method of constructing σ-fields is much greater in the case where S is either infinite or uncountable; in such cases this method is indispensable. Let us
consider an example where S is uncountable and discuss the construction of such a σ-field.
Example
Let S be the real line ℝ = {x : −∞ < x < ∞} and let the set of events of interest be

J = {B_x : x ∈ ℝ}, where B_x = {z : z ≤ x} = (−∞, x].    (3.3)

This is an educated choice, which will prove to be very useful in the sequel. How can we construct a σ-field on ℝ? The definition of a σ-field suggests
that if we start from the events B_x, x ∈ ℝ, then extend this set to include B̄_x and take countable unions of B_x and B̄_x, we should be able to define a σ-field on ℝ: σ(J), the minimal σ-field generated by the events B_x, x ∈ ℝ. By
definition B_x ∈ σ(J). If we take complements of B_x: B̄_x = {z : z ∈ ℝ, z > x} = (x, ∞) ∈ σ(J). Taking countable unions of B_x: ∪_{n=1}^∞ (−∞, x − (1/n)] = (−∞, x) ∈ σ(J). These imply that σ(J) is indeed a σ-field. In order to see how large a collection σ(J) is, we can show that events of the form (x, ∞), [x, ∞), (x, z] for x < z, and {x} also belong to σ(J), using set theoretic operations; for example:

{x} = ∩_{n=1}^∞ (x − (1/n), x] ∈ σ(J).
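The other cases follow by standard set theoretic manipulations along the same lines, for instance:

(x, ∞) = ℝ − (−∞, x] ∈ σ(J),
[x, ∞) = ℝ − (−∞, x) ∈ σ(J),
(x, z] = (−∞, z] ∩ (x, ∞) ∈ σ(J), for x < z.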
This shows not only that σ(J) is a σ-field but also that it includes almost every conceivable subset (event) of ℝ; that is, it coincides with the σ-field generated by any such set of subsets of ℝ, which we denote by ℬ, i.e. σ(J) = ℬ. The σ-field ℬ will play a very important role in the sequel; we call it the Borel field on ℝ.
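As an aside, these derivations can be mimicked computationally; a toy sketch (the helper names are hypothetical) showing how the generating half-lines B_x determine membership in the derived events by simple comparisons:

```python
# Each half-line B_x = (-inf, x] is summarised by its endpoint x alone;
# membership in the derived Borel sets mirrors the derivations above.
def in_B(z, x):           return z <= x      # B_x = (-inf, x]
def in_complement(z, x):  return z > x       # (x, inf) = R - (-inf, x]
def in_open(z, x):        return z < x       # (-inf, x) = union of (-inf, x - 1/n]
def in_interval(z, x, y): return x < z <= y  # (x, y] = (-inf, y] intersect (x, inf)

print(in_B(1.0, 1.0), in_interval(0.5, 0.0, 1.0), in_open(1.0, 1.0))
# prints: True True False
```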
Having solved the technical problem of possible inconsistencies in attributing probabilities to events, by postulating the existence of a σ-field ℱ associated with the sample space S, Kolmogorov went on to formalise the concept of probability itself.