3.3 Conditional Probability and Independence


In order to define the concept of a conditional probability, it is necessary to discuss joint probabilities and marginal probabilities. A joint probability is the probability that two random events occur together. For example, consider drawing two cards from a deck of cards; there are 52 × 51 = 2,652 different ordered combinations of the first two cards drawn from the deck. The marginal probability is the overall probability of a single event, such as the probability of drawing a given card. The conditional probability of an event is the probability of that event given that some other event has occurred. Taking the roll of a single die, for example, what is the probability that the die shows a one if you know that the face number is odd? The answer is 1/3. Note, however, that if you know the roll of the die is a one, the probability of the roll being odd is 1.
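The die example can be reproduced by direct enumeration of the sample space. The short Python sketch below (a minimal illustration added here, not from the text) counts outcomes to recover P(one | odd) = 1/3 and P(odd | one) = 1.

```python
# Conditional probability by enumeration for a single fair die.
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]               # equally likely outcomes
odd = {f for f in faces if f % 2 == 1}   # event: the face is odd
one = {1}                                # event: the face is a one

def p(event):
    """Marginal probability of an event (a set of faces)."""
    return Fraction(len(event), len(faces))

def p_cond(a, b):
    """Conditional probability P(A|B) = P(A ∩ B) / P(B)."""
    return p(a & b) / p(b)

print(p_cond(one, odd))   # 1/3: P(one | odd)
print(p_cond(odd, one))   # 1:   P(odd | one)
```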

As a starting point, consider the requirements (axioms) for a conditional probability to be valid.

Definition 3.13. Axioms of Conditional Probability:

1. P(A|B) ≥ 0 for any event A.

2. P(A|B) = 1 for any event A ⊃ B.

3. If {A_i ∩ B}, i = 1, 2, . . ., are mutually exclusive, then

P(A_1 ∪ A_2 ∪ · · · |B) = P(A_1|B) + P(A_2|B) + · · · .   (3.37)

4. If B ⊃ H, B ⊃ G, and P(G) ≠ 0, then

\frac{P(H|B)}{P(G|B)} = \frac{P(H)}{P(G)}.   (3.38)

Note that Axioms 1 through 3 follow the general probability axioms with the addition of a conditioning term. The new axiom (Axiom 4) states that two events conditioned on the same set stand in the same relationship as their overall (as we will develop shortly, marginal) probabilities. Intuitively, the conditioning set brings in no additional information about the relative likelihood of the two events.

Theorem 3.14 provides a formal definition of conditional probability.

Theorem 3.14. P(A|B) = P(A ∩ B)/P(B) for any pair of events A and B such that P(B) > 0.

Taking this piece by piece – P(A ∩ B) is the probability that both A and B will occur (i.e., the joint probability of A and B), and P(B) is the probability that B will occur. Hence, the conditional probability P(A|B) is the joint probability of A and B rescaled by the probability that B has occurred.

Some texts refer to Theorem 3.14 as Bayes’ theorem; however, in this text we will define Bayes’ theorem as depicted in Theorem 3.15.

Theorem 3.15 (Bayes’ Theorem). Let events A_1, A_2, . . . , A_n be mutually exclusive events such that P(A_1 ∪ A_2 ∪ · · · ∪ A_n) = 1 and P(A_i) > 0 for each i. Let E be an arbitrary event such that P(E) > 0. Then

P(A_i|E) = \frac{P(E|A_i) P(A_i)}{\sum_{j=1}^{n} P(E|A_j) P(A_j)}.   (3.39)

While Equation 3.39 appears different from the specification in Theorem 3.14, we can demonstrate that they are the same concept. First, let us use the relationship in Theorem 3.14 to define the probability of the joint event E ∩ A_i:

P(E ∩ A_i) = P(E|A_i) P(A_i).   (3.40)

Next, if we assume that events A_1, A_2, · · · are mutually exclusive and exhaustive, we can rewrite the probability of event E as

P(E) = \sum_{i=1}^{n} P(E|A_i) P(A_i).   (3.41)

Combining the results of Equations 3.40 and 3.41 yields the friendlier version of Bayes’ theorem found in Theorem 3.14:

P(A_i|E) = \frac{P(E ∩ A_i)}{P(E)}.   (3.42)

Notice the direction of the conditional statement – if we know that event E has occurred, what is the probability that event A_i will occur?
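To make the mechanics of Equation 3.39 concrete, the following sketch (a hypothetical numerical example; the priors and likelihoods are invented for illustration and are not from the text) computes the posteriors P(A_i|E) from priors P(A_i) and likelihoods P(E|A_i).

```python
# Bayes' theorem for mutually exclusive, exhaustive events A_1, ..., A_n (Equation 3.39).
priors = [0.5, 0.3, 0.2]           # P(A_i); hypothetical values summing to one
likelihoods = [0.9, 0.5, 0.1]      # P(E|A_i); hypothetical values

# Denominator of Equation 3.39, which is also P(E) as in Equation 3.41.
p_e = sum(l * p for l, p in zip(likelihoods, priors))

# Posterior probabilities P(A_i|E) = P(E|A_i) P(A_i) / P(E).
posteriors = [l * p / p_e for l, p in zip(likelihoods, priors)]

print(p_e)          # 0.62
print(posteriors)   # [0.7258..., 0.2419..., 0.0322...]; the posteriors sum to one
```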

Given this understanding of conditional probability, it is possible to define statistical independence. One event is independent of another if conditioning on the second event does not change the probability of the first.

Definition 3.16. Events A and B are said to be independent if P(A) = P(A|B).

Hence, the event A is independent of the event B if knowing that B has occurred does not change the probability of A. Extending the scenario to the case of three events:

Definition 3.17. Events A, B, and C are said to be mutually independent if the following equalities hold:

a) P(A ∩ B) = P(A) P(B)
b) P(A ∩ C) = P(A) P(C)
c) P(B ∩ C) = P(B) P(C)
d) P(A ∩ B ∩ C) = P(A) P(B) P(C)

3.3.1 Conditional Probability and Independence for Discrete Random Variables

In order to develop the concepts of conditional probability and independence, we start by analyzing the discrete bivariate case. As a starting point, we define the marginal probability of a random variable as the probability that a given value of one random variable will occur (i.e., X = x_i) regardless of the value of the other random variable. For this discussion, we simplify our notation slightly so that P[X = x_i ∩ Y = y_j] = P[X = x_i, Y = y_j] = P[x_i, y_j]. The marginal distribution for x_i can then be defined as

P[x_i] = \sum_{j=1}^{m} P[x_i, y_j].   (3.43)

Turning to the binomial probability presented in Table 3.3, the marginal probability that X = x_1 (i.e., X = 0) can be computed as

P[x_1] = P[x_1, y_1] + P[x_1, y_2] + · · · + P[x_1, y_6]
       = 0.01315 + 0.04342 + 0.05790 + 0.03893 + 0.01300 + 0.00158 = 0.16798.   (3.44)

By repetition, the marginal value for each x_i and y_j is presented in Table 3.3.

Applying a discrete form of Bayes’ theorem,

P[x_i|y_j] = \frac{P(x_i, y_j)}{P(y_j)},   (3.45)

we can compute the conditional probability of X = 0 given Y = 2 as

P[x_1|y_3] = \frac{0.05790}{0.34469} = 0.16798.   (3.46)

TABLE 3.3
Binomial Probability

                                    y                              Marginal
  x          0        1        2        3        4        5      Probability
  0       0.01315  0.04342  0.05790  0.03893  0.01300  0.00158     0.16798
  1       0.02818  0.09304  0.12408  0.08343  0.02786  0.00339     0.35998
  2       0.02417  0.07979  0.10640  0.07155  0.02389  0.00290     0.30870
  3       0.01039  0.03430  0.04574  0.03075  0.01027  0.00125     0.13270
  4       0.00222  0.00733  0.00978  0.00658  0.00220  0.00027     0.02838
  5       0.00018  0.00059  0.00079  0.00053  0.00018  0.00000     0.00227
  Marginal
  Probability 0.07829  0.25847  0.34469  0.23177  0.07740  0.00939
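The marginal and conditional probabilities reported in Tables 3.3 and 3.4 can be reproduced directly from the joint probabilities. The NumPy sketch below (a minimal illustration of Equations 3.43 and 3.45; the array simply transcribes Table 3.3) sums over rows and columns to obtain the marginals and then rescales the Y = 2 column.

```python
import numpy as np

# Joint probabilities P[x_i, y_j] from Table 3.3 (rows: x = 0..5, cols: y = 0..5).
joint = np.array([
    [0.01315, 0.04342, 0.05790, 0.03893, 0.01300, 0.00158],
    [0.02818, 0.09304, 0.12408, 0.08343, 0.02786, 0.00339],
    [0.02417, 0.07979, 0.10640, 0.07155, 0.02389, 0.00290],
    [0.01039, 0.03430, 0.04574, 0.03075, 0.01027, 0.00125],
    [0.00222, 0.00733, 0.00978, 0.00658, 0.00220, 0.00027],
    [0.00018, 0.00059, 0.00079, 0.00053, 0.00018, 0.00000],
])

# Marginals: Equation 3.43 sums the joint probabilities over the other variable.
p_x = joint.sum(axis=1)   # P[x_i], matches the right-hand column of Table 3.3
p_y = joint.sum(axis=0)   # P[y_j], matches the bottom row of Table 3.3

# Conditional distribution of X given Y = 2: Equation 3.45.
p_x_given_y2 = joint[:, 2] / p_y[2]

print(np.round(p_x, 5))           # [0.16798 0.35998 0.3087  0.1327  0.02838 0.00227]
print(np.round(p_x_given_y2, 5))  # approximately equal to P[x_i] -- see Table 3.4
```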

TABLE 3.4
Binomial Conditional Probabilities

  X    P[X, Y = 2]   P[Y = 2]   P[X|Y = 2]    P[X]
  0      0.05790      0.34469     0.16798    0.16798
  1      0.12408      0.34469     0.35998    0.35998
  2      0.10640      0.34469     0.30868    0.30870
  3      0.04574      0.34469     0.13270    0.13270
  4      0.00978      0.34469     0.02837    0.02838
  5      0.00079      0.34469     0.00229    0.00227

Table 3.4 presents the conditional probability for each value of X given Y = 2.

Next, we offer a slightly different definition of independence for the discrete bivariate random variable.

Definition 3.18. Discrete random variables are said to be independent if the events X = x_i and Y = y_j are independent for all i, j. That is to say, P(x_i, y_j) = P(x_i) P(y_j).

To demonstrate the consistency of Definition 3.18 with Definition 3.16, note that

P[x_i] = P[x_i|y_j] ⇒ P[x_i] = \frac{P[x_i, y_j]}{P[y_j]}.   (3.47)

Therefore, multiplying each side of the last equality in Equation 3.47 by P[y_j] yields P[x_i] × P[y_j] = P[x_i, y_j].

Thus, we determine independence by checking whether the P[x_i, y_j] values equal P[x_i] × P[y_j]. Taking the first cell, we check that

P[x_1] × P[y_1] = 0.16798 × 0.07829 = 0.01315 = P[x_1, y_1].   (3.48)

Carrying out this process for each cell in Table 3.3 confirms that X and Y are independent. This result can be demonstrated in a second way (more consistent with Definition 3.16). Note that the P[X|Y = 2] column in Table 3.4 equals the P[X] column – the conditional is equal to the marginal in all cases.
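The product check in Equation 3.48 can be automated over every cell of the joint table. Continuing the hypothetical NumPy sketch above (reusing the joint, p_x, and p_y arrays built from Table 3.3), independence requires each joint probability to equal the product of its marginals, up to the rounding of the tabulated values.

```python
# Independence check per Definition 3.18: P[x_i, y_j] = P[x_i] P[y_j] for all i, j.
product = np.outer(p_x, p_y)              # matrix of P[x_i] * P[y_j]
max_gap = np.abs(joint - product).max()   # largest discrepancy across all cells

print(max_gap)                                  # on the order of the rounding error
print(np.allclose(joint, product, atol=1e-4))   # True: X and Y are (approximately) independent
```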

Next, we consider the discrete form of the uncorrelated normal distribution as presented in Table 3.5. Again, computing the conditional distribution of X such that Y = 2 yields the results in Table 3.6.

Theorem 3.19. Discrete random variables X and Y with the probability distribution given in Table 3.1 are independent if and only if every row is proportional to any other row, or, equivalently, every column is proportional to any other column.
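Theorem 3.19 can be verified numerically as well: when X and Y are independent, every row of the joint table, once rescaled to sum to one, reproduces the marginal distribution of Y. A short check (continuing the same hypothetical NumPy example) is given below; the tolerance is loose because the tabulated entries are rounded to five decimals.

```python
# Row-proportionality check (Theorem 3.19): normalize each row of the joint table.
row_conditionals = joint / joint.sum(axis=1, keepdims=True)   # each row becomes P[y_j | x_i]

# For an independent table, every normalized row equals the marginal distribution of Y.
print(np.allclose(row_conditionals, p_y, atol=1e-2))          # True for Table 3.3
```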

Finally, we consider a discrete form of the correlated normal distribution in Table 3.7. To examine whether the events are independent, we compute the conditional probability for X when Y = 2 and compare this conditional distribution with the marginal distribution of X. The results presented in Table 3.8 indicate that the random variables are not independent.

TABLE 3.5
Uncorrelated Discrete Normal

                                    y                              Marginal
  x          0        1        2        3        4        5      Probability
  0       0.00610  0.01708  0.02520  0.01707  0.00529  0.00080     0.07154
  1       0.02058  0.05763  0.08503  0.05761  0.01786  0.00271     0.24142
  2       0.03193  0.08940  0.13191  0.08936  0.02770  0.00421     0.37451
  3       0.02054  0.05752  0.08488  0.05750  0.01783  0.00271     0.24098
  4       0.00547  0.01531  0.02259  0.01530  0.00474  0.00072     0.06413
  5       0.00063  0.00177  0.00261  0.00177  0.00055  0.00008     0.00741
  Marginal
  Probability 0.08525  0.23871  0.35222  0.23861  0.07397  0.01123

TABLE 3.6
Uncorrelated Normal Conditional Probabilities

  X    P[X, Y = 2]   P[Y = 2]   P[X|Y = 2]    P[X]
  0      0.02520      0.35222     0.07155    0.07154
  1      0.08503      0.35222     0.24141    0.24142
  2      0.13191      0.35222     0.37451    0.37451
  3      0.08488      0.35222     0.24099    0.24098
  4      0.02259      0.35222     0.06414    0.06413
  5      0.00261      0.35222     0.00741    0.00741


3.3.2 Conditional Probability and Independence for Continuous Random Variables

The development of conditional probability and independence for continuous random variables follows the same general concepts as discrete random variables. However, constructing the conditional formulation for continuous variables requires some additional mechanics. Let us start by developing the conditional density function.

Definition 3.20. Let X have density f(x). The conditional density of X given a ≤ X ≤ b, denoted by f(x|a ≤ X ≤ b), is defined by

f(x|a ≤ X ≤ b) = \frac{f(x)}{\int_a^b f(x) \, dx}   for a ≤ x ≤ b,
                = 0   otherwise.   (3.49)

Notice that Definition 3.20 defines the conditional probability for a single continuous random variable conditioned on the fact that the random variable is in a specific range (a ≤ X ≤ b). This definition can be expanded slightly by considering any general range of the random variable X (X ∈ S).

Definition 3.21. Let X have the density f(x) and let S be a subset of the real line such that P(X ∈ S) > 0. Then the conditional density of X given X ∈ S, denoted by f(x|S), is defined by

f(x|S) = \frac{f(x)}{P(X ∈ S)}   for x ∈ S,
       = 0   otherwise.   (3.50)
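As a worked illustration of Definition 3.21 (an example constructed for this note, not taken from the text), let X have the standard exponential density f(x) = e^{-x} for x ≥ 0 and let S = {X > 1}. Then P(X ∈ S) = e^{-1}, so f(x|S) = e^{-x}/e^{-1} = e^{-(x-1)} for x > 1. The SymPy sketch below confirms that this conditional density integrates to one.

```python
import sympy as sp

x = sp.symbols('x', positive=True)
f = sp.exp(-x)                               # standard exponential density on [0, oo)

p_s = sp.integrate(f, (x, 1, sp.oo))         # P(X > 1) = exp(-1)
f_cond = f / p_s                             # conditional density f(x | X > 1), valid for x > 1

print(p_s)                                   # exp(-1)
print(sp.integrate(f_cond, (x, 1, sp.oo)))   # 1: the conditional density integrates to one
```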

TABLE 3.7
Correlated Discrete Normal

                                    y                              Marginal
  x          0        1        2        3        4        5      Probability
  0       0.01326  0.02645  0.02632  0.01082  0.00191  0.00014     0.07890
  1       0.02647  0.06965  0.08774  0.04587  0.00991  0.00093     0.24057
  2       0.02603  0.08749  0.13529  0.08733  0.02328  0.00274     0.36216
  3       0.01086  0.04563  0.08711  0.06964  0.02304  0.00334     0.23962
  4       0.00187  0.00984  0.02343  0.02320  0.00950  0.00178     0.06962
  5       0.00013  0.00088  0.00271  0.00332  0.00172  0.00040     0.00916
  Marginal
  Probability 0.07862  0.23994  0.36260  0.24018  0.06936  0.00933

TABLE 3.8
Correlated Normal Conditional Probabilities

  X    P[X, Y = 2]   P[Y = 2]   P[X|Y = 2]    P[X]
  0      0.02632      0.36260     0.07259    0.07890
  1      0.08774      0.36260     0.24197    0.24057
  2      0.13529      0.36260     0.37311    0.36216
  3      0.08711      0.36260     0.24024    0.23962
  4      0.02343      0.36260     0.06462    0.06962
  5      0.00271      0.36260     0.00747    0.00916

To develop the conditional relationship between two continuous random variables (i.e., f(x|y)) using the general approach to conditional density functions presented in Definitions 3.20 and 3.21, we have to define the marginal density (or marginal distribution) of continuous random variables.

Theorem 3.22. Let f(x, y) be the joint density of X and Y and let f(x) be the marginal density of X. Then

f(x) = \int_{-∞}^{∞} f(x, y) \, dy.   (3.51)

Going back to the distribution function from Example 3.11, we have

f(x, y) = x y e^{-(x+y)}.   (3.52)

To prove that this is a proper distribution function, we limit our consideration to non-negative values of x and y (i.e., f(x, y) ≥ 0 if x, y ≥ 0). From our previous discussion it is also obvious that

\int_0^∞ \int_0^∞ f(x, y) \, dx \, dy = \left( \int_0^∞ x e^{-x} dx \right) \left( \int_0^∞ y e^{-y} dy \right)
  = \left( \left[ -x e^{-x} \right]_0^∞ + \int_0^∞ e^{-x} dx \right) \left( \left[ -y e^{-y} \right]_0^∞ + \int_0^∞ e^{-y} dy \right)
  = (0 + 1)(0 + 1) = 1.   (3.53)

Thus, this is a proper density function. The marginal density function for x follows this formulation:

f(x) = \int_0^∞ f(x, y) \, dy = x e^{-x} \int_0^∞ y e^{-y} dy
     = x e^{-x} \left( \left[ -y e^{-y} \right]_0^∞ + \int_0^∞ e^{-y} dy \right)
     = x e^{-x}.   (3.54)
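These integrals can be checked symbolically. The SymPy sketch below (a verification aid added here, not part of the text) confirms that the joint density in Equation 3.52 integrates to one and that the marginal density of X is x e^{-x}, as in Equation 3.54.

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f_xy = x * y * sp.exp(-(x + y))                       # joint density of Equation 3.52

total = sp.integrate(f_xy, (x, 0, sp.oo), (y, 0, sp.oo))
marginal_x = sp.integrate(f_xy, (y, 0, sp.oo))        # Equation 3.51 applied to this density

print(total)                    # 1
print(sp.simplify(marginal_x))  # x*exp(-x), matching Equation 3.54
```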

FIGURE 3.2
Quadratic Probability Density Function.

Example 3.23. Consider the continuous bivariate distribution function

f(x, y) = \frac{3}{2} (x^2 + y^2)   for x, y ∈ [0, 1],   (3.55)

which is depicted graphically in Figure 3.2. First, to confirm that Equation 3.55 is a valid distribution function,

\frac{3}{2} \int_0^1 \int_0^1 (x^2 + y^2) \, dx \, dy = \frac{3}{2} \int_0^1 \left( \frac{1}{3} + y^2 \right) dy = \frac{3}{2} \left( \frac{1}{3} + \frac{1}{3} \right) = 1.   (3.56)

Further, f(x, y) ≥ 0 for all x, y ∈ [0, 1]. To prove this rigorously we would show that f(x, y) attains its minimum at {x, y} = {0, 0} and that the partial derivatives of f(x, y) are non-negative for all x, y ∈ [0, 1].

This example has a characteristic that deserves discussion. Notice that f(1, 1) = 3 > 1; thus, while the axioms of probability require that f(x, y) ≥ 0, the density can assume almost any positive value as long as it integrates to one. Departing from the distribution function in Equation 3.55 briefly, consider the density function g(z) = 2 for z ∈ [0, 1/2]. This is a uniform distribution with a narrower range than the U[0, 1]. It is valid because g(z) ≥ 0 for all z and

\int_0^{1/2} 2 \, dz = 2 \left[ z \right]_0^{1/2} = 2 \left( \frac{1}{2} - 0 \right) = 1.   (3.57)

Hence, even though a density function takes values greater than one, it may still be a valid density function.

Returning to the density function defined in Equation 3.55, we derive the marginal density for x:

f(x) = \int_0^1 \frac{3}{2} (x^2 + y^2) \, dy
     = \frac{3}{2} x^2 \int_0^1 dy + \frac{3}{2} \int_0^1 y^2 \, dy
     = \frac{3}{2} x^2 \left[ y \right]_0^1 + \frac{3}{2} \left[ \frac{1}{3} y^3 \right]_0^1
     = \frac{3}{2} x^2 + \frac{1}{2}.   (3.58)

While the result of Equation 3.58 should be a valid probability density function by definition, it is useful to make sure that the result conforms to the axioms of probability (e.g., it provides a check on your mathematics). First, we note that f(x) ≥ 0 for all x ∈ [0, 1]. Technically, f(x) = 0 if x ∉ [0, 1]. Next, to verify that the probability is one for the entire sample set,

\int_0^1 \left( \frac{3}{2} x^2 + \frac{1}{2} \right) dx = \frac{3}{2} \int_0^1 x^2 \, dx + \frac{1}{2} \int_0^1 dx
  = \frac{3}{2} \left[ \frac{1}{3} x^3 \right]_0^1 + \frac{1}{2} \left[ x \right]_0^1
  = \frac{1}{2} + \frac{1}{2} = 1.   (3.59)

Thus, the marginal distribution function from Equation 3.58 meets the criteria for a probability measure.
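The algebra in Equations 3.56 through 3.59 can likewise be confirmed symbolically. The short SymPy sketch below (again a check on the computations, not part of the text) reproduces the marginal density of Equation 3.58 and verifies that it integrates to one.

```python
import sympy as sp

x, y = sp.symbols('x y')
f_xy = sp.Rational(3, 2) * (x**2 + y**2)          # joint density of Equation 3.55 on [0,1]^2

total = sp.integrate(f_xy, (x, 0, 1), (y, 0, 1))  # Equation 3.56
marginal_x = sp.integrate(f_xy, (y, 0, 1))        # Equation 3.58

print(total)                                 # 1
print(sp.expand(marginal_x))                 # 3*x**2/2 + 1/2
print(sp.integrate(marginal_x, (x, 0, 1)))   # 1, as in Equation 3.59
```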

Next, we consider the bivariate extension of Definition 3.21.

Definition 3.24. Let (X, Y) have the joint density f(x, y) and let S be a subset of the plane which has a shape as in Figure 3.3. We assume that P[(X, Y) ∈ S] > 0. Then the conditional density of X given (X, Y) ∈ S, denoted f(x|S), is defined by

f(x|S) = \frac{\int_{h(x)}^{g(x)} f(x, y) \, dy}{P[(X, Y) ∈ S]}   for a ≤ x ≤ b,
       = 0   otherwise.   (3.60)

FIGURE 3.3
Conditional Distribution for a Region of a Bivariate Uniform Distribution.

Building on Definition 3.24, consider conditioning on the region X < Y for the bivariate uniform distribution as depicted in Figure 3.3.

Example 3.25. Suppose f(x, y) = 1 for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and 0 otherwise. Obtain f(x|X < Y).

f(x|X < Y) = \frac{\int_x^1 dy}{\int_0^1 \int_x^1 dy \, dx} = \frac{[y]_x^1}{\int_0^1 [y]_x^1 \, dx} = \frac{1 - x}{\int_0^1 (1 - x) \, dx} = \frac{1 - x}{\left[ x - \frac{1}{2} x^2 \right]_0^1} = 2(1 - x).   (3.61)

FIGURE 3.4
Conditional Distribution of a Line for a Bivariate Uniform Distribution.

Notice that the downward sloping nature of Equation 3.61 is consistent with the area of the projection in the upper right diagram of Figure 3.3. For small x, each increment of x covers a fairly large area of probability (i.e., the difference 1 - x is large). However, as x increases, this area declines.
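The conditional density 2(1 - x) in Equation 3.61 can also be checked by simulation. The sketch below (a hypothetical Monte Carlo illustration) draws uniform pairs, keeps those satisfying X < Y, and compares the empirical frequency of X ≤ 1/2 with the implied probability ∫_0^{1/2} 2(1 - x) dx = 3/4.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
xy = rng.uniform(size=(n, 2))          # draws from the bivariate uniform on [0,1]^2

x_kept = xy[xy[:, 0] < xy[:, 1], 0]    # keep X values from pairs satisfying X < Y

# P(X <= 1/2 | X < Y) should be close to the integral of 2(1 - x) over [0, 1/2] = 3/4.
print(np.mean(x_kept <= 0.5))          # approximately 0.75
```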

Suppose that we are interested in the probability of X along a linear relationship Y = y_1 + cX. As a starting point, consider the simple bivariate uniform distribution that we have been working with, where f(x, y) = 1. We are interested in the probability of the line in that space presented in Figure 3.4. The conditional probability that X falls into [x_1, x_2] given Y = y_1 + cX is defined by

P(x_1 ≤ X ≤ x_2 | Y = y_1 + cX) = \lim_{y_2 → y_1} P(x_1 ≤ X ≤ x_2 | y_1 + cX ≤ Y ≤ y_2 + cX)   (3.62)

for all x_1, x_2 satisfying x_1 ≤ x_2. Intuitively, as depicted in Figure 3.5, we are going to start by bounding the line on which we want to define the conditional probability (i.e., Y = y_1 + cX ≤ Y = y_1^* + cX ≤ Y = y_2 + cX). Then we are going to shrink the bound by letting y_2 → y_1, leaving the relationship for y_1^*. The conditional density of X given Y = y_1 + cX, denoted by f(x|Y = y_1 + cX), if it exists, is defined to be a function that satisfies

P(x_1 ≤ X ≤ x_2 | Y = y_1 + cX) = \int_{x_1}^{x_2} f(x|Y = y_1 + cX) \, dx.   (3.63)

FIGURE 3.5
Bounding the Conditional Relationship.

In order to complete this proof we will need to use the mean value theorem for integrals.

Theorem 3.26. Let f(x) be a continuous function defined on the closed interval [a, b]. Then there is some number X in that interval (a ≤ X ≤ b) such that

\int_a^b f(x) \, dx = (b - a) f(X)   (3.64)

[48, p. 45].

The intuition for this theorem is demonstrated in Figure 3.6. We don’t know what the value of X is, but at least one X satisfies the equality in Equation 3.64.

Theorem 3.27. The conditional density f(x|Y = y_1 + cX) exists and is given by

f(x|Y = y_1 + cX) = \frac{f(x, y_1 + cx)}{\int_{-∞}^{∞} f(x, y_1 + cx) \, dx}   (3.65)

provided the denominator is positive.

FIGURE 3.6
Mean Value of Integral.

Proof. We have

\lim_{y_2 → y_1} P(x_1 ≤ X ≤ x_2 | y_1 + cX ≤ Y ≤ y_2 + cX) = \lim_{y_2 → y_1} \frac{\int_{x_1}^{x_2} \int_{y_1+cx}^{y_2+cx} f(x, y) \, dy \, dx}{\int_{-∞}^{∞} \int_{y_1+cx}^{y_2+cx} f(x, y) \, dy \, dx}.   (3.66)

Thus, by the mean value theorem for integrals,

\lim_{y_2 → y_1} \frac{\int_{x_1}^{x_2} \int_{y_1+cx}^{y_2+cx} f(x, y) \, dy \, dx}{\int_{-∞}^{∞} \int_{y_1+cx}^{y_2+cx} f(x, y) \, dy \, dx} = \lim_{y_2 → y_1} \frac{\int_{x_1}^{x_2} f(x, y^* + cx) \, dx}{\int_{-∞}^{∞} f(x, y^* + cx) \, dx},   (3.67)

where y_1 ≤ y^* ≤ y_2. As y_2 → y_1, y^* → y_1; hence

\lim_{y_2 → y_1} \frac{\int_{x_1}^{x_2} f(x, y^* + cx) \, dx}{\int_{-∞}^{∞} f(x, y^* + cx) \, dx} = \frac{\int_{x_1}^{x_2} f(x, y_1 + cx) \, dx}{\int_{-∞}^{∞} f(x, y_1 + cx) \, dx}.   (3.68)

The transition between Equation 3.67 and Equation 3.68 starts by applying the mean value theorem: for some y^* ∈ [y_1, y_2],

\int_{y_1+cx}^{y_2+cx} f(x, y) \, dy = (y_2 - y_1) f(x, y^* + cx),   (3.69)

and the factor (y_2 - y_1) cancels between the numerator and denominator of Equation 3.66. Thus, if we take the limit such that y_2 → y_1 with y_1 ≤ y^* ≤ y_2, then y^* → y_1 and

f(x, y^* + cx) → f(x, y_1 + cx).   (3.70)

Finally, we consider the conditional probability of X given that Y is restricted to a single point.

Theorem 3.28. The conditional density of X given Y = y_1, denoted by f(x|y_1), is given by

f(x|y_1) = \frac{f(x, y_1)}{f(y_1)}.   (3.71)

Note that a formal statement of Theorem 3.28 could follow Theorem 3.27, applying the mean value theorem for integrals to a range of X.

One would anticipate that continuous formulations of independence could follow the discrete formulation such that we attempt to show that f(x) = f(x|y). However, independence for continuous random variables simply relates to the separability of the joint distribution function.

Definition 3.29. Continuous random variables X and Y are said to be independent if f(x, y) = f(x) f(y) for all x and y.

Again returning to Example 3.11,

f(x, y) = x y \exp[-(x + y)] = (x \exp[-x]) (y \exp[-y]).   (3.72)

Hence, X and Y are independent. In addition, the joint uniform distribution function is independent because f(x, y) = 1 = g(x) h(y), where g(x) = h(y) = 1. This simplistic definition of independence can be easily extended to T random variables.
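A quick symbolic check of Definition 3.29 (an illustration added here, not from the text) compares the joint density of Equation 3.72 with the product of its marginal densities.

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f_xy = x * y * sp.exp(-(x + y))                 # joint density from Example 3.11

f_x = sp.integrate(f_xy, (y, 0, sp.oo))         # marginal of X: x*exp(-x)
f_y = sp.integrate(f_xy, (x, 0, sp.oo))         # marginal of Y: y*exp(-y)

# Independence per Definition 3.29: the joint density equals the product of the marginals.
print(sp.simplify(f_xy - f_x * f_y) == 0)       # True
```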

Definition 3.30. A finite set of continuous random variables X, Y, Z, · · · are said to be mutually independent if

f(x, y, z, · · ·) = g(x) h(y) i(z) · · · .   (3.73)

A slightly more rigorous statement of independence for bivariate continuous random variables is presented in Theorem 3.31.

Theorem 3.31. Let S be a subset of the plane such that f(x, y) > 0 over S and f(x, y) = 0 outside of S. Then X and Y are independent if and only if S is a rectangle (allowing -∞ or ∞ to be an end point) with sides parallel to the axes and f(x, y) = g(x) h(y) over S, where g(x) and h(y) are some functions of x and y, respectively. Note that g(x) = c f(x) for some c and h(y) = c^{-1} f(y).
