Probabilistic Methods for Financial and Marketing Informatics
Morgan Kaufmann Publishers is an imprint of Elsevier.
500 Sansome Street, Suite 400, San Francisco, CA 94111

This book is printed on acid-free paper.

© 2007 by Elsevier Inc. All rights reserved.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, scanning, or otherwise) without prior written permission of the publisher. Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.com. You may also complete your request online via the Elsevier homepage (http://elsevier.com), by selecting "Support Contact," then "Copyright and Permission," and then "Obtaining Permissions."
Preface
This book is based on a course I recently developed for computer science majors at Northeastern Illinois University (NEIU). The motivation for developing this course came from guidance I obtained from the NEIU Computer Science Department Advisory Board. One objective of this Board is to advise the Department concerning the maintenance of curricula that are relevant to the needs of companies in Chicagoland. The Board consists of individuals in IT departments from major companies such as Walgreen's, AON Company, United Airlines, Harris Bank, and Microsoft. After the dot-com bust and the introduction of outsourcing, it became evident that students trained only in the fundamentals of computer science, programming, web design, etc., often did not have the skills to compete in the current U.S. job market. So I asked the Advisory Board what else the students should know. The board unanimously felt the students needed business skills such as knowledge of IT project management, marketing, and finance. As a result, our revised curriculum, for students who hoped to obtain employment immediately following graduation, contained a number of business courses. However, several members of the board said they'd like to see students equipped with knowledge of cutting-edge applications of computer science to areas such as decision analysis, risk management, data mining, and market basket analysis. I realized that some of the best work in these areas was being done in my own field, namely Bayesian networks. After consulting with colleagues worldwide and checking on topics taught in similar programs at other universities, I decided it was time for a course on applying probabilistic reasoning to business problems. So my new course, called "Informatics for MIS Students," and this book, Probabilistic Methods for Financial and Marketing Informatics, were conceived.

Part I covers the basics of Bayesian networks and decision analysis. Much of this material is based on my 2004 book Learning Bayesian Networks. However, I've tried to make the material more accessible. Rather than dwelling on rigor, algorithms, and proofs of theorems, I concentrate on showing examples and using the software package Netica to represent and solve problems. The specific content of Part I is as follows: Chapter 1 provides a definition of informatics and probabilistic informatics. Chapter 2 reviews the probability and statistics needed to understand the remainder of the book. Chapter 3 presents Bayesian networks and inference in Bayesian networks. Chapter 4 concerns learning Bayesian networks from data. Chapter 5 introduces decision analysis.
The material in these sections seems inherently more difficult than most of the other material in the book. However, they do not require as background the material from Part I that is marked with a star (*). Chapter 8 discusses modeling real options, which concerns decisions a company must make as to what projects it should pursue. Chapter 9 covers venture capital decision making, which is the process of deciding whether to invest money in a start-up company. Chapter 10 discusses a model for bankruptcy prediction.

Part III contains chapters on two important areas of marketing. First, Chapter 11 shows methods for doing collaborative filtering and market basket analysis. These disciplines concern determining what products an individual might prefer based on how the individual feels about other products. Finally, Chapter 12 presents a technique for doing targeted advertising, which is the process of identifying those customers to whom advertisements should be sent.

There is too much material for me to cover the entire book in a one-semester course at NEIU. Since the course requires discrete mathematics and business statistics as prerequisites, I only review most of the material in Chapter 2. However, I do discuss conditional independence in depth because ordinarily the students have not been exposed to this concept. I then cover the following sections from the remainder of the book:
The course is titled "Informatics for MIS Students" and is a required course in the MIS (Management Information Science) concentration of NEIU's Computer Science M.S. Degree Program. This book should be appropriate for any similar course in an MIS, computer science, business, or MBA program. It is intended for upper-level undergraduate and graduate students. Besides having taken one or two courses covering basic probability and statistics, it would be useful but not necessary for the student to have studied data structures. Part I of the book could also be used for the first part of any course involving probabilistic reasoning using Bayesian networks. That is, although many of the examples in Part I concern the stock market and applications to business problems, I've presented the material in a general way. Therefore, an instructor could use Part I to cover basic concepts and then provide papers relative to a particular domain of interest. For example, if the course is "Probabilistic Methods for Medical Informatics," the instructor could cover Part I of this book and then provide papers concerning applications in the medical domain.

For the most part, the applications discussed in Part II were the results of research done at the School of Business of the University of Kansas, while the applications in Part III were the results of research done by the Machine Learning and Applied Statistics Group of Microsoft Research. The reason is not that I have any particular affiliations with either of these institutions. Rather, I did an extensive search for financial and marketing applications, and the ones I found that seemed to be most carefully designed and evaluated came from these institutions.
I thank Catherine Shenoy for reviewing the chapter on investment science, and Dawn Holmes, Francisco Javier Díez, and Padmini Jyotishmati for reviewing the entire book. They all offered many useful comments and criticisms. I thank Prakash Shenoy and Edwin Burmeister for correspondence concerning some of the content of the book. I thank my co-author, Xia Jiang, for giving me the idea to write this book in the first place, and for her efforts on the book itself. Finally, I thank Prentice Hall for granting me permission to reprint material.

Rich Neapolitan
RE-Neapolitan@neiu.edu
Contents
I Bayesian Networks and Decision Analysis

1 Probabilistic Informatics
  1.1 What Is Informatics?
  1.2 Probabilistic Informatics
  1.3 Outline of This Book

2 Probability and Statistics
  2.1 Probability Basics
    2.1.1 Probability Spaces
    2.1.2 Conditional Probability and Independence
    2.1.3 Bayes' Theorem
  2.2 Random Variables
    2.2.1 Probability Distributions of Random Variables
    2.2.2 Independence of Random Variables
  2.3 The Meaning of Probability
    2.3.1 Relative Frequency Approach to Probability
    2.3.2 Subjective Approach to Probability
  2.4 Random Variables in Applications
  2.5 Statistical Concepts
    2.5.1 Expected Value
    2.5.2 Variance and Covariance
    2.5.3 Linear Regression

3 Bayesian Networks
  3.1 What Is a Bayesian Network?
  3.2 Properties of Bayesian Networks
    3.2.1 Definition of a Bayesian Network
    3.2.2 Representation of a Bayesian Network
  3.3 Causal Networks as Bayesian Networks
    3.3.1 Causality
    3.3.2 Causality and the Markov Condition
    3.3.3 The Markov Condition without Causality
  3.4 Inference in Bayesian Networks
    3.4.1 Examples of Inference
    3.4.2 Inference Algorithms and Packages
    3.4.3 Inference Using Netica
  3.5 How Do We Obtain the Probabilities?
    3.5.1 The Noisy OR-Gate Model
    3.5.2 Methods for Discretizing Continuous Variables *
  3.6 Entailed Conditional Independencies *
    3.6.1 Examples of Entailed Conditional Independencies
    3.6.2 d-Separation
    3.6.3 Faithful and Unfaithful Probability Distributions
    3.6.4 Markov Blankets and Boundaries

4 Learning Bayesian Networks
  4.1 Parameter Learning
    4.1.1 Learning a Single Parameter
    4.1.2 Learning All Parameters in a Bayesian Network
  4.2 Learning Structure (Model Selection)
  4.3 Score-Based Structure Learning *
    4.3.1 Learning Structure Using the Bayesian Score
    4.3.2 Model Averaging
  4.4 Constraint-Based Structure Learning
    4.4.1 Learning a DAG Faithful to P
    4.4.2 Learning a DAG in Which P Is Embedded Faithfully *
  4.5 Causal Learning
    4.5.1 Causal Faithfulness Assumption
    4.5.2 Causal Embedded Faithfulness Assumption *
  4.6 Software Packages for Learning
  4.7 Examples of Learning
    4.7.1 Learning Bayesian Networks
    4.7.2 Causal Learning

5 Decision Analysis Fundamentals
  5.1 Decision Trees
    5.1.1 Simple Examples
    5.1.2 Solving More Complex Decision Trees
  5.2 Influence Diagrams
    5.2.1 Representing with Influence Diagrams
    5.2.2 Solving Influence Diagrams
    5.2.3 Techniques for Solving Influence Diagrams *
    5.2.4 Solving Influence Diagrams Using Netica
  5.3 Dynamic Networks *
    5.3.1 Dynamic Bayesian Networks
    5.3.2 Dynamic Influence Diagrams

6 Further Techniques in Decision Analysis
  6.1 Modeling Risk Preferences
    6.1.1 The Exponential Utility Function
    6.1.2 A Decreasing Risk-Averse Utility Function
  6.2 Analyzing Risk Directly
    6.2.1 Using the Variance to Measure Risk
    6.2.2 Risk Profiles
  6.3 Dominance
    6.3.1 Deterministic Dominance
    6.3.2 Stochastic Dominance
    6.3.3 Good Decision versus Good Outcome
  6.4 Sensitivity Analysis
    6.4.1 Simple Models
    6.4.2 A More Detailed Model
  6.5 Value of Information
    6.5.1 Expected Value of Perfect Information
    6.5.2 Expected Value of Imperfect Information
  6.6 Normative Decision Analysis

II Financial Applications

7 Investment Science
  7.1 Basics of Investment Science
    7.1.1 Interest
    7.1.2 Net Present Value
    7.1.3 Stocks
    7.1.4 Portfolios
    7.1.5 The Market Portfolio
    7.1.6 Market Indices
  7.2 Advanced Topics in Investment Science *
    7.2.1 Mean-Variance Portfolio Theory
    7.2.2 Market Efficiency and CAPM
    7.2.3 Factor Models and APT
    7.2.4 Equity Valuation Models
  7.3 A Bayesian Network Portfolio Risk Analyzer *
    7.3.1 Network Structure
    7.3.2 Network Parameters
    7.3.3 The Portfolio Value and Adding Evidence

8 Modeling Real Options
  8.1 Solving Real Options Decision Problems
  8.2 Making a Plan
  8.3 Sensitivity Analysis

9 Venture Capital Decision Making
  9.1 A Simple VC Decision Model
  9.2 A Detailed VC Decision Model
  9.3 Modeling Real Decisions
  9.A Appendix

10 Bankruptcy Prediction
  10.1 A Bayesian Network for Predicting Bankruptcy
    10.1.1 Naive Bayesian Networks
    10.1.2 Constructing the Bankruptcy Prediction Network
  10.2 Experiments
    10.2.1 Method
    10.2.2 Results
    10.2.3 Discussion

III Marketing Applications

11 Collaborative Filtering
  11.1 Memory-Based Methods
  11.2 Model-Based Methods
    11.2.1 Probabilistic Collaborative Filtering
    11.2.2 A Cluster Model
    11.2.3 A Bayesian Network Model
  11.3 Experiments
    11.3.1 The Data Sets
    11.3.2 Method
    11.3.3 Results

12 Targeted Advertising
  12.1 Class Probability Trees
  12.2 Application to Targeted Advertising
    12.2.1 Calculating Expected Lift in Profit
    12.2.2 Identifying Subpopulations with Positive ELPs
    12.2.3 Experiments
About the Authors

Richard E. Neapolitan is Professor and Chair of Computer Science at Northeastern Illinois University. He has previously written three books, including the seminal 1990 Bayesian network text Probabilistic Reasoning in Expert Systems. More recently, he wrote the 2004 text Learning Bayesian Networks, and Foundations of Algorithms, which is one of the most widely used algorithms texts worldwide. His books have the reputation of making difficult concepts easy to understand because of the logical flow of the material, the simplicity of the explanations, and the clear examples.

Xia Jiang received an M.S. in Mechanical Engineering from Rose Hulman University and is currently a Ph.D. candidate in the Biomedical Informatics Program at the University of Pittsburgh. She has published theoretical papers concerning Bayesian networks, along with applications of Bayesian networks to biosurveillance.
Part I

Bayesian Networks and Decision Analysis
Chapter 1

Probabilistic Informatics
Informatics programs in the United States go back at least to the 1980s, when Stanford University offered a Ph.D. in medical informatics. Since that time, a number of informatics programs in other disciplines have emerged at universities throughout the United States. These programs go by various names, including bioinformatics, medical informatics, chemical informatics, music informatics, marketing informatics, etc. What do these programs have in common? To answer that question we must articulate what we mean by the term "informatics." Since other disciplines are usually referenced when we discuss informatics, some define informatics as the application of information technology in the context of another field. However, such a definition does not really tell us the focus of informatics itself. First, we explain what we mean by the term informatics. Then we discuss why we have chosen to concentrate on the probabilistic approach in this book. Finally, we provide an outline of the material that will be covered in the rest of the book.
1.1 What Is Informatics?
In much of western Europe, informatics has come to mean the rough translation of the English "computer science," which is the discipline that studies computable processes. Certainly, there is overlap between computer science programs and informatics programs, but they are not the same. Informatics programs ordinarily investigate subjects such as biology and medicine, whereas computer science programs do not. So the European definition does not suffice for the way the word is currently used in the United States.

To gain insight into the meaning of informatics, let us consider the suffix "-ics," which means the science, art, or study of some entity. For example, "linguistics" is the study of the nature of language, "economics" is the study of the production and distribution of goods, and "photonics" is the study of electromagnetic energy whose basic unit is the photon. Given this, informatics should be the study of information. Indeed, WordNet 2.1 defines informatics as "the science concerned with gathering, manipulating, storing, retrieving and classifying recorded information." To proceed from this definition we need to define the word "information." Most dictionary definitions do not help as far as giving us anything concrete. That is, they define information either as knowledge or as a collection of data, which means we are left with the situation of determining the meaning of knowledge and data. To arrive at a concrete definition of informatics, let's define data, information, and knowledge first.

By datum we mean a character string that can be recognized as a unit. For example, the nucleotide G in the nucleotide sequence GATC is a datum, the field "cancer" in a record in a medical data base is a datum, and the field "Gone with the Wind" in a movie data base is a datum. Note that a single character, a word, or a group of words can be a datum depending on the particular application. Data then are more than one datum. By information we mean the meaning given to data. For example, in a medical data base the data "Joe Smith" and "cancer" in the same record mean that Joe Smith has cancer. By knowledge we mean dicta which enable us to infer new information from existing information. For example, suppose we have the following item of knowledge (dictum):¹

Finally, we define informatics as the discipline that applies the methodologies of science and engineering to information. It concerns organizing data into information, learning knowledge from information, learning new information from existing information and knowledge, and making decisions based on the knowledge and information learned. We use engineering to develop the algorithms that learn knowledge from information and that learn information from information and knowledge. We use science to test the accuracy of these algorithms.

Next, we show several examples that illustrate how informatics pertains to other disciplines.

¹Such an item of knowledge would be part of a rule-based expert system.
Example 1.1 (medical informatics) Suppose we have a large data file of patient records as follows:

[table of patient records with yes/no entries]

From this information we can use the methodologies of informatics to obtain the new information that "there is a 5% chance Joe Smith also has lung cancer."
Example 1.2 (bioinformatics) Suppose we have long homologous DNA sequences from the human, the chimpanzee, the gorilla, the orangutan, and the rhesus monkey. From this information we can use the methodologies of informatics to obtain the new information that it is most probable that the human and the chimpanzee are the most closely related of the five species.
Example 1.3 (marketing informatics) Suppose we have a large data file of movie ratings as follows:

[table of ratings (1-5) by Persons 1 through 5 for Aviator, Shall We Dance, Dirty Dancing, and Vanity Fair]

This means, for example, that Person 1 rated Aviator the lowest (1) and Shall We Dance the highest (5). From the information in this data file, we can develop a knowledge system that will enable us to estimate how an individual will rate a particular movie. For example, suppose Kathy Black rates Aviator as 1, Shall We Dance as 5, and Dirty Dancing as 5. The system could estimate how Kathy will rate Vanity Fair. Just by eyeballing the data in the five records shown, we see that Kathy's ratings on the first three movies are similar to those of Persons 1, 4, and 5. Since they all rated Vanity Fair high, based on these five records, we would suspect Kathy would rate it high. An informatics algorithm can formalize a way to make these predictions. This task of predicting the utility of an item to a particular user based on the utilities assigned by other users is called collaborative filtering.
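To make the eyeballing concrete, here is a minimal Python sketch (not from the book) of the kind of memory-based prediction just described. The ratings for Persons 2 through 5 are invented placeholders, since the book's table is not reproduced here, and the simple similarity-weighted average is an illustrative choice rather than the method developed in Chapter 11.

```python
# A toy collaborative filter: predict Kathy's rating of "Vanity Fair" from
# similarity-weighted ratings of other users.  All numbers are hypothetical.
ratings = {
    "Person 1": {"Aviator": 1, "Shall We Dance": 5, "Dirty Dancing": 5, "Vanity Fair": 5},
    "Person 2": {"Aviator": 5, "Shall We Dance": 1, "Dirty Dancing": 2, "Vanity Fair": 1},
    "Person 3": {"Aviator": 4, "Shall We Dance": 2, "Dirty Dancing": 1, "Vanity Fair": 2},
    "Person 4": {"Aviator": 2, "Shall We Dance": 5, "Dirty Dancing": 4, "Vanity Fair": 5},
    "Person 5": {"Aviator": 1, "Shall We Dance": 4, "Dirty Dancing": 5, "Vanity Fair": 4},
}
kathy = {"Aviator": 1, "Shall We Dance": 5, "Dirty Dancing": 5}

def similarity(a, b, items):
    """Crude similarity: large when the two users' ratings on shared items are close."""
    mean_abs_diff = sum(abs(a[i] - b[i]) for i in items) / len(items)
    return 1.0 / (1.0 + mean_abs_diff)

def predict(target, others, item):
    """Similarity-weighted average of the other users' ratings on 'item'."""
    shared = list(target.keys())
    weights = {name: similarity(target, r, shared) for name, r in others.items()}
    total = sum(weights.values())
    return sum(w * others[name][item] for name, w in weights.items()) / total

print(round(predict(kathy, ratings, "Vanity Fair"), 2))   # about 4.2: a high rating
```

Users most like Kathy (Persons 1, 4, and 5) dominate the weighted average, so the prediction comes out high, matching the conclusion reached above by inspection.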
In this book we concentrate on two related areas of informatics, namely financial informatics and marketing informatics. Financial informatics involves applying the methods of informatics to the management of money and other assets. In particular, it concerns determining the risk involved in some financial venture. As an example, we might develop a tool to improve portfolio risk analysis. Marketing informatics involves applying the methods of informatics to promoting and selling products or services. For example, we might determine which advertisements should be presented to a given Web user based on that user's navigation pattern.

Before ending this section, let's discuss the relationship between informatics and the relatively new expression "data mining." The term data mining can be traced back to the First International Conference on Knowledge Discovery and Data Mining (KDD-95) in 1995. Briefly, data mining is the process of extrapolating unknown knowledge from a large amount of observational data. Recall that we said informatics concerns (1) organizing data into information, (2) learning knowledge from information, (3) learning new information from existing information and knowledge, and (4) making decisions based on the knowledge and information learned. So, technically speaking, data mining is a subfield of informatics that includes only the first two of these procedures. However, both terms are still evolving, and some individuals use data mining to refer to all four procedures.
1.2 Probabilistic Informatics
As can be seen in Examples 1.1, 1.2, and 1.3, the knowledge we use to process information often does not consist of IF-THEN rules, such as the one concerning plants discussed earlier. Rather, we only know relationships such as "smoking makes lung cancer more likely." Similarly, our conclusions are uncertain. For example, we feel it is most likely that the closest living relative of the human is the chimpanzee, but we are not certain of this. So ordinarily we must reason under uncertainty when handling information and knowledge. In the 1960s and 1970s a number of new formalisms for handling uncertainty were developed, including certainty factors, the Dempster-Shafer Theory of Evidence, fuzzy logic, and fuzzy set theory.² Probability theory has a long history of representing uncertainty in a formal axiomatic way. Neapolitan [1990] contrasts the various approaches. A heuristic algorithm relies on commonsense rules of thumb rather than on a formal model, so it does not enable us to prove results based on assumptions concerning a system. An example of a heuristic algorithm is the one developed for collaborative filtering in Chapter 11, Section 11.1.

An abstract model is a theoretical construct that represents a physical process with a set of variables and a set of quantitative relationships (axioms) among them. We use models so we can reason within an idealized framework and thereby make predictions/determinations about a system. We can mathematically prove these predictions/determinations are "correct," but they are correct only to the extent that the model accurately represents the system. A model-based algorithm therefore makes predictions/determinations within the framework of some model. Algorithms that make predictions/determinations within the framework of probability theory are model-based algorithms. We can prove results concerning these algorithms based on the axioms of probability theory, which are discussed in Chapter 2. We concentrate on such algorithms in this book. In particular, we present algorithms that use Bayesian networks to reason within the framework of probability theory.
1.3 Outline of This Book
In Part I we cover the basics of Bayesian networks and decision analysis. Chapter 2 reviews the probability and statistics necessary to understand the remainder of the book. In Chapter 3 we present Bayesian networks, which are graphical structures that represent the probabilistic relationships among many related variables. Bayesian networks have become one of the most prominent architectures for representing multivariate probability distributions and enabling probabilistic inference using such distributions. Chapter 4 shows how we can learn Bayesian networks from data. A Bayesian network augmented with a value node and decision nodes is called an influence diagram. We can use an influence diagram to recommend a decision based on the uncertain relationships among the variables and the preferences of the user. The field that investigates such decisions is called decision analysis. Chapter 5 introduces decision analysis, while Chapter 6 covers further topics in decision analysis. Once you have completed Part I, you should have a basic understanding of how Bayesian networks and decision analysis can be used to represent and solve real-world problems.

Parts II and III then cover applications to specific problems. Part II covers financial applications. Specifically, Chapter 7 presents the basics of investment science and develops a Bayesian network for portfolio risk analysis. In Chapter 8 we discuss the modeling of real options, which concerns decisions a company must make as to what projects it should pursue. Chapter 9 covers venture capital decision making, which is the process of deciding whether to invest money in a start-up company. In Chapter 10 we show an application to bankruptcy prediction. Finally, Part III contains chapters on two of the most important areas of marketing. First, Chapter 11 shows an application to collaborative filtering/market basket analysis. These disciplines concern determining what products an individual might prefer based on how the user feels about other products. Second, Chapter 12 presents an application to targeted advertising, which is the process of identifying those customers to whom advertisements should be sent.

²Fuzzy set theory and fuzzy logic model a different class of problems than probability theory and therefore complement probability theory rather than compete with it. See [Zadeh, 1995] or [Neapolitan, 1992].
Chapter 2

Probability and Statistics
This chapter reviews the probability and statistics you need to read the remainder of this book. In Section 2.1 we present the basics of probability theory, while in Section 2.2 we review random variables. Section 2.3 briefly discusses the meaning of probability. In Section 2.4 we show how random variables are used in practice. Finally, Section 2.5 reviews concepts in statistics, such as expected value, variance, and covariance.
2.1 Probability Basics
After defining probability spaces, we discuss conditional probability, independence and conditional independence, and Bayes' Theorem.
2.1.1 Probability Spaces

The set of all outcomes is called the sample space or population. Mathematicians ordinarily say "sample space," while social scientists ordinarily say "population." We will say sample space. In this simple review we assume the sample space is finite. Any subset of a sample space is called an event. A subset containing exactly one element is called an elementary event.

Example 2.1 Suppose we have the experiment of drawing the top card from an ordinary deck of cards. Then the set

E = {jack of hearts, jack of clubs, jack of spades, jack of diamonds}

is an event, and the set {jack of hearts} is an elementary event.

Definition 2.1 Suppose we have a sample space Ω containing n distinct elements; that is,

Ω = {e1, e2, ..., en}.

A function that assigns a real number P(E) to each event E ⊆ Ω is called a probability function on the set of subsets of Ω if it satisfies the following conditions:

1. 0 ≤ P(ei) ≤ 1 for 1 ≤ i ≤ n.
2. P(e1) + P(e2) + ... + P(en) = 1.
3. For each event E that is not an elementary event, P(E) is the sum of the probabilities of the elementary events whose outcomes are in E. For example, if

E = {e3, e6, e8}

then

P(E) = P(e3) + P(e6) + P(e8).

The pair (Ω, P) is called a probability space.
Because probability is defined as a function whose domain is a set of sets, we should write P({ei}) instead of P(ei) when denoting the probability of an elementary event. However, for the sake of simplicity, we do not do this. In the same way, we write P(e3, e6, e8) instead of P({e3, e6, e8}).

The most straightforward way to assign probabilities is to use the Principle of Indifference, which says that outcomes are to be considered equiprobable if we have no reason to expect one over the other. According to this principle, when there are n elementary events, each has probability equal to 1/n.

Example 2.2 Let the experiment be tossing a coin. Then the sample space is Ω = {heads, tails}, and, according to the Principle of Indifference, we assign

P(heads) = P(tails) = .5.

We stress that there is nothing in the definition of a probability space that says we must assign the value of .5 to the probabilities of heads and tails. We could assign P(heads) = .7 and P(tails) = .3. However, if we have no reason to expect one outcome over the other, we give them the same probability.

Example 2.3 Let the experiment be drawing the top card from a deck of 52 cards. Then Ω contains the faces of the 52 cards, and, according to the Principle of Indifference, we assign P(e) = 1/52 for each e ∈ Ω. For example,

P(jack of hearts) = 1/52.

The event

E = {jack of hearts, jack of clubs, jack of spades, jack of diamonds}

means that the card drawn is a jack. Its probability is

P(E) = P(jack of hearts) + P(jack of clubs) + P(jack of spades) + P(jack of diamonds).

We have Theorem 2.1 concerning probability spaces. Its proof is left as an exercise.
T h e o r e m 2.1 Let (f~, P) be a probability space Then
1 P ( f ~ ) - 1
2 0 _< P(E) _< 1 for every E C_ s
3 For every two subsets E and F of f~ such that E n F - 0,
P(E U F) - P(E) + P(F),
where 0 denotes the empty set
Example 2.4 Suppose we draw the top card from a deck of cards Denote by
Queen the set containing the 4 queens and by King the set containing the 4 kings Then
1 1 2
P(Queen U King) - P(Queen) + P(King) = ~ + 13 - I~
because Queen N King = ~ Next denote by Spade the set containing the 13 spades The sets Queen and Spade are not disjoint; so their probabilities are not additive However, it is not hard to prove that, in general,
P(E U F) - P(E) + P(F) - P(E n F)
So
P(Queen U Spade) P(Queen) + P(Spade) - P(Queen n Spade)
13 4 52 13
2.1.2 Conditional Probability and I n d e p e n d e n c e
We start with a definition
Definition 2.2 Let E and F be events such that P(F) ~ 0 Then the c o n d i -
t i o n a l p r o b a b i l i t y of E given F, denoted P(E[F), is given by
P(EN[)
P(EIF)- P(F) "
We can gain intuition for this definition by considering probabilities that are assigned using the Principle of Indifference In this case, P(E[F), as defined above, is the ratio of the number of items in E N F to the number of items in F
We show this as follows Let n be the number of items in the sample space, nF
be the number of items in F, and nEF be the number of items in E N F Then
P (E n F) nEF / n nEF
P(F) n F / n nF '
which is the ratio of the number of items in E n F to the number of items in
F As far as the meaning is concerned, P(EIF ) is our belief that E contains the outcome given that we know F contains the outcome
Trang 262.1 P R O B A B I L I T Y B A S I C S 13
Example 2.5 Again, consider drawing the top card from a deck of cards. Let Jack be the set of the 4 jacks, RedRoyalCard be the set of the 6 red royal cards, and Club be the set of the 13 clubs. Then

P(Jack) = 4/52 = 1/13

P(Jack|RedRoyalCard) = P(Jack ∩ RedRoyalCard) / P(RedRoyalCard) = (2/52) / (6/52) = 1/3

P(Jack|Club) = P(Jack ∩ Club) / P(Club) = (1/52) / (13/52) = 1/13.
Notice in the previous example that P(Jack|Club) = P(Jack). This means that finding out the card is a club does not change the likelihood that it is a jack. We say that the two events are independent in this case, which is formalized in the following definition:

Definition 2.3 Two events E and F are independent if one of the following holds:

1. P(E|F) = P(E) and P(E) ≠ 0, P(F) ≠ 0.
2. P(E) = 0 or P(F) = 0.

It is not hard to show that E and F are independent if and only if P(E ∩ F) = P(E)P(F).
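As a quick check (an illustrative sketch, not from the book), the following Python code verifies the numbers in Example 2.5 and both characterizations of the independence of Jack and Club.

```python
from fractions import Fraction
from itertools import product

ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "jack", "queen", "king", "ace"]
suits = ["hearts", "clubs", "spades", "diamonds"]
deck = set(product(ranks, suits))

def prob(event):
    return Fraction(len(event), len(deck))

def cond_prob(e, f):
    """P(E | F) = P(E and F) / P(F), as in Definition 2.2."""
    return prob(e & f) / prob(f)

jack = {c for c in deck if c[0] == "jack"}
club = {c for c in deck if c[1] == "clubs"}
red_royal = {c for c in deck
             if c[0] in ("jack", "queen", "king") and c[1] in ("hearts", "diamonds")}

print(cond_prob(jack, red_royal))                      # 1/3
print(cond_prob(jack, club), prob(jack))               # both 1/13
print(prob(jack & club) == prob(jack) * prob(club))    # True: the product form holds
```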
If you've previously studied probability, you should have already been introduced to the concept of independence. However, a generalization of independence, called conditional independence, is not covered in many introductory texts. This concept is important to the applications discussed in this book. We discuss it next.

Definition 2.4 Two events E and F are conditionally independent given G if P(G) ≠ 0 and one of the following holds:

1. P(E|F ∩ G) = P(E|G) and P(E|G) ≠ 0, P(F|G) ≠ 0.
2. P(E|G) = 0 or P(F|G) = 0.

Notice that this definition is identical to the definition of independence except that everything is conditional on G. The definition entails that E and F are independent once we know that the outcome is in G. The next example illustrates this.
Trang 27E x a m p l e 2.6 Let ~ be the set of all objects in Figure 2.1 Using the Principle
of Indifference, we assign a probability of 1/13 to each object Let Black be the set of all black objects, White be the set of all white objects, Square be the set
of all square objects, and A be the set os all objects containing an "A." We then have
5
P(A) = 1 -3
3 P(AlSquare ) = g
So A and Square are not independent However,
P(AIBlack ) = P(AlSquare F1 Black) =
1
P(AlSquarerqWhite) = ~
So A and Square are also conditionally independent given White
Next, we discuss an important rule involving conditional probabilities. Suppose we have n events E1, E2, ..., En such that

Ei ∩ Ej = ∅ for i ≠ j

and

E1 ∪ E2 ∪ ... ∪ En = Ω.

Such events are called mutually exclusive and exhaustive. Then the law of total probability says that for any other event F,

P(F) = P(F ∩ E1) + P(F ∩ E2) + ... + P(F ∩ En).     (2.1)

You are asked to prove this rule in the exercises. If P(Ei) ≠ 0, then P(F ∩ Ei) = P(F|Ei)P(Ei). Therefore, if P(Ei) ≠ 0 for all i, the law is often applied in the following form:

P(F) = P(F|E1)P(E1) + P(F|E2)P(E2) + ... + P(F|En)P(En).     (2.2)

Example 2.7 Suppose we have the objects discussed in Example 2.6. Then, according to the law of total probability,

P(A) = P(A|Black)P(Black) + P(A|White)P(White).
Theorem 2.2 (Bayes' Theorem) Given two events E and F such that P(E) ≠ 0 and P(F) ≠ 0, we have

P(E|F) = P(F|E)P(E) / P(F).     (2.3)

Furthermore, given n mutually exclusive and exhaustive events E1, E2, ..., En such that P(Ei) ≠ 0 for all i, we have for 1 ≤ i ≤ n,

P(Ei|F) = P(F|Ei)P(Ei) / [P(F|E1)P(E1) + P(F|E2)P(E2) + ... + P(F|En)P(En)].     (2.4)

Proof. To obtain Equality 2.3, we first use the definition of conditional probability as follows:

P(E|F) = P(E ∩ F) / P(F)   and   P(F|E) = P(F ∩ E) / P(E).

Next we multiply each of these equalities by the denominator on its right side to show that

P(E|F)P(F) = P(F|E)P(E)

because they both equal P(E ∩ F). Finally, we divide this last equality by P(F) to obtain our result.

To obtain Equality 2.4, we place the expression for P(F), obtained using the rule of total probability (Equality 2.2), in the denominator of Equality 2.3.

Both of the formulas in the preceding theorem are called Bayes' Theorem because the original version was developed by Thomas Bayes (published in 1763). The first enables us to compute P(E|F) if we know P(F|E), P(E), and P(F), while the second enables us to compute P(Ei|F) if we know P(F|Ej) and P(Ej) for 1 ≤ j ≤ n. The next example illustrates the use of Bayes' Theorem.
Trang 29which i~ th~ ~ ~ .~lu~ we g~t by computing P(B,~r directly
In the previous example we can just as easily compute P(BlacklA ) directly
We will see a useful application of Bayes' Theorem in Section 2.4
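Here is a small illustrative Python sketch (not from the book) of the second form of Bayes' Theorem. The priors and likelihoods are assumed numbers, chosen only to be consistent with P(A) = 5/13 from Example 2.6; the actual counts in Figure 2.1 are not reproduced in this text.

```python
from fractions import Fraction

def bayes_posterior(priors, likelihoods):
    """Equality 2.4: P(E_i | F) = P(F | E_i) P(E_i) / sum_j P(F | E_j) P(E_j).
    The denominator is P(F) by the law of total probability (Equality 2.2)."""
    p_f = sum(priors[e] * likelihoods[e] for e in priors)
    return {e: priors[e] * likelihoods[e] / p_f for e in priors}

# Assumed values: 9 black and 4 white objects, with P(A | Black) = 1/3 and
# P(A | White) = 1/2.  These give P(A) = 5/13, matching Example 2.6.
priors = {"Black": Fraction(9, 13), "White": Fraction(4, 13)}
likelihoods = {"Black": Fraction(1, 3), "White": Fraction(1, 2)}

posterior = bayes_posterior(priors, likelihoods)
print(posterior["Black"])   # 3/5 under these assumed numbers
print(posterior["White"])   # 2/5
```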
2.2 Random Variables

In this section we present the formal definition and mathematical properties of a random variable. In Section 2.4 we show how they are developed in practice.

2.2.1 Probability Distributions of Random Variables

Definition 2.5 Given a probability space (Ω, P), a random variable X is a function whose domain is Ω. The range of X is called the space of X.
Example 2.9 Let Ω contain all outcomes of a throw of a pair of six-sided dice, and let P assign 1/36 to each outcome. Then Ω is the following set of ordered pairs:

Ω = {(1,1), (1,2), ..., (6,5), (6,6)}.

Two random variables X and Y are defined on Ω; X assigns the total number of spots showing on the dice to each outcome. The space of X is {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, and that of Y is {odd, even}.
X - 3 represents the event { (1, 2), (2, 1) } and
X - x, Y - y r e p r e s e n t s t h e event
{e such t h a t X ( e ) - x} A {e such t h a t Y(e) = y}
E x a m p l e 2 1 2 Let f~, P, X , and Y be as in Example 2.9 Then
X - 4, Y - odd represents t h e e v e n t { ( 1 , 3 ) , ( 3 , 1 ) } ,
and so
P ( X - 4, Y - o d d ) - 1/18
Trang 3118 C H A P T E R 2 P R O B A B I L I T Y A N D S T A T I S T I C S
We call P(X = x, Y = y) the joint probability distribution of X and Y. If A = {X, Y}, we also call this the joint probability distribution of A. Furthermore, we often just say "joint distribution" or "probability distribution." For brevity, we often use x, y to represent the event X = x, Y = y, and so we write P(x, y) instead of P(X = x, Y = y). This concept extends to three or more random variables. For example, P(X = x, Y = y, Z = z) is the joint probability distribution function of the random variables X, Y, and Z, and we often write P(x, y, z).

Example 2.13 Let Ω, P, X, and Y be as in Example 2.9. Then if x = 4 and y = odd,

P(x, y) = P(X = x, Y = y) = 1/18.

If we want to refer to all values of, for example, the random variables X and Y, we sometimes write P(X, Y) instead of P(X = x, Y = y) or P(x, y).

Example 2.14 Let Ω, P, X, and Y be as in Example 2.9. It is left as an exercise to show that for all values of x and y we have

P(X = x) = Σy P(X = x, Y = y).     (2.5)
Trang 32In Equality 2.5 the probability distribution P ( X - x) is called the m a r g i n a l
p r o b a b i l i t y d i s t r i b u t i o n of X relative to the joint distribution P ( X - x, Y -
y) because it is obtained using a process similar to adding across a row or column
in a table of numbers This concept also extends in a straightforward way to three or more random variables For example, if we have a joint distribution
P ( X - x, Y - y, Z - z) of X, Y, and Z, the marginal distribution P ( X -
x, Y - y) of X and Y is obtained by summing over all values of Z If A - {X, Y}, we also call this the m a r g i n a l p r o b a b i l i t y d i s t r i b u t i o n of A The next example reviews the concepts covered so far concerning random variables
Example 2.18 Let Ω be a set of 12 individuals, and let P assign 1/12 to each.

[table of the 12 individuals, giving each one's sex (S), height (H), and wage (W)]

The joint distribution of S and H is as follows:

s        h    P(s, h)
female   64   1/3
[remaining rows illegible]

The table that follows shows the first few values in the joint distribution of S, H, and W. There are 18 values in all, many of which are 0.

s        h    w        P(s, h, w)
female   64   30,000   1/6
female   64   40,000   1/6
female   64   50,000   0
female   68   30,000   1/12
We close with the chain rule for random variables, which says that given n random variables X1, X2, ..., Xn, defined on the same sample space Ω,

P(x1, x2, ..., xn) = P(xn | xn-1, xn-2, ..., x1) × ... × P(x2 | x1) × P(x1)

whenever P(x1, x2, ..., xn) ≠ 0. It is straightforward to prove this rule using the rule for conditional probability.
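A short illustrative Python sketch (not from the book) of marginalization and the chain rule follows. The joint distribution used here is an assumed one in the spirit of Example 2.18: only the entry P(female, 64) = 1/3 comes from the text, and the remaining entries are made-up values that sum to 1.

```python
from fractions import Fraction as F

# Assumed joint distribution P(S = s, H = h); only P(female, 64) = 1/3 is from the text.
joint = {
    ("female", 64): F(1, 3), ("female", 68): F(1, 6),
    ("male", 64): F(1, 6), ("male", 68): F(1, 6), ("male", 70): F(1, 6),
}

def marginal(joint, index):
    """Marginalize by summing the joint probabilities over the other variable (Equality 2.5)."""
    out = {}
    for values, p in joint.items():
        out[values[index]] = out.get(values[index], F(0)) + p
    return out

p_S = marginal(joint, 0)     # female and male each get 1/2
p_H = marginal(joint, 1)     # 64 -> 1/2, 68 -> 1/3, 70 -> 1/6
print(p_S, p_H)

# Chain rule for two variables: P(s, h) = P(h | s) P(s).
p_H_given_S = {(s, h): p / p_S[s] for (s, h), p in joint.items()}
rebuilt = {(s, h): p_H_given_S[(s, h)] * p_S[s] for (s, h) in joint}
print(rebuilt == joint)      # True
```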
Trang 34The notion of independence extends naturally to random variables
D e f i n i t i o n 2.6 Suppose we have a probability space (f~, P) and two random variables X and Y defined on ft Then X and Y are independent if, for all values z of X and y of Y, the events X = x and Y = y are independent When this is the case, we write
I p ( X , Y ) , where IF stands for independent in P
E x a m p l e 2.20 Let f~ be the set of all cards in an ordinary deck, and let P assign 1/52 to each card Define random variables as follows:
Variable Value Outcomes Mapped to This Value
/i~ rl All royal cards
r2 All nonroyal cards
s2 All nonspades .
Then the random variables R and S are independent That is,
Trang 35d e p e n d e n t g i v e n Z if for M1 values x of X , y of Y, and z of Z, whenever
P ( z ) ~ O, the events X = x and Y = y are conditionMly independent given the even Z = z When this is the case, we write
Outcomes Mapped to This Value
All objects containing an "A"
All objects containing a "B"
All square objects All circular objects All black objects All white objects
Then L and S are conditionally independent given C That is,
It is left as an exercise to show that it holds for the other combinations
Independence and conditional independence can also be defined for sets of random variables
Trang 362.2 R A N D O M V A R I A B L E S 23
Definition 2.8 Suppose we have a probability space (Ω, P) and two sets A and B containing random variables defined on Ω. Let a and b be sets of values of the random variables in A and B, respectively. The sets A and B are said to be independent if, for all values of the variables in the sets a and b, the events A = a and B = b are independent. When this is the case, we write

IP(A, B),

where IP stands for independent in P.
Example 2.22 Let Ω be the set of all cards in an ordinary deck, and let P assign 1/52 to each card. Define random variables as follows:

Variable   Value   Outcomes Mapped to This Value
R          r1      All royal cards
           r2      All nonroyal cards
T          t1      All tens and jacks
           t2      All cards that are neither tens nor jacks
S          s1      All spades
           s2      All nonspades

Then the sets {R, T} and {S} are independent. That is,

IP({R, T}, {S}).     (2.7)

It is left as an exercise to show that it holds for the other combinations.

When a set contains a single variable, we do not ordinarily show the braces. For example, we write Independency 2.7 as

IP({R, T}, S).
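Independency 2.7 can be checked by brute force. The following Python sketch (not from the book) enumerates the deck and verifies that P(r, t, s) = P(r, t)P(s) for every combination of values.

```python
from fractions import Fraction
from itertools import product

ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "jack", "queen", "king", "ace"]
suits = ["hearts", "clubs", "spades", "diamonds"]
deck = [(r, s) for r, s in product(ranks, suits)]

def R(card): return "r1" if card[0] in ("jack", "queen", "king") else "r2"
def T(card): return "t1" if card[0] in ("10", "jack") else "t2"
def S(card): return "s1" if card[1] == "spades" else "s2"

def prob(pred):
    return Fraction(sum(1 for c in deck if pred(c)), len(deck))

# IP({R, T}, S) holds iff P(r, t, s) = P(r, t) P(s) for all value combinations.
holds = all(
    prob(lambda c: R(c) == r and T(c) == t and S(c) == s)
    == prob(lambda c: R(c) == r and T(c) == t) * prob(lambda c: S(c) == s)
    for r, t, s in product(("r1", "r2"), ("t1", "t2"), ("s1", "s2"))
)
print(holds)   # True
```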
Definition 2.9 Suppose we have a probability space (Ω, P) and three sets A, B, and C containing random variables defined on Ω. Let a, b, and c be sets of values of the random variables in A, B, and C, respectively. Then the sets A and B are said to be conditionally independent given the set C if, for all values of the variables in the sets a, b, and c, whenever P(c) ≠ 0, the events A = a and B = b are conditionally independent given the event C = c. When this is the case, we write

IP(A, B | C).
Figure 2.2: Objects with five properties.

Example 2.23 Suppose we use the Principle of Indifference to assign probabilities to the objects in Figure 2.2, and define random variables as follows:

Variable   Value   Outcomes Mapped to This Value
V          v1      All objects containing a "1"
           v2      All objects containing a "2"
L          l1      All objects covered with lines
           l2      All objects not covered with lines
C          c1      All grey objects
           c2      All white objects
S          s1      All square objects
           s2      All circular objects
F          f1      All objects containing a number in a large font
           f2      All objects containing a number in a small font

It is left as an exercise to show, for all values of v, l, c, s, and f, that the events V = v, L = l and S = s, F = f are conditionally independent given C = c. So we have

IP({V, L}, {S, F} | C).
2.3 The Meaning of Probability

When one does not have the opportunity to study probability theory in depth, one is often left with the impression that all probabilities are computed using ratios. Next, we discuss the meaning of probability in more depth and show that this is not how probabilities are ordinarily determined.
Trang 382.3 THE M E A N I N G OF P R O B A B I L I T Y 25
A classic textbook example of probability concerns tossing a coin. Because the coin is symmetrical, we use the Principle of Indifference to assign

P(heads) = P(tails) = .5.

Suppose instead we toss a thumbtack. It can also land one of two ways. That is, it could land on its flat end, which we will call "heads," or it could land with the edge of the flat end and the point touching the ground, which we will call "tails." Because the thumbtack is not symmetrical, we have no reason to apply the Principle of Indifference and assign probabilities of .5 to both outcomes. How then should we assign the probabilities? In the case of the coin, when we assign P(heads) = .5, we are implicitly assuming that if we tossed the coin a large number of times it would land heads about half the time. That is, if we tossed the coin 1000 times, we would expect it to land heads about 500 times. This notion of repeatedly performing the experiment gives us a method for computing (or at least estimating) the probability. That is, if we repeat an experiment many times, we are fairly certain that the probability of an outcome is about equal to the fraction of times the outcome occurs. For example, a student tossed a thumbtack 10,000 times and it landed heads 3761 times. So

P(heads) ≈ 3761 / 10,000 = .3761.
Indeed, in 1919 Richard von Mises used the limit of this fraction as the definition of probability. That is, if n is the number of tosses and Sn is the number of times the thumbtack lands heads, then

P(heads) = lim (n→∞) Sn / n.

This approach to probability is called the relative frequency approach to probability, and probabilities obtained using this approach are called relative frequencies. A frequentist is someone who feels this is the only way we can obtain probabilities. Note that, according to this approach, we can never know a probability for certain. For example, if we tossed a coin 10,000 times and it landed heads 4991 times, we would estimate

P(heads) ≈ 4991 / 10,000 = .4991.

On the other hand, if we used the Principle of Indifference, we would assign P(heads) = .5. In the case of the coin, the probability may not actually be .5 because the coin may not be perfectly symmetrical. For example, Kerrich [1946] found that the six came up the most in the toss of a die and that one came up the least. This makes sense because, at that time, the spots on the die were hollowed out of the die. So the die was lightest on the side with a six. On the other hand, in experiments involving cards or urns, it seems we can be certain of probabilities obtained using the Principle of Indifference.

Example 2.24 Suppose we toss an asymmetrical six-sided die, and in 1000 tosses we observe the six sides coming up the following number of times:

[table of counts for each of the six sides, from which relative frequencies are computed]
Example 2.25 Suppose our population is all males in the United States between the ages of 31 and 85, and we are interested in the probability of such males having high blood pressure. Then if we sample 10,000 males, this set of males is our sample. Furthermore, if 3210 have high blood pressure, we estimate

P(High Blood Pressure) ≈ 3210 / 10,000 = .321.

This is an estimate of the fraction of males in the group that have high blood pressure. In theory, we would have to have an infinite number of males to determine the probability exactly. The current set of males in this age group is called a finite population. The fraction of them with high blood pressure is the probability of obtaining a male with high blood pressure when we sample him from the set of all males in the age group. This latter probability is simply the ratio of males with high blood pressure. When doing statistical inference, we sometimes want to estimate the ratio in a finite population from a sample of the population, and other times we want to estimate a propensity from a finite sequence of observations. For example, TV raters ordinarily want to estimate the actual fraction of people in a nation watching a show from a sample of those people. On the other hand, medical scientists want to estimate the propensity with which males tend to have high blood pressure from a finite sequence of males. One can create an infinite sequence from a finite population by returning a sampled item back to the population before sampling the next item. This is called "sampling with replacement." In practice, it is rarely done, but ordinarily the finite population is so large that statisticians make the simplifying assumption that it is done. That is, they do not replace the item, but still assume the ratio is unchanged for the next item sampled.

When sampling, the observed relative frequency is called the maximum likelihood estimate of the probability (limit of the relative frequency) because it is the estimate of the probability that makes the observed sequence most probable when we assume the trials (repetitions of the experiment) are probabilistically independent.
Example 2.26 Suppose we toss a thumbtack four times and we observe the sequence [heads, tails, heads, heads]. Then the maximum likelihood estimate of P(heads) is 3/4.
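The following illustrative Python sketch (not from the book) makes both ideas concrete: the relative frequency settles toward the underlying propensity as the number of tosses grows, and the maximum likelihood estimate from a short sequence is simply the observed fraction of heads. The propensity value used in the simulation is an assumption, set to the student's observed fraction of .3761.

```python
import random

random.seed(0)
TRUE_P_HEADS = 0.3761   # assumed propensity for the simulated thumbtack

def count_heads(n):
    """Simulate n independent tosses; return the number of heads."""
    return sum(random.random() < TRUE_P_HEADS for _ in range(n))

for n in (10, 100, 1000, 10000, 100000):
    s_n = count_heads(n)
    print(n, s_n / n)        # the relative frequency S_n / n approaches .3761 as n grows

# Maximum likelihood estimate from the four observed tosses of Example 2.26:
observed = ["heads", "tails", "heads", "heads"]
print(observed.count("heads") / len(observed))   # 0.75
```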