
Probabilistic methods for financial and marketing informatics


Probabilistic Methods

for Financial and Marketing Informatics


Morgan Kaufmann Publishers is an imprint of Elsevier.

500 Sansome Street, Suite 400, San Francisco, CA 94111

This book is printed on acid-free paper.

© 2007 by Elsevier Inc. All rights reserved.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, scanning, or otherwise) without prior written permission of the publisher. Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.com. You may also complete your request online via the Elsevier homepage (http://elsevier.com), by selecting "Support Contact," then "Copyright and Permission," and then "Obtaining Permissions."

Library of Congress Cataloging-in-Publication Data

Working together to grow

libraries in developing countries

www.elsevier.com | www.bookaid.org | www.sabre.org


Preface

This book is based on a course I recently developed for computer science majors at Northeastern Illinois University (NEIU). The motivation for developing this course came from guidance I obtained from the NEIU Computer Science Department Advisory Board. One objective of this Board is to advise the Department concerning the maintenance of curricula that are relevant to the needs of companies in Chicagoland. The Board consists of individuals in IT departments from major companies such as Walgreens, AON Company, United Airlines, Harris Bank, and Microsoft. After the dot-com bust and the introduction of outsourcing, it became evident that students trained only in the fundamentals of computer science, programming, web design, etc., often did not have the skills to compete in the current U.S. job market. So I asked the Advisory Board what else the students should know. The board unanimously felt the students needed business skills such as knowledge of IT project management, marketing, and finance. As a result, our revised curriculum, for students who hoped to obtain employment immediately following graduation, contained a number of business courses. However, several members of the board said they'd like to see students equipped with knowledge of cutting-edge applications of computer science to areas such as decision analysis, risk management, data mining, and market basket analysis. I realized that some of the best work in these areas was being done in my own field, namely Bayesian networks. After consulting with colleagues worldwide and checking on topics taught in similar programs at other universities, I decided it was time for a course on applying probabilistic reasoning to business problems. So my new course, called "Informatics for MIS Students," and this book, called Probabilistic Methods for Financial and Marketing Informatics, were conceived.

Part I covers the basics of Bayesian networks and decision analysis. Much of this material is based on my 2004 book Learning Bayesian Networks; however, I've tried to make the material more accessible. Rather than dwelling on rigor, algorithms, and proofs of theorems, I concentrate on showing examples and using the software package Netica to represent and solve problems. The specific content of Part I is as follows: Chapter 1 provides a definition of informatics and probabilistic informatics. Chapter 2 reviews the probability and statistics needed to understand the remainder of the book. Chapter 3 presents Bayesian networks and inference in Bayesian networks. Chapter 4 concerns learning Bayesian networks from data. Chapter 5 introduces decision analysis.


in these sections seems inherently more difficult than most of the other material in the book. However, they do not require as background the material from Part I that is marked with a star (*). Chapter 8 discusses modeling real options, which concerns decisions a company must make as to what projects it should pursue. Chapter 9 covers venture capital decision making, which is the process of deciding whether to invest money in a start-up company. Chapter 10 discusses a model for bankruptcy prediction.

Part III contains chapters on two important areas of marketing. First, Chapter 11 shows methods for doing collaborative filtering and market basket analysis. These disciplines concern determining what products an individual might prefer based on how the individual feels about other products. Finally, Chapter 12 presents a technique for doing targeted advertising, which is the process of identifying those customers to whom advertisements should be sent.

There is too much material for me to cover the entire book in a one-semester course at NEIU. Since the course requires discrete mathematics and business statistics as prerequisites, I only review most of the material in Chapter 2. However, I do discuss conditional independence in depth because ordinarily the students have not been exposed to this concept. I then cover the following sections from the remainder of the book:


The course is titled "Informatics for MIS Students," and is a required course in the MIS (Management Information Science) concentration of NEIU's Computer Science M.S. Degree Program. This book should be appropriate for any similar course in an MIS, computer science, business, or MBA program. It is intended for upper-level undergraduate and graduate students. Besides having taken one or two courses covering basic probability and statistics, it would be useful but not necessary for the student to have studied data structures. Part I of the book could also be used for the first part of any course involving probabilistic reasoning using Bayesian networks. That is, although many of the examples in Part I concern the stock market and applications to business problems, I've presented the material in a general way. Therefore, an instructor could use Part I to cover basic concepts and then provide papers relative to a particular domain of interest. For example, if the course is "Probabilistic Methods for Medical Informatics," the instructor could cover Part I of this book and then provide papers concerning applications in the medical domain.

For the most part, the applications discussed in Part II were the results of research done at the School of Business of the University of Kansas, while the applications in Part III were the results of research done by the Machine Learning and Applied Statistics Group of Microsoft Research. The reason is not that I have any particular affiliations with either of these institutions. Rather, I did an extensive search for financial and marketing applications, and the ones I found that seemed to be most carefully designed and evaluated came from these institutions.

I thank Catherine Shenoy for reviewing the chapter on investment science and Dawn Holmes, Francisco Javier Díez, and Padmini Jyotishmati for reviewing the entire book. They all offered many useful comments and criticisms. I thank Prakash Shenoy and Edwin Burmeister for correspondence concerning some of the content of the book. I thank my co-author, Xia Jiang, for giving me the idea to write this book in the first place, and for her efforts on the book itself. Finally, I thank Prentice Hall for granting me permission to reprint material.

Rich Neapolitan
RE-Neapolitan@neiu.edu


Contents

I Bayesian Networks and Decision Analysis

1 Probabilistic Informatics 3
1.1 What Is Informatics? 4
1.2 Probabilistic Informatics 6
1.3 Outline of This Book 7

2 Probability and Statistics 9
2.1 Probability Basics 9
2.1.1 Probability Spaces 10
2.1.2 Conditional Probability and Independence 12
2.1.3 Bayes' Theorem 15
2.2 Random Variables 16
2.2.1 Probability Distributions of Random Variables 16
2.2.2 Independence of Random Variables 21
2.3 The Meaning of Probability 24
2.3.1 Relative Frequency Approach to Probability 25
2.3.2 Subjective Approach to Probability 28
2.4 Random Variables in Applications 30
2.5 Statistical Concepts 34
2.5.1 Expected Value 34
2.5.2 Variance and Covariance 35
2.5.3 Linear Regression 41

3 Bayesian Networks 53
3.1 What Is a Bayesian Network? 54
3.2 Properties of Bayesian Networks 56
3.2.1 Definition of a Bayesian Network 56
3.2.2 Representation of a Bayesian Network 59
3.3 Causal Networks as Bayesian Networks 63
3.3.1 Causality 63
3.3.2 Causality and the Markov Condition 68


3.3.3 The Markov Condition without Causality 71
3.4 Inference in Bayesian Networks 72
3.4.1 Examples of Inference 73
3.4.2 Inference Algorithms and Packages 75
3.4.3 Inference Using Netica 77
3.5 How Do We Obtain the Probabilities? 78
3.5.1 The Noisy OR-Gate Model 79
3.5.2 Methods for Discretizing Continuous Variables * 86
3.6 Entailed Conditional Independencies * 92
3.6.1 Examples of Entailed Conditional Independencies 92
3.6.2 d-Separation 95
3.6.3 Faithful and Unfaithful Probability Distributions 99
3.6.4 Markov Blankets and Boundaries 102

4 Learning Bayesian Networks 111
4.1 Parameter Learning 112
4.1.1 Learning a Single Parameter 112
4.1.2 Learning All Parameters in a Bayesian Network 119
4.2 Learning Structure (Model Selection) 126
4.3 Score-Based Structure Learning * 127
4.3.1 Learning Structure Using the Bayesian Score 127
4.3.2 Model Averaging 137
4.4 Constraint-Based Structure Learning 138
4.4.1 Learning a DAG Faithful to P 138
4.4.2 Learning a DAG in Which P Is Embedded Faithfully * 144
4.5 Causal Learning 145
4.5.1 Causal Faithfulness Assumption 145
4.5.2 Causal Embedded Faithfulness Assumption * 148
4.6 Software Packages for Learning 151
4.7 Examples of Learning 153
4.7.1 Learning Bayesian Networks 153
4.7.2 Causal Learning 162

5 Decision Analysis Fundamentals 177
5.1 Decision Trees 178
5.1.1 Simple Examples 178
5.1.2 Solving More Complex Decision Trees 182
5.2 Influence Diagrams 195
5.2.1 Representing with Influence Diagrams 195
5.2.2 Solving Influence Diagrams 202
5.2.3 Techniques for Solving Influence Diagrams * 202
5.2.4 Solving Influence Diagrams Using Netica 207
5.3 Dynamic Networks * 212
5.3.1 Dynamic Bayesian Networks 212
5.3.2 Dynamic Influence Diagrams 219


6 Further Techniques in Decision Analysis 229
6.1 Modeling Risk Preferences 230
6.1.1 The Exponential Utility Function 231
6.1.2 A Decreasing Risk-Averse Utility Function 235
6.2 Analyzing Risk Directly 236
6.2.1 Using the Variance to Measure Risk 236
6.2.2 Risk Profiles 238
6.3 Dominance 240
6.3.1 Deterministic Dominance 240
6.3.2 Stochastic Dominance 241
6.3.3 Good Decision versus Good Outcome 243
6.4 Sensitivity Analysis 244
6.4.1 Simple Models 244
6.4.2 A More Detailed Model 250
6.5 Value of Information 254
6.5.1 Expected Value of Perfect Information 255
6.5.2 Expected Value of Imperfect Information 257
6.6 Normative Decision Analysis 259

II Financial Applications 265

7 Investment Science 267
7.1 Basics of Investment Science 267
7.1.1 Interest 267
7.1.2 Net Present Value 270
7.1.3 Stocks 271
7.1.4 Portfolios 276
7.1.5 The Market Portfolio 276
7.1.6 Market Indices 277
7.2 Advanced Topics in Investment Science * 278
7.2.1 Mean-Variance Portfolio Theory 278
7.2.2 Market Efficiency and CAPM 285
7.2.3 Factor Models and APT 296
7.2.4 Equity Valuation Models 303
7.3 A Bayesian Network Portfolio Risk Analyzer * 314
7.3.1 Network Structure 315
7.3.2 Network Parameters 317
7.3.3 The Portfolio Value and Adding Evidence 319

8 Modeling Real Options 329
8.1 Solving Real Options Decision Problems 330
8.2 Making a Plan 339
8.3 Sensitivity Analysis 340


9 Venture Capital Decision Making 343
9.1 A Simple VC Decision Model 345
9.2 A Detailed VC Decision Model 347
9.3 Modeling Real Decisions 350
9.A Appendix 352

10 Bankruptcy Prediction 357
10.1 A Bayesian Network for Predicting Bankruptcy 358
10.1.1 Naive Bayesian Networks 358
10.1.2 Constructing the Bankruptcy Prediction Network 358
10.2 Experiments 364
10.2.1 Method 364
10.2.2 Results 366
10.2.3 Discussion 369

III Marketing Applications 371

11 Collaborative Filtering 373
11.1 Memory-Based Methods 374
11.2 Model-Based Methods 377
11.2.1 Probabilistic Collaborative Filtering 377
11.2.2 A Cluster Model 378
11.2.3 A Bayesian Network Model 379
11.3 Experiments 380
11.3.1 The Data Sets 380
11.3.2 Method 380
11.3.3 Results 382

12 Targeted Advertising 387
12.1 Class Probability Trees 388
12.2 Application to Targeted Advertising 390
12.2.1 Calculating Expected Lift in Profit 390
12.2.2 Identifying Subpopulations with Positive ELPs 392
12.2.3 Experiments 393


About the Authors

Richard E. Neapolitan is Professor and Chair of Computer Science at Northeastern Illinois University. He has previously written three books, including the seminal 1990 Bayesian network text Probabilistic Reasoning in Expert Systems. More recently, he wrote the 2004 text Learning Bayesian Networks and Foundations of Algorithms, one of the most widely-used algorithms texts world-wide. His books have the reputation of making difficult concepts easy to understand because of the logical flow of the material, the simplicity of the explanations, and the clear examples.

Xia Jiang received an M.S. in Mechanical Engineering from Rose-Hulman University and is currently a Ph.D. candidate in the Biomedical Informatics Program at the University of Pittsburgh. She has published theoretical papers concerning Bayesian networks, along with applications of Bayesian networks to biosurveillance.


Part I

Bayesian Networks and Decision Analysis


Chapter 1

Probabilistic Informatics

Informatics programs in the United States go back at least to the 1980s, when Stanford University offered a Ph.D. in medical informatics. Since that time, a number of informatics programs in other disciplines have emerged at universities throughout the United States. These programs go by various names, including bioinformatics, medical informatics, chemical informatics, music informatics, marketing informatics, etc. What do these programs have in common? To answer that question we must articulate what we mean by the term "informatics." Since other disciplines are usually referenced when we discuss informatics, some define informatics as the application of information technology in the context of another field. However, such a definition does not really tell us the focus of informatics itself. First, we explain what we mean by the term informatics. Then we discuss why we have chosen to concentrate on the probabilistic approach in this book. Finally, we provide an outline of the material that will be covered in the rest of the book.


1.1 What Is Informatics?

In much of western Europe, informatics has come to mean the rough translation of the English "computer science," which is the discipline that studies computable processes. Certainly, there is overlap between computer science programs and informatics programs, but they are not the same. Informatics programs ordinarily investigate subjects such as biology and medicine, whereas computer science programs do not. So the European definition does not suffice for the way the word is currently used in the United States.

To gain insight into the meaning of informatics, let us consider the suffix "-ics," which means the science, art, or study of some entity. For example, "linguistics" is the study of the nature of language, "economics" is the study of the production and distribution of goods, and "photonics" is the study of electromagnetic energy whose basic unit is the photon. Given this, informatics should be the study of information. Indeed, WordNet 2.1 defines informatics as "the science concerned with gathering, manipulating, storing, retrieving and classifying recorded information." To proceed from this definition we need to define the word "information." Most dictionary definitions do not help as far as giving us anything concrete. That is, they define information either as knowledge or as a collection of data, which means we are left with the situation of determining the meaning of knowledge and data. To arrive at a concrete definition of informatics, let's define data, information, and knowledge first.

By datum we mean a character string that can be recognized as a unit. For example, the nucleotide G in the nucleotide sequence GATC is a datum, the field "cancer" in a record in a medical data base is a datum, and the field "Gone with the Wind" in a movie data base is a datum. Note that a single character, a word, or a group of words can be a datum depending on the particular application. Data then are more than one datum. By information we mean the meaning given to data. For example, in a medical data base the data "Joe Smith" and "cancer" in the same record mean that Joe Smith has cancer. By knowledge we mean dicta which enable us to infer new information from existing information. For example, suppose we have the following item of knowledge (dictum):1

1. Such an item of knowledge would be part of a rule-based expert system.

Finally, we define informatics as the discipline that applies the methodologies of science and engineering to information. It concerns organizing data into information, learning knowledge from information, learning new information from existing information and knowledge, and making decisions based on the knowledge and information learned. We use engineering to develop the algorithms that learn knowledge from information and that learn information from information and knowledge. We use science to test the accuracy of these algorithms.

Next, we show several examples that illustrate how informatics pertains to other disciplines.

Example 1.1 (medical informatics) Suppose we have a large data file of patient records as follows:

(The table of records, with yes/no entries for fields such as smoking and lung cancer, is not legible in this extraction.)

... to obtain the new information that "there is a 5% chance Joe Smith also has lung cancer."
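As a rough illustration of the counting that underlies such an inference (this is not the book's method, and the field names and tiny record list are hypothetical), the following Python sketch estimates the conditional probability of lung cancer given smoking directly from records:

```python
# Hypothetical patient records; in practice these would come from a large data file.
records = [
    {"name": "Joe Smith", "smoker": True,  "lung_cancer": None},   # unknown outcome
    {"name": "Ann Lee",   "smoker": True,  "lung_cancer": False},
    {"name": "Bob King",  "smoker": True,  "lung_cancer": True},
    {"name": "Sue Hall",  "smoker": False, "lung_cancer": False},
]

def conditional_probability(records, target, given):
    """Estimate P(target = True | given = True) as a ratio of record counts,
    skipping records where the target value is unknown (None)."""
    matching = [r for r in records if r[given] and r[target] is not None]
    if not matching:
        return None
    return sum(1 for r in matching if r[target]) / len(matching)

p = conditional_probability(records, target="lung_cancer", given="smoker")
print(f"Estimated P(lung cancer | smoker) = {p:.2f}")
```

The numbers printed here are, of course, only a product of the made-up records; the 5% figure in the example would come from a much larger file.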

Example 1.2 (bioinformatics) Suppose we have long homologous DNA sequences from the human, the chimpanzee, the gorilla, the orangutan, and the rhesus monkey. From this information we can use the methodologies of informatics to obtain the new information that it is most probable that the human and the chimpanzee are the most closely related of the five species.

Example 1.3 (marketing informatics) Suppose we have a large data file of movie ratings as follows:

(The ratings table, with columns Aviator, Shall We Dance, Dirty Dancing, and Vanity Fair and one row of ratings per person, is not legible in this extraction.)

This means, for example, that Person 1 rated Aviator the lowest (1) and Shall We Dance the highest (5). From the information in this data file, we can develop a knowledge system that will enable us to estimate how an individual will rate a particular movie. For example, suppose Kathy Black rates Aviator as 1, Shall We Dance as 5, and Dirty Dancing as 5. The system could estimate how Kathy will rate Vanity Fair. Just by eyeballing the data in the five records shown, we see that Kathy's ratings on the first three movies are similar to those of Persons 1, 4, and 5. Since they all rated Vanity Fair high, based on these five records, we would suspect Kathy would rate it high. An informatics algorithm can formalize a way to make these predictions. This task of predicting the utility of an item to a particular user based on the utilities assigned by other users is called collaborative filtering.
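One minimal way to formalize the "eyeballing" above is memory-based collaborative filtering: weight each previous rater by how closely their ratings agree with Kathy's on the movies she has rated, then predict her rating of Vanity Fair as a weighted average. The sketch below is only an illustration with hypothetical ratings and a simple similarity measure chosen for clarity; the methods actually evaluated in the book are developed in Chapter 11.

```python
# Hypothetical ratings on a 1-5 scale; every person here has rated all four movies.
ratings = {
    "Person 1": {"Aviator": 1, "Shall We Dance": 5, "Dirty Dancing": 5, "Vanity Fair": 5},
    "Person 2": {"Aviator": 5, "Shall We Dance": 1, "Dirty Dancing": 2, "Vanity Fair": 1},
    "Person 3": {"Aviator": 4, "Shall We Dance": 2, "Dirty Dancing": 1, "Vanity Fair": 2},
    "Person 4": {"Aviator": 2, "Shall We Dance": 5, "Dirty Dancing": 4, "Vanity Fair": 5},
    "Person 5": {"Aviator": 1, "Shall We Dance": 4, "Dirty Dancing": 5, "Vanity Fair": 4},
}
kathy = {"Aviator": 1, "Shall We Dance": 5, "Dirty Dancing": 5}

def similarity(a, b, common):
    # Larger when the two users' ratings agree on the commonly rated movies.
    distance = sum(abs(a[m] - b[m]) for m in common)
    return 1.0 / (1.0 + distance)

def predict(target, others, movie):
    weighted_sum, weight_total = 0.0, 0.0
    for user_ratings in others.values():
        if movie not in user_ratings:
            continue
        common = [m for m in target if m in user_ratings]
        if not common:
            continue
        w = similarity(target, user_ratings, common)
        weighted_sum += w * user_ratings[movie]
        weight_total += w
    return weighted_sum / weight_total if weight_total else None

print("Predicted rating for Vanity Fair:", round(predict(kathy, ratings, "Vanity Fair"), 2))
```

With these made-up numbers the prediction comes out close to 5, because the similar raters (Persons 1, 4, and 5) dominate the weighted average, mirroring the informal argument above.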

In this book we concentrate on two related areas of informatics, namely financial informatics and marketing informatics. Financial informatics involves applying the methods of informatics to the management of money and other assets. In particular, it concerns determining the risk involved in some financial venture. As an example, we might develop a tool to improve portfolio risk analysis. Marketing informatics involves applying the methods of informatics to promoting and selling products or services. For example, we might determine which advertisements should be presented to a given Web user based on that user's navigation pattern.

Before ending this section, let's discuss the relationship between informatics and the relatively new expression "data mining." The term data mining can be traced back to the First International Conference on Knowledge Discovery and Data Mining (KDD-95) in 1995. Briefly, data mining is the process of extrapolating unknown knowledge from a large amount of observational data. Recall that we said informatics concerns (1) organizing data into information, (2) learning knowledge from information, (3) learning new information from existing information and knowledge, and (4) making decisions based on the knowledge and information learned. So, technically speaking, data mining is a subfield of informatics that includes only the first two of these procedures. However, both terms are still evolving, and some individuals use data mining to refer to all four procedures.

1.2 Probabilistic Informatics

As can be seen in Examples 1.1, 1.2, and 1.3, the knowledge we use to process information often does not consist of IF-THEN rules, such as the one concerning plants discussed earlier. Rather, we only know relationships such as "smoking makes lung cancer more likely." Similarly, our conclusions are uncertain. For example, we feel it is most likely that the closest living relative of the human is the chimpanzee, but we are not certain of this. So ordinarily we must reason under uncertainty when handling information and knowledge. In the 1960s and 1970s a number of new formalisms for handling uncertainty were developed, including certainty factors, the Dempster-Shafer Theory of Evidence, fuzzy logic, and fuzzy set theory. Probability theory has a long history of representing uncertainty in a formal axiomatic way. Neapolitan [1990] contrasts the various


us to prove results based on assumptions concerning a system. An example of a heuristic algorithm is the one developed for collaborative filtering in Chapter 11, Section 11.1.

An abstract model is a theoretical construct that represents a physical process with a set of variables and a set of quantitative relationships (axioms) among them. We use models so we can reason within an idealized framework and thereby make predictions/determinations about a system. We can mathematically prove these predictions/determinations are "correct," but they are correct only to the extent that the model accurately represents the system. A model-based algorithm therefore makes predictions/determinations within the framework of some model. Algorithms that make predictions/determinations within the framework of probability theory are model-based algorithms. We can prove results concerning these algorithms based on the axioms of probability theory, which are discussed in Chapter 2. We concentrate on such algorithms in this book. In particular, we present algorithms that use Bayesian networks to reason within the framework of probability theory.

1.3 Outline of This Book

In Part I we cover the basics of Bayesian networks and decision analysis. Chapter 2 reviews the probability and statistics necessary to understanding the remainder of the book. In Chapter 3 we present Bayesian networks, which are graphical structures that represent the probabilistic relationships among many related variables. Bayesian networks have become one of the most prominent architectures for representing multivariate probability distributions and enabling probabilistic inference using such distributions. Chapter 4 shows how we can learn Bayesian networks from data. A Bayesian network augmented with a value node and decision nodes is called an influence diagram. We can use an influence diagram to recommend a decision based on the uncertain relationships among the variables and the preferences of the user. The field that investigates such decisions is called decision analysis. Chapter 5 introduces decision analysis, while Chapter 6 covers further topics in decision analysis. Once you have completed Part I, you should have a basic understanding of how Bayesian networks and decision analysis can be used to represent and solve real-world problems.

Parts II and III then cover applications to specific problems. Part II covers financial applications. Specifically, Chapter 7 presents the basics of investment science and develops a Bayesian network for portfolio risk analysis. In Chapter 8 we discuss the modeling of real options, which concerns decisions a company must make as to what projects it should pursue. Chapter 9 covers venture capital decision making, which is the process of deciding whether to invest money in a start-up company. In Chapter 10 we show an application to bankruptcy prediction. Finally, Part III contains chapters on two of the most important areas of marketing. First, Chapter 11 shows an application to collaborative filtering/market basket analysis. These disciplines concern determining what products an individual might prefer based on how the user feels about other products. Second, Chapter 12 presents an application to targeted advertising, which is the process of identifying those customers to whom advertisements should be sent.

2. Fuzzy set theory and fuzzy logic model a different class of problems than probability theory and therefore complement probability theory rather than compete with it. See [Zadeh, 1995] or [Neapolitan, 1992].


Chapter 2

Probability and Statistics

This chapter reviews the probability and statistics you need to read the remainder of this book. In Section 2.1 we present the basics of probability theory, while in Section 2.2 we review random variables. Section 2.3 briefly discusses the meaning of probability. In Section 2.4 we show how random variables are used in practice. Finally, Section 2.5 reviews concepts in statistics, such as expected value, variance, and covariance.

2.1 Probability Basics

After defining probability spaces, we discuss conditional probability, independence and conditional independence, and Bayes' Theorem.


The set of all outcomes is called the sample space or population. Mathematicians ordinarily say "sample space," while social scientists ordinarily say "population." We will say sample space. In this simple review we assume the sample space is finite. Any subset of a sample space is called an event. A subset containing exactly one element is called an elementary event.

Example 2.1 Suppose we have the experiment of drawing the top card from an ordinary deck of cards. Then the set

E = {jack of hearts, jack of clubs, jack of spades, jack of diamonds}

is an event, and the set

{jack of hearts}

is an elementary event.

Definition 2.1 Suppose we have a sample space Ω containing n distinct elements; that is,

Ω = {e1, e2, ..., en}.

A function that assigns a real number P(E) to each event E ⊆ Ω is called a probability function on the set of subsets of Ω if it satisfies the following conditions:

1. 0 ≤ P(ei) ≤ 1 for 1 ≤ i ≤ n.

2. P(e1) + P(e2) + ··· + P(en) = 1.

3. For each event E that is not an elementary event, P(E) is the sum of the probabilities of the elementary events whose outcomes are in E. For example, if

E = {e3, e6, e8},

then

P(E) = P(e3) + P(e6) + P(e8).

The pair (Ω, P) is called a probability space.

Because probability is defined as a function whose domain is a set of sets, we should write P({ei}) instead of P(ei) when denoting the probability of an elementary event. However, for the sake of simplicity, we do not do this. In the same way, we write P(e3, e6, e8) instead of P({e3, e6, e8}).

The most straightforward way to assign probabilities is to use the Principle of Indifference, which says that outcomes are to be considered equiprobable if we have no reason to expect one over the other. According to this principle, when there are n elementary events, each has probability equal to 1/n.

Example 2.2 Let the experiment be tossing a coin. Then the sample space is Ω = {heads, tails}, and, according to the Principle of Indifference, we assign

P(heads) = P(tails) = .5.

We stress that there is nothing in the definition of a probability space that says we must assign the value of .5 to the probabilities of heads and tails. We could assign P(heads) = .7 and P(tails) = .3. However, if we have no reason to expect one outcome over the other, we give them the same probability.

Example 2.3 Let the experiment be drawing the top card from a deck of 52 cards. Then Ω contains the faces of the 52 cards, and, according to the Principle of Indifference, we assign P(e) = 1/52 for each e ∈ Ω. For example,

P(jack of hearts) = 1/52.

The event

E = {jack of hearts, jack of clubs, jack of spades, jack of diamonds}

means that the card drawn is a jack. Its probability is

P(E) = P(jack of hearts) + P(jack of clubs) + P(jack of spades) + P(jack of diamonds).

We have Theorem 2.1 concerning probability spaces. Its proof is left as an exercise.


Theorem 2.1 Let (Ω, P) be a probability space. Then

1. P(Ω) = 1.

2. 0 ≤ P(E) ≤ 1 for every E ⊆ Ω.

3. For every two subsets E and F of Ω such that E ∩ F = ∅,

P(E ∪ F) = P(E) + P(F),

where ∅ denotes the empty set.

Example 2.4 Suppose we draw the top card from a deck of cards. Denote by Queen the set containing the 4 queens and by King the set containing the 4 kings. Then

P(Queen ∪ King) = P(Queen) + P(King) = 1/13 + 1/13 = 2/13

because Queen ∩ King = ∅. Next denote by Spade the set containing the 13 spades. The sets Queen and Spade are not disjoint, so their probabilities are not additive. However, it is not hard to prove that, in general,

P(E ∪ F) = P(E) + P(F) - P(E ∩ F).

So

P(Queen ∪ Spade) = P(Queen) + P(Spade) - P(Queen ∩ Spade) = 4/52 + 13/52 - 1/52 = 4/13.
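Because these probabilities come from the Principle of Indifference, they can also be checked by brute-force counting. The short Python sketch below (an illustration, not part of the book) enumerates the 52 equally likely outcomes and confirms that P(Queen ∪ Spade) computed by counting matches the value given by the rule P(E ∪ F) = P(E) + P(F) - P(E ∩ F):

```python
from fractions import Fraction

ranks = ["ace", "2", "3", "4", "5", "6", "7", "8", "9", "10", "jack", "queen", "king"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = [(r, s) for r in ranks for s in suits]           # 52 equally likely outcomes

def prob(event):
    """Probability of an event (a set of outcomes) under the Principle of Indifference."""
    return Fraction(len(event), len(deck))

queen = {c for c in deck if c[0] == "queen"}
spade = {c for c in deck if c[1] == "spades"}

direct = prob(queen | spade)                            # count the union directly
by_rule = prob(queen) + prob(spade) - prob(queen & spade)  # inclusion-exclusion

print(direct, by_rule, direct == by_rule)               # 4/13 4/13 True
```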

2.1.2 Conditional Probability and Independence

We start with a definition.

Definition 2.2 Let E and F be events such that P(F) ≠ 0. Then the conditional probability of E given F, denoted P(E|F), is given by

P(E|F) = P(E ∩ F) / P(F).

We can gain intuition for this definition by considering probabilities that are assigned using the Principle of Indifference. In this case, P(E|F), as defined above, is the ratio of the number of items in E ∩ F to the number of items in F. We show this as follows. Let n be the number of items in the sample space, nF be the number of items in F, and nEF be the number of items in E ∩ F. Then

P(E ∩ F) / P(F) = (nEF / n) / (nF / n) = nEF / nF,

which is the ratio of the number of items in E ∩ F to the number of items in F. As far as the meaning is concerned, P(E|F) is our belief that E contains the outcome given that we know F contains the outcome.


Example 2.5 Again, consider drawing the top card from a deck of cards. Let Jack be the set of the 4 jacks, RedRoyalCard be the set of the 6 red royal cards, and Club be the set of the 13 clubs. Then

P(Jack) = 4/52 = 1/13

P(Jack|RedRoyalCard) = P(Jack ∩ RedRoyalCard) / P(RedRoyalCard) = (2/52) / (6/52) = 1/3

P(Jack|Club) = P(Jack ∩ Club) / P(Club) = (1/52) / (13/52) = 1/13.

Notice in the previous example that P(Jack|Club) = P(Jack). This means that finding out the card is a club does not change the likelihood that it is a jack. We say that the two events are independent in this case, which is formalized in the following definition:

Definition 2.3 Two events E and F are independent if one of the following holds:

1. P(E|F) = P(E) and P(E) ≠ 0, P(F) ≠ 0.

2. P(E) = 0 or P(F) = 0.

E and F are independent if and only if P(E ∩ F) = P(E)P(F).

If you've previously studied probability, you should have already been introduced to the concept of independence. However, a generalization of independence, called conditional independence, is not covered in many introductory texts. This concept is important to the applications discussed in this book. We discuss it next.

Definition 2.4 Two events E and F are conditionally independent given G if P(G) ≠ 0 and one of the following holds:

1. P(E|F ∩ G) = P(E|G) and P(E|G) ≠ 0, P(F|G) ≠ 0.

2. P(E|G) = 0 or P(F|G) = 0.

Notice that this definition is identical to the definition of independence except that everything is conditional on G. The definition entails that E and F are independent once we know that the outcome is in G. The next example illustrates this.


Example 2.6 Let Ω be the set of all objects in Figure 2.1. Using the Principle of Indifference, we assign a probability of 1/13 to each object. Let Black be the set of all black objects, White be the set of all white objects, Square be the set of all square objects, and A be the set of all objects containing an "A." We then have

P(A) = 5/13

P(A|Square) = 3/8.

So A and Square are not independent. However,

P(A|Black) = P(A|Square ∩ Black)

and

P(A|White) = P(A|Square ∩ White).

So A and Square are conditionally independent given Black, and also conditionally independent given White.

Next, we discuss an important rule involving conditional probabilities. Suppose we have n events E1, E2, ..., En such that

Ei ∩ Ej = ∅ for i ≠ j

and

E1 ∪ E2 ∪ ··· ∪ En = Ω.

Such events are called mutually exclusive and exhaustive. Then the law of total probability says for any other event F,

P(F) = P(F ∩ E1) + P(F ∩ E2) + ··· + P(F ∩ En).   (2.1)


You are asked to prove this rule in the exercises. If P(Ei) ≠ 0, then P(F ∩ Ei) = P(F|Ei)P(Ei). Therefore, if P(Ei) ≠ 0 for all i, the law is often applied in the following form:

P(F) = P(F|E1)P(E1) + P(F|E2)P(E2) + ··· + P(F|En)P(En).   (2.2)

Example 2.7 Suppose we have the objects discussed in Example 2.6. Then according to the law of total probability,

P(A) = P(A|Black)P(Black) + P(A|White)P(White).

Theorem 2.2 Given two events E and F such that P(E) ≠ 0 and P(F) ≠ 0, we have

P(E|F) = P(F|E)P(E) / P(F).   (2.3)

Furthermore, given n mutually exclusive and exhaustive events E1, E2, ..., En such that P(Ei) ≠ 0 for all i, we have for 1 ≤ i ≤ n,

P(Ei|F) = P(F|Ei)P(Ei) / [P(F|E1)P(E1) + P(F|E2)P(E2) + ··· + P(F|En)P(En)].   (2.4)

Proof. To obtain Equality 2.3, we first use the definition of conditional probability as follows:

P(E|F) = P(E ∩ F) / P(F)  and  P(F|E) = P(F ∩ E) / P(E).

Next we multiply each of these equalities by the denominator on its right side to show that

P(E|F)P(F) = P(F|E)P(E)

because they both equal P(E ∩ F). Finally, we divide this last equality by P(F) to obtain our result.

To obtain Equality 2.4, we place the expression for F, obtained using the rule of total probability (Equality 2.2), in the denominator of Equality 2.3.

Both of the formulas in the preceding theorem are called Bayes' Theorem because the original version was developed by Thomas Bayes (published in 1763). The first enables us to compute P(E|F) if we know P(F|E), P(E), and P(F), while the second enables us to compute P(Ei|F) if we know P(F|Ej) and P(Ej) for 1 ≤ j ≤ n. The next example illustrates the use of Bayes' Theorem.


... which is the same value we get by computing P(Black|A) directly.

In the previous example we can just as easily compute P(Black|A) directly. We will see a useful application of Bayes' Theorem in Section 2.4.
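Both forms of Bayes' Theorem are easy to exercise numerically. The following Python sketch (an illustration, not from the book) implements Equality 2.3 and Equality 2.4 directly; the numbers in the demonstration at the bottom are hypothetical values chosen only to show the calls.

```python
def bayes_simple(p_f_given_e, p_e, p_f):
    """Equality 2.3: P(E|F) = P(F|E) P(E) / P(F)."""
    return p_f_given_e * p_e / p_f

def bayes_full(p_f_given_each, p_each, i):
    """Equality 2.4: P(Ei|F) for mutually exclusive and exhaustive events E1, ..., En.
    p_f_given_each[j] = P(F|Ej) and p_each[j] = P(Ej)."""
    # The denominator is the law of total probability (Equality 2.2).
    denominator = sum(pf * pe for pf, pe in zip(p_f_given_each, p_each))
    return p_f_given_each[i] * p_each[i] / denominator

# Hypothetical numbers: two exhaustive events E1, E2 with P(E1) = 0.3, P(E2) = 0.7,
# and P(F|E1) = 0.8, P(F|E2) = 0.1.
p_e = [0.3, 0.7]
p_f_given_e = [0.8, 0.1]
p_f = sum(pf * pe for pf, pe in zip(p_f_given_e, p_e))

print(bayes_full(p_f_given_e, p_e, 0))             # P(E1|F) via Equality 2.4
print(bayes_simple(p_f_given_e[0], p_e[0], p_f))   # the same value via Equality 2.3
```

The two printed values agree, which is exactly the point of the proof above: Equality 2.4 is Equality 2.3 with P(F) expanded by the law of total probability.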

2.2 Random Variables

In this section we present the formal definition and mathematical properties of a random variable. In Section 2.4 we show how they are developed in practice.

2.2.1 Probability Distributions of Random Variables

Definition 2.5 Given a probability space (Ω, P), a random variable X is a function whose domain is Ω. The range of X is called the space of X.

Example 2.9 Let Ω contain all outcomes of a throw of a pair of six-sided dice, and let P assign 1/36 to each outcome. Then Ω is the following set of ordered pairs:

Ω = {(1, 1), (1, 2), ..., (6, 6)}.

The space of X is {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, and that of Y is {odd, even}.


Example 2.10 Let Ω, P, and X be as in Example 2.9. Then

X = 3 represents the event {(1, 2), (2, 1)},

and P(X = 3) = 1/36 + 1/36 = 1/18.

X = x, Y = y represents the event

{e such that X(e) = x} ∩ {e such that Y(e) = y}.

Example 2.12 Let Ω, P, X, and Y be as in Example 2.9. Then

X = 4, Y = odd represents the event {(1, 3), (3, 1)},

and so

P(X = 4, Y = odd) = 1/18.


We call P(X = x, Y = y) the joint probability distribution of X and Y. If A = {X, Y}, we also call this the joint probability distribution of A. Furthermore, we often just say "joint distribution" or "probability distribution." For brevity, we often use x, y to represent the event X = x, Y = y, and so we write P(x, y) instead of P(X = x, Y = y). This concept extends to three or more random variables. For example, P(X = x, Y = y, Z = z) is the joint probability distribution function of the random variables X, Y, and Z, and we often write P(x, y, z).

Example 2.13 Let Ω, P, X, and Y be as in Example 2.9. Then if x = 4 and y = odd,

P(x, y) = P(X = x, Y = y) = 1/18.

If we want to refer to all values of, for example, the random variables X and Y, we sometimes write P(X, Y) instead of P(X = x, Y = y) or P(x, y).

Example 2.14 Let Ω, P, X, and Y be as in Example 2.9. It is left as an exercise to show that for all values of x and y we have

In Equality 2.5 the probability distribution P(X = x) is called the marginal probability distribution of X relative to the joint distribution P(X = x, Y = y) because it is obtained using a process similar to adding across a row or column in a table of numbers. This concept also extends in a straightforward way to three or more random variables. For example, if we have a joint distribution P(X = x, Y = y, Z = z) of X, Y, and Z, the marginal distribution P(X = x, Y = y) of X and Y is obtained by summing over all values of Z. If A = {X, Y}, we also call this the marginal probability distribution of A. The next example reviews the concepts covered so far concerning random variables.

Example 2.18 Let Ω be a set of 12 individuals, and let P assign 1/12 to each. (The list of individuals and the definitions of the random variables S, H, and W are not legible in this extraction.) The joint distribution of S and H is as follows:

s        h     P(s, h)
female   64    1/3
(remaining rows not legible in this extraction)

The table that follows shows the first few values in the joint distribution of S, H, and W. There are 18 values in all, many of which are 0.

s        h     w        P(s, h, w)
female   64    30,000   1/6
female   64    40,000   1/6
female   64    50,000   0
female   68    30,000   1/12

We close with the chain rule for random variables, which says that given n random variables X1, X2, ..., Xn defined on the same sample space Ω,

P(x1, x2, ..., xn) = P(xn|xn-1, xn-2, ..., x1) × ··· × P(x2|x1) × P(x1)

whenever P(x1, x2, ..., xn) ≠ 0. It is straightforward to prove this rule using the rule for conditional probability.
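The joint-marginal relationship and the chain rule can be checked by enumeration. In the sketch below (an illustration, not from the book) the sample space is the 36 equally likely ordered pairs from Example 2.9; since the book's definitions of X and Y are not reproduced in this extraction, the sketch assumes X is the total of the two dice, consistent with the space {2, 3, ..., 12} given there, and defines a second variable Y purely for illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 ordered pairs, each with probability 1/36
p_outcome = Fraction(1, 36)

def X(e):          # assumed definition: the total of the two dice
    return e[0] + e[1]

def Y(e):          # illustrative variable: parity of the first die (the book's Y may differ)
    return "odd" if e[0] % 2 == 1 else "even"

# Joint distribution P(X = x, Y = y), obtained by summing the probabilities of outcomes.
joint = {}
for e in outcomes:
    key = (X(e), Y(e))
    joint[key] = joint.get(key, Fraction(0)) + p_outcome

# Marginal distribution of X: sum the joint distribution over all values of Y.
marginal_x = {}
for (x, y), p in joint.items():
    marginal_x[x] = marginal_x.get(x, Fraction(0)) + p

print(marginal_x[4])                                  # 1/12, i.e., the three outcomes (1,3), (2,2), (3,1)

# Chain rule check for one pair of values: P(x, y) = P(y|x) P(x).
x, y = 4, "odd"
p_y_given_x = joint[(x, y)] / marginal_x[x]
print(joint[(x, y)] == p_y_given_x * marginal_x[x])   # True
```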

The notion of independence extends naturally to random variables.

Definition 2.6 Suppose we have a probability space (Ω, P) and two random variables X and Y defined on Ω. Then X and Y are independent if, for all values x of X and y of Y, the events X = x and Y = y are independent. When this is the case, we write

IP(X, Y),

where IP stands for independent in P.

Example 2.20 Let Ω be the set of all cards in an ordinary deck, and let P assign 1/52 to each card. Define random variables as follows:

Variable   Value   Outcomes Mapped to This Value
R          r1      All royal cards
           r2      All nonroyal cards
S          s1      All spades
           s2      All nonspades

Then the random variables R and S are independent. That is,

IP(R, S).
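This independence can be verified exhaustively, since there are only four combinations of values to check against the product rule. The Python sketch below (an illustration, not from the book) does this for the deck of Example 2.20; it takes the royal cards to be the jacks, queens, and kings.

```python
from fractions import Fraction

ranks = ["ace", "2", "3", "4", "5", "6", "7", "8", "9", "10", "jack", "queen", "king"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = [(r, s) for r in ranks for s in suits]

def R(card):   # r1 = royal card, r2 = nonroyal card
    return "r1" if card[0] in ("jack", "queen", "king") else "r2"

def S(card):   # s1 = spade, s2 = nonspade
    return "s1" if card[1] == "spades" else "s2"

def prob(predicate):
    """Probability of the event defined by a predicate, under the Principle of Indifference."""
    return Fraction(sum(1 for c in deck if predicate(c)), len(deck))

independent = all(
    prob(lambda c: R(c) == r and S(c) == s) == prob(lambda c: R(c) == r) * prob(lambda c: S(c) == s)
    for r in ("r1", "r2")
    for s in ("s1", "s2")
)
print(independent)   # True: P(R = r, S = s) = P(R = r) P(S = s) for every combination
```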

Definition 2.7 Suppose we have a probability space (Ω, P) and three random variables X, Y, and Z defined on Ω. Then X and Y are conditionally independent given Z if, for all values x of X, y of Y, and z of Z, whenever P(z) ≠ 0, the events X = x and Y = y are conditionally independent given the event Z = z. When this is the case, we write

IP(X, Y|Z).

Example 2.21 Let Ω again be the set of all objects in Figure 2.1, with probability 1/13 assigned to each, and define random variables as follows:

Variable   Value   Outcomes Mapped to This Value
L          l1      All objects containing an "A"
           l2      All objects containing a "B"
S          s1      All square objects
           s2      All circular objects
C          c1      All black objects
           c2      All white objects

Then L and S are conditionally independent given C. That is,

IP(L, S|C).

It is left as an exercise to show that it holds for the other combinations.

Independence and conditional independence can also be defined for sets of random variables.


Definition 2.8 Suppose we have a probability space (Ω, P) and two sets A and B containing random variables defined on Ω. Let a and b be sets of values of the random variables in A and B, respectively. The sets A and B are said to be independent if, for all values of the variables in the sets a and b, the events A = a and B = b are independent. When this is the case, we write

IP(A, B),

where IP stands for independent in P.

Example 2.22 Let Ω be the set of all cards in an ordinary deck, and let P assign 1/52 to each card. Define random variables as follows:

Variable   Value   Outcomes Mapped to This Value
R          r1      All royal cards
           r2      All nonroyal cards
T          t1      All tens and jacks
           t2      All cards that are neither tens nor jacks
S          s1      All spades
           s2      All nonspades

Then the sets {R, T} and {S} are independent. That is,

IP({R, T}, {S}).

It is left as an exercise to show that it holds for the other combinations.

When a set contains a single variable, we do not ordinarily show the braces. For example, we write Independency 2.7 as

IP({R, T}, S).

Definition 2.9 Suppose we have a probability space (Ω, P) and three sets A, B, and C containing random variables defined on Ω. Let a, b, and c be sets of values of the random variables in A, B, and C, respectively. Then the sets A and B are said to be conditionally independent given the set C if, for all values of the variables in the sets a, b, and c, whenever P(c) ≠ 0, the events A = a and B = b are conditionally independent given the event C = c. When this is the case, we write

IP(A, B|C).


Figure 2.2: Objects with five properties

Example 2.23 Suppose we use the Principle of Indifference to assign probabilities to the objects in Figure 2.2, and define random variables as follows:

Variable   Value   Outcomes Mapped to This Value
V          v1      All objects containing a "1"
           v2      All objects containing a "2"
L          l1      All objects covered with lines
           l2      All objects not covered with lines
C          c1      All grey objects
           c2      All white objects
S          s1      All square objects
           s2      All circular objects
F          f1      All objects containing a number in a large font
           f2      All objects containing a number in a small font

It is left as an exercise to show that, for all values of v, l, c, s, and f, the events V = v, L = l and S = s, F = f are conditionally independent given C = c. So we have

IP({V, L}, {S, F}|C).

2.3 The Meaning of Probability

When one does not have the opportunity to study probability theory in depth, one is often left with the impression that all probabilities are computed using ratios. Next, we discuss the meaning of probability in more depth and show that this is not how probabilities are ordinarily determined.


A classic textbook example of probability concerns tossing a coin. Because the coin is symmetrical, we use the Principle of Indifference to assign

P(Heads) = P(Tails) = .5.

Suppose instead we toss a thumbtack. It can also land one of two ways. That is, it could land on its flat end, which we will call "heads," or it could land with the edge of the flat end and the point touching the ground, which we will call "tails." Because the thumbtack is not symmetrical, we have no reason to apply the Principle of Indifference and assign probabilities of .5 to both outcomes. How then should we assign the probabilities? In the case of the coin, when we assign P(heads) = .5, we are implicitly assuming that if we tossed the coin a large number of times it would land heads about half the time. That is, if we tossed the coin 1000 times, we would expect it to land heads about 500 times. This notion of repeatedly performing the experiment gives us a method for computing (or at least estimating) the probability. That is, if we repeat an experiment many times, we are fairly certain that the probability of an outcome is about equal to the fraction of times the outcome occurs. For example, a student tossed a thumbtack 10,000 times and it landed heads 3761 times. So

P(Heads) ≈ 3761/10,000 = .3761.

Indeed, in 1919 Richard von Mises used the limit of this fraction as the definition of probability. That is, if n is the number of tosses and Sn is the number of times the thumbtack lands heads, then

P(Heads) = lim (n→∞) Sn / n.

This approach to probability is called the relative frequency approach to probability, and probabilities obtained using this approach are called relative frequencies. A frequentist is someone who feels this is the only way we can obtain probabilities. Note that, according to this approach, we can never know a probability for certain. For example, if we tossed a coin 10,000 times and it landed heads 4991 times, we would estimate

P(Heads) ≈ 4991/10,000 = .4991.


On the other hand, if we used the Principle of Indifference, we would assign P(Heads) = .5. In the case of the coin, the probability may not actually be .5 because the coin may not be perfectly symmetrical. For example, Kerrich [1946] found that the six came up the most in the toss of a die and that one came up the least. This makes sense because, at that time, the spots on the die were hollowed out of the die. So the die was lightest on the side with a six. On the other hand, in experiments involving cards or urns, it seems we can be certain of probabilities obtained using the Principle of Indifference.

Example 2.24 Suppose we toss an asymmetrical six-sided die, and in 1000 tosses we observe the six sides coming up the following number of times: (the table of counts and the rest of this example are not legible in this extraction)

Example 2.25 Suppose our population is all males in the United States between the ages of 31 and 85, and we are interested in the probability of such males having high blood pressure. Then if we sample 10,000 males, this set of males is our sample. Furthermore, if 3210 have high blood pressure, we estimate

P(High Blood Pressure) ≈ 3210/10,000 = .321.

... in the group that have high blood pressure. In theory, we would have to have an infinite number of males to determine the probability exactly. The current set of males in this age group is called a finite population. The fraction of them with high blood pressure is the probability of obtaining a male with high blood pressure when we sample him from the set of all males in the age group. This latter probability is simply the ratio of males with high blood pressure. When doing statistical inference, we sometimes want to estimate the ratio in a finite population from a sample of the population, and other times we want to estimate a propensity from a finite sequence of observations. For example, TV raters ordinarily want to estimate the actual fraction of people in a nation watching a show from a sample of those people. On the other hand, medical scientists want to estimate the propensity with which males tend to have high blood pressure from a finite sequence of males. One can create an infinite sequence from a finite population by returning a sampled item back to the population before sampling the next item. This is called "sampling with replacement." In practice, it is rarely done, but ordinarily the finite population is so large that statisticians make the simplifying assumption that it is done. That is, they do not replace the item, but still assume the ratio is unchanged for the next item sampled.

When sampling, the observed relative frequency is called the maximum likelihood estimate of the probability (limit of the relative frequency) because it is the estimate of the probability that makes the observed sequence most probable when we assume the trials (repetitions of the experiment) are probabilistically independent.

Example 2.26 Suppose we toss a thumbtack four times and we observe the sequence [heads, tails, heads, heads]. Then the maximum likelihood estimate of P(Heads) is 3/4 = .75.
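A quick computation shows where the .75 comes from, and it also mirrors the earlier thumbtack experiment: the maximum likelihood estimate is just the observed relative frequency. The Python sketch below is an illustration, not from the book; the simulation at the end assumes a hypothetical true propensity of .3761 purely to show that the estimate approaches it as the number of tosses grows.

```python
import random

def mle_heads(sequence):
    """Maximum likelihood estimate of P(heads): the observed relative frequency."""
    return sequence.count("heads") / len(sequence)

print(mle_heads(["heads", "tails", "heads", "heads"]))   # 0.75, as in Example 2.26

# Relative frequency estimates from simulated tosses of a thumbtack whose
# (hypothetical) true propensity to land heads is .3761.
random.seed(0)
for n in (100, 10_000, 1_000_000):
    tosses = ["heads" if random.random() < 0.3761 else "tails" for _ in range(n)]
    print(n, mle_heads(tosses))
```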
