Building Probabilistic Graphical Models with Python



Solve machine learning problems using probabilistic graphical models implemented in Python with real-world applications

Kiran R Karkera

BIRMINGHAM - MUMBAI


All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: June 2014


Mariammal Chettiyar Hemangini Bari

Graphics

Disha Haria Yuvraj Mannari Abhinash Sahu

Production Coordinator

Alwin Roy

Cover Work

Alwin Roy


About the Author

Kiran R Karkera is a telecom engineer with a keen interest in machine learning. He has been programming professionally in Python, Java, and Clojure for more than 10 years. In his free time, he can be found attempting machine learning competitions at Kaggle and playing the flute.

I would like to thank the maintainers of Libpgm and OpenGM libraries, Charles Cabot and Thorsten Beier, for their help with the code reviews.


About the Reviewers

Mohit Goenka graduated from the University of Southern California (USC) with a Master's degree in Computer Science. His thesis focused on game theory and human behavior concepts as applied in real-world security games. He also received an award for academic excellence from the Office of International Services at the University of Southern California. He has showcased his presence in various realms of computers including artificial intelligence, machine learning, path planning, multiagent systems, neural networks, computer vision, computer networks, and operating systems.

During his tenure as a student, Mohit won multiple competitions cracking codes and presented his work on Detection of Untouched UFOs to a wide audience. Not only is he a software developer by profession, but coding is also his hobby. He spends most of his free time learning about new technology and grooming his skills.

What adds a feather to Mohit's cap is his poetic skills. Some of his works are part of the University of Southern California libraries archived under the cover of the Lewis Carroll Collection. In addition to this, he has made significant contributions by volunteering to serve the community.

Shangpu Jiang is doing his PhD in Computer Science at the University of Oregon. He is interested in machine learning and data mining and has been working in this area for more than six years. He received his Bachelor's and Master's degrees from China.


Science at the University of Oregon. He is a member of the OSIRIS lab. His research direction involves system security, embedded system security, trusted computing, and static analysis for security and virtualization. He is interested in Linux kernel hacking and compilers. He also spent a year on AI and machine learning direction and taught the classes Intro to Problem Solving using Python and Operating Systems in the Computer Science department. Before that, he worked as a software developer in the Linux Control Platform (LCP) group at the Alcatel-Lucent (former Lucent Technologies) R&D department for around four years. He got his Bachelor's and Master's degrees from EE in China.

Thanks to the author of this book who has done a good job for both Python and PGM; thanks to the editors of this book, who have made this book perfect and given me the opportunity to review such a nice book.

Xiao Xiao is a PhD student studying Computer Science at the University of Oregon. Her research interests lie in machine learning, especially probabilistic graphical models. Her previous project was to compare two inference algorithms' performance on a graphical model (relational dependency network).


Table of Contents

Preface
D-separation
Summary
Chapter 3: Undirected Graphical Models
Summary
Effects of data fragmentation on parameter estimation
Summary
Chapter 6: Exact Inference Using Graphical Models
Learning the induced width from the graph structure
Summary
Chapter 7: Approximate Inference Methods
Visualizing unary and pairwise factors on a 3 x 3 grid
The Markov Chain Monte Carlo sampling process
Summary
Index

In this book, we start with an exploratory tour of the basics of graphical models, their types, why they are used, and what kind of problems they solve. We then explore subproblems in the context of graphical models, such as their representation, building them, learning their structure and parameters, and using them to answer our inference queries.

This book attempts to give just enough information on the theory, and then use code samples to peep under the hood to understand how some of the algorithms are implemented. The code sample also provides a handy template to build graphical models and answer our probability queries. Of the many kinds of graphical models described in the literature, this book primarily focuses on discrete Bayesian networks, with occasional examples from Markov networks.

What this book covers

Chapter 1, Probability, covers the concepts of probability required to understand the graphical models.

Chapter 2, Directed Graphical Models, provides information about Bayesian networks, their properties related to independence, conditional independence, and D-separation. This chapter uses code snippets to load a Bayes network and understand its independence properties.

Chapter 3, Undirected Graphical Models, covers the properties of Markov networks, how they are different from Bayesian networks, and their independence properties.

Chapter 4, Structure Learning, covers multiple approaches to infer the structure of the Bayesian network using a dataset. We also learn the computational complexity of structure learning and use code snippets in this chapter to learn the structures given in the sampled datasets.

Chapter 5, Parameter Learning, covers the maximum likelihood and Bayesian approaches to parameter learning with code samples from PyMC.

Chapter 6, Exact Inference Using Graphical Models, explains the Variable Elimination algorithm for accurate inference and explores code snippets that answer our inference queries using the same algorithm.

Chapter 7, Approximate Inference Methods, explores the approximate inference for networks that are too large to run exact inferences on. We will also go through the code samples that run approximate inferences using loopy belief propagation on Markov networks.

Appendix, References, includes all the links and URLs that will help to easily understand the chapters in the book.

What you need for this book

To run the code samples in the book, you'll need a laptop or desktop with IPython installed. We use several software packages in this book; most of them can be installed using a Python installation tool such as pip or easy_install. In some cases, the software needs to be compiled from the source and may require a C++ compiler.

Who this book is for

This book is aimed at developers conversant with Python and who wish to explore the nuances of graphical models using code samples.

This book is also ideal for students who have been theoretically introduced to graphical models and wish to realize the implementations of graphical models and get a feel for the capabilities of different (graphical model) libraries to deal with real-world models.

Machine-learning practitioners familiar with classification and regression models and who wish to explore and experiment with the types of problems graphical models can solve will also find this book an invaluable resource.

This book looks at graphical models as a tool that can be used to solve problems in the machine-learning domain. Moreover, it does not attempt to explain the mathematical underpinnings of graphical models or go into details of the steps for each algorithm used.


In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can do the same by creating a TfidfVectorizer object."

A block of code is set as follows:

clf = MultinomialNB(alpha=.01)
print "CrossValidation Score: ", np.mean(cross_validation.cross_val_score(clf, vectors, newsgroups.target, scoring='f1'))
CrossValidation Score: 0.954618416381

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.


Probability

Before we embark on the journey through the land of graphical models, we must equip ourselves with some tools that will aid our understanding. We will first start with a tour of probability and its concepts such as random variables and the types of distributions.

We will then try to understand the types of questions that probability can help us answer and the multiple interpretations of probability. Finally, we will take a quick look at the Bayes rule, which helps us understand the relationships between probabilities, and also look at the accompanying concepts of conditional probabilities and the chain rule.

The theory of probability

We often encounter situations where we have to exercise our subjective belief about an event's occurrence; for example, events such as weather or traffic that are inherently stochastic. Probability can also be understood as a degree of belief: we assign a probability value to each outcome to encapsulate our degree of belief in those outcomes. An example of the notation used to express our belief is P(rainy) = 0.3, which can be read as the probability of rain is 0.3 or 30 percent.


The axioms of probability that have been formulated by Kolmogorov are stated as follows:

• The probability of an event is a non-negative real number (that is, the probability that it will rain today may be small, but nevertheless will be greater than or equal to 0). This is explained in mathematical terms as follows:

$P(E) \in \mathbb{R},\ P(E) \ge 0 \quad \forall E \in F$, where $F$ is the event space

• The probability of the occurrence of some event in the sample space is 1 (that is, if the weather events in our sample space are rainy, sunny, and cloudy, then one of these events has to occur), as shown in the following formula:

$P(\Omega) = 1$, where $\Omega$ is the sample space

• The probability of the union of mutually exclusive events is the sum of their individual probabilities, as given in the following formula:

$P(E_1 \cup E_2 \cup \dots) = \sum_{i} P(E_i)$, where $E_1, E_2, \dots$ are mutually exclusive events

The notion of a fair coin translates to the fact that the controlling parameter has a value of 0.5 in favor of heads, which also translates to the fact that we assume all the outcomes to be equally likely. Later in the book, we shall examine how many parameters are required to completely specify a probability distribution. However, we are getting ahead of ourselves. First, let's learn about probability distributions.

A probability distribution consists of the probabilities associated with each measurable outcome. In the case of a discrete outcome (such as a throw of a die or a coin flip), the distribution is specified by a probability mass function, and in the case of a continuous outcome (such as the height of students in a class), it is specified by a probability density function.

Let us see discrete distributions with an example. A coin flip has two outcomes: heads and tails, and a fair coin assigns equal probabilities to all outcomes. This means that the probability distribution is simple: for heads, it is 0.5 and for tails, it is 0.5. A distribution with unequal probabilities (for example, heads 0.3 and tails 0.7) would be the one that corresponds to a biased coin. The following graph shows the discrete probability distribution for the sum of values when two dice are thrown:
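The distribution of the sum of two dice can also be computed directly. The following is a minimal sketch, assuming two fair six-sided dice; the variable names are our own and the text bars are only a rough stand-in for a plot:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair six-sided dice
# and count how often each sum occurs.
faces = range(1, 7)
counts = Counter(a + b for a, b in product(faces, repeat=2))

# Probability mass function: P(sum = s) = (number of ways to get s) / 36
pmf = {s: Fraction(n, 36) for s, n in sorted(counts.items())}

for s, p in pmf.items():
    # One '#' per way of rolling the sum, as a crude text histogram.
    print(f"P(sum={s:2d}) = {str(p):>4}  {'#' * counts[s]}")
```

The probabilities rise towards a sum of 7 and fall off symmetrically on either side, so this distribution is not uniform even though each individual die is.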


A distribution that assigns equal probabilities to all outcomes is called a uniform distribution. This is one of the many distributions that we will explore.

Let's look at one of the common distributions associated with continuous outcomes, that is, the Gaussian or normal distribution, which is in the shape of a bell and hence called a bell curve (though there are other distributions whose shapes are similar to the bell shape). The following are some examples from the real world:

• Heights of students in a class are log-normally distributed (if we take the logarithm of the heights of students and plot it, the resulting distribution is normally distributed)

• Measurement errors in physical experiments

A Gaussian distribution has two parameters: mean ($\mu$) and variance ($\sigma^2$). The mean determines the middle point of the distribution, and the variance determines its dispersion away from the mean.


The following graph shows multiple Gaussian distributions with different values of mean and variance. It can be seen that the more variance there is, the broader the distribution, whereas the value of the mean shifts the peak on the x axis:
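The effect of the two parameters can be checked numerically. The following is a small sketch that evaluates the Gaussian density formula by hand; the function name `gaussian_pdf` and the parameter values are our own choices for illustration:

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Same mean, different variances: a larger variance lowers and widens the peak.
narrow_peak = gaussian_pdf(0.0, mean=0.0, variance=0.5)
wide_peak = gaussian_pdf(0.0, mean=0.0, variance=2.0)

# Same variance, different mean: the peak keeps its height but moves along the x axis.
shifted_peak = gaussian_pdf(3.0, mean=3.0, variance=0.5)

print(narrow_peak, wide_peak, shifted_peak)
```

The density at the peak is higher for the smaller variance because the same total probability mass of 1 is concentrated in a narrower region.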

Goals of probabilistic inference

Now that we have understood the concept of probability, we must ask ourselves how this is used. The kind of questions that we ask fall into the following categories:

• The first question is parameter estimation, such as, is a coin biased or fair? And if biased, what is the value of the parameter?

• The second question is that given the parameters, what is the probability of the data? For example, what is the probability of five heads in a row if we flip a coin where the bias (or parameter) is known?

The preceding questions depend on the data (or lack of it). If we have a set of observations of a coin flip, we can estimate the controlling parameter (that is, parameter estimation). If we have an estimate of the parameter, we would like to estimate the probability of the data generated by the coin flips (the second question). Then, there are times when we go back and forth to improve the model.

• The third question that we may enquire about is whether the model is well-suited to the problem. Is there a single parameter that controls the results of the coin flipping experiment? When we wish to model a complicated phenomenon (such as the traffic or weather prediction), there certainly exist several parameters in the model, where hundreds or even thousands of parameters are not unusual. In such cases, the question that we're trying to ask is, which model fits the data better? We shall see some examples in the later chapters on different aspects of model fit.
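The three questions can be sketched for the coin example as follows. This is only an illustration under our own assumptions: the observed flips are made-up data, the helper names are hypothetical, and the estimate used is the simple relative frequency of heads (which is the maximum likelihood estimate for a coin):

```python
def estimate_bias(flips):
    """Relative frequency of heads in a sequence of 'H'/'T' flips."""
    return flips.count("H") / len(flips)

def probability_of_sequence(flips, p_heads):
    """Probability of observing this exact sequence given the bias p_heads."""
    prob = 1.0
    for flip in flips:
        prob *= p_heads if flip == "H" else (1.0 - p_heads)
    return prob

# Question 1: parameter estimation from hypothetical observations.
observed = list("HHTHHHTHHT")  # 7 heads, 3 tails (made-up data)
p_hat = estimate_bias(observed)

# Question 2: probability of the data given a known parameter.
five_heads_fair = probability_of_sequence("HHHHH", 0.5)

# Question 3: which parameter value explains the observations better?
fit_fair = probability_of_sequence(observed, 0.5)
fit_estimated = probability_of_sequence(observed, p_hat)

print(p_hat, five_heads_fair, fit_fair, fit_estimated)
```

Comparing `fit_fair` with `fit_estimated` is the simplest form of asking which model fits the data better; with only ten flips, the comparison should be read with caution.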


Conditional probability

Let us use a concrete example, where we have a population of candidates who are applying for a job. One event ($x$) could be the set of all candidates who get an offer, whereas another event ($y$) could be the set of all highly experienced candidates. We might want to reason about the conjoint event ($x \cap y$), which is the set of experienced candidates who got an offer (the probability of a conjoint event $P(x \cap y)$ is also written as $P(x, y)$). The question that arises is: if we know that one event has occurred, does it change the probability of occurrence of the other event? In this case, if we know for sure that a candidate got an offer, what does it tell us about their experience?

Conditional probability is formally defined as $P(x \mid y) = \frac{P(x \cap y)}{P(y)}$, which can be read as the probability of $x$ given that $y$ occurred. The denominator $P(y)$ is the sum over all possible outcomes of the joint distribution with the value of $x$ summed out, that is, $\sum_x P(x, y) = P(y)$.
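The definition can be applied directly to a small joint table. In the following sketch, the joint probabilities for the offer and experience events are hypothetical numbers chosen only for illustration, and the function names are our own:

```python
# Hypothetical joint distribution P(x, y) over two binary events:
#   x = candidate gets an offer, y = candidate is highly experienced.
joint = {
    (True, True): 0.20,    # offer, experienced
    (True, False): 0.05,   # offer, not experienced
    (False, True): 0.20,   # no offer, experienced
    (False, False): 0.55,  # no offer, not experienced
}

def marginal_y(y):
    """P(y), obtained by summing x out of the joint distribution."""
    return sum(p for (_, yv), p in joint.items() if yv == y)

def conditional_x_given_y(x, y):
    """P(x | y) = P(x, y) / P(y)."""
    return joint[(x, y)] / marginal_y(y)

p_offer = sum(p for (xv, _), p in joint.items() if xv)
p_offer_given_experienced = conditional_x_given_y(True, True)
print(p_offer, p_offer_given_experienced)
```

With these made-up numbers, knowing that a candidate is highly experienced raises the probability of an offer from 0.25 to 0.5, which is exactly the kind of change in belief that conditioning captures.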

The chain rule

The chain rule allows us to calculate the joint distribution of a set of random variables using their conditional probabilities. In other words, the joint distribution is the product of individual conditional probabilities. Since $P(x \cap y) = P(x)\,P(y \mid x)$, if $a_1, a_2, \dots, a_n$ are events, then

$P(a_1 \cap \dots \cap a_n) = P(a_1)\,P(a_2 \mid a_1)\,P(a_3 \mid a_1 \cap a_2) \cdots P(a_n \mid a_1 \cap \dots \cap a_{n-1})$

We shall return to this in detail in graphical models, where the chain rule helps us decompose a big problem (computing the joint distribution) by splitting it into smaller problems (conditional probabilities).
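The chain rule can be verified numerically on a small example. The following sketch uses an arbitrary joint distribution over three binary variables (the weights are made up) and checks that the product of conditionals reproduces every joint entry; the helper names are our own:

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (a1, a2, a3),
# stored as {(a1, a2, a3): probability}. The weights are arbitrary.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {assignment: w / total
         for assignment, w in zip(product([0, 1], repeat=3), weights)}

NAMES = ("a1", "a2", "a3")

def prob(**fixed):
    """Marginal probability that the named variables take the given values."""
    return sum(p for assignment, p in joint.items()
               if all(assignment[NAMES.index(k)] == v for k, v in fixed.items()))

def chain_rule(a1, a2, a3):
    """P(a1) * P(a2 | a1) * P(a3 | a1, a2), each conditional from its definition."""
    p_a1 = prob(a1=a1)
    p_a2_given_a1 = prob(a1=a1, a2=a2) / p_a1
    p_a3_given_a1_a2 = prob(a1=a1, a2=a2, a3=a3) / prob(a1=a1, a2=a2)
    return p_a1 * p_a2_given_a1 * p_a3_given_a1_a2

for assignment, p in joint.items():
    print(assignment, round(p, 4), round(chain_rule(*assignment), 4))
```

The two printed columns agree for every assignment, because each conditional's denominator cancels the numerator of the factor before it.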

The Bayes rule

The Bayes rule is one of the foundations of probability theory, and we won't go into much detail here. It follows from the definition of conditional probability, as shown in the following formula:

$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$

From the formula, we can infer the following about the Bayes rule: we entertain prior beliefs about the problem we are reasoning about. This is simply called the prior term. When we start to see the data, our beliefs change, which gives rise to our final belief (called the posterior), as shown in the following formula:

$\text{posterior} \propto \text{prior} \times \text{likelihood}$

Let us see the intuition behind the Bayes rule with an example. Amy and Carl are standing at a railway station waiting for a train. Amy has been catching the same train every day for the past year, and it is Carl's first day at the station. What would be their prior beliefs about the train being on time?

Amy has been catching the train daily for the past year, and she has always seen the train arrive within two minutes of the scheduled departure time. Therefore, her strong belief is that the train will be at most two minutes late. Since it is Carl's first day, he has no idea about the train's punctuality. However, Carl has been traveling the world in the past year, and has been in places where trains are not known to be punctual. Therefore, he has a weak belief that the train could be even 30 minutes late.

On day one, the train arrives 5 minutes late. The effect this observation has on both Amy and Carl is different. Since Amy has a strong prior, her beliefs are modified a little bit to accept that the train can be as late as 5 minutes. Carl's beliefs now change in the direction that the trains here are rather punctual.

In other words, the posterior beliefs are influenced in multiple ways: when someone with a strong prior sees a few observations, their posterior belief does not change much as compared to their prior. On the other hand, when someone with a weak prior sees numerous observations (a strong likelihood), their posterior belief changes a lot and is influenced largely by the observations (likelihood) rather than their prior belief.

Let's look at a numerical example of the Bayes rule. D is the event that an athlete uses performance-enhancing drugs (PEDs). T is the event that the drug test returns positive. Throughout the discussion, we use the prime (') symbol to denote that the event didn't occur; for example, D' represents the event that the athlete didn't use PEDs.

P(D|T) is the probability that the athlete used PEDs given that the drug test returned positive. P(T|D) is the probability that the drug test returned positive given that the athlete used PEDs.

The lab doing the drug test claims that it can detect PEDs 90 percent of the time. We also learn that the false-positive rate (athletes whose tests are positive but did not use PEDs) is 15 percent, and that 10 percent of athletes use PEDs. What is the probability that an athlete uses PEDs if the drug test returned positive?

From the basic form of the Bayes rule, we can write the following formula:

$P(D \mid T) = \frac{P(T \mid D)\,P(D)}{P(T \mid D)\,P(D) + P(T \mid D')\,P(D')}$

• P(T|D): This is equal to 0.90

• P(T|D'): This is equal to 0.15 (the test that returns positive given that the athlete didn't use PEDs)

• P(D): This is equal to 0.10 (the proportion of athletes who use PEDs), so P(D') is equal to 0.90
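Plugging the stated rates into the Bayes rule gives the answer to the question. The following sketch uses only the three numbers given in the text; the variable names are our own:

```python
p_d = 0.10              # P(D): prior probability that an athlete uses PEDs
p_t_given_d = 0.90      # P(T|D): the test detects PEDs when they were used
p_t_given_not_d = 0.15  # P(T|D'): false-positive rate

# Law of total probability: P(T) = P(T|D)P(D) + P(T|D')P(D')
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)

# Bayes rule: P(D|T) = P(T|D)P(D) / P(T)
p_d_given_t = p_t_given_d * p_d / p_t
print(f"P(T) = {p_t:.3f}, P(D|T) = {p_d_given_t:.3f}")
```

With these numbers, P(T) is 0.225 and P(D|T) is 0.4. Even after a positive test, it is still more likely than not that the athlete did not use PEDs, because users are rare and false positives from the much larger group of non-users outnumber the true positives.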

Interpretations of probability

In the previous example, we noted how we have a prior belief and that the introduction of the observed data can change our beliefs. That viewpoint, however, is one of the multiple interpretations of probability.

The first one (which we have discussed already) is a Bayesian interpretation, which holds that probability is a degree of belief, and that the degree of belief changes before and after accounting for evidence.

The second view is called the Frequentist interpretation, where probability measures the proportion of outcomes and posits that the prior belief is an incorrect notion that is not backed up by data.


To illustrate this with an example, let's go back to the coin flipping experiment, where we wish to learn the bias of the coin. We run two experiments, where we flip the coin 10 times and 10000 times, respectively. In the first experiment, we get 7 heads and in the second experiment, we get 7000 heads.

From a Frequentist viewpoint, in both the experiments, the probability of getting heads is 0.7 (7/10 or 7000/10000). However, we can easily convince ourselves that we have a greater degree of belief in the outcome of the second experiment than that of the first experiment. From a Bayesian perspective, this is because if we had a prior belief, the second experiment's observations would overwhelm the prior, which is unlikely in the first experiment.
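One way to make this concrete is to encode the prior belief as a Beta distribution, which is a standard choice for coin-flip data but is our own assumption here rather than something specified in the text. The prior strength (equivalent to 50 heads and 50 tails already seen) is likewise a made-up value:

```python
def posterior_mean(heads, flips, prior_heads=1.0, prior_tails=1.0):
    """Posterior mean of P(heads) under a Beta(prior_heads, prior_tails) prior.

    With a Beta prior and coin-flip data, the posterior is
    Beta(prior_heads + heads, prior_tails + tails); this returns its mean.
    """
    tails = flips - heads
    return (prior_heads + heads) / (prior_heads + heads + prior_tails + tails)

# Hypothetical prior: a fairly firm belief that the coin is fair,
# equivalent to having already seen 50 heads and 50 tails.
prior = dict(prior_heads=50.0, prior_tails=50.0)

small_experiment = posterior_mean(7, 10, **prior)        # 57 / 110
large_experiment = posterior_mean(7000, 10000, **prior)  # 7050 / 10100
print(round(small_experiment, 3), round(large_experiment, 3))
```

After 10 flips the estimate stays close to the prior value of 0.5, whereas after 10000 flips it is close to the observed frequency of 0.7: the larger experiment overwhelms the prior.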

For the discussion in the following sections, let us consider an example of a company that is interviewing candidates for a job. Prior to inviting the candidate for an interview, the candidate is screened based on the amount of experience that the candidate has as well as the GPA score that the candidate received in his graduation results. If the candidate passes the screening, he is called for an interview. Once the candidate has been interviewed, the company may make the candidate a job offer (which is based on the candidate's performance in the interview). The candidate is also evaluating going for a postgraduate degree, and the candidate's admission to a postgraduate degree course of his choice depends on his grades in the bachelor's degree. The following diagram is a visual representation of our understanding of the relationships between the factors that affect the job selection (and postgraduate degree admission) criteria:

[Figure: a graph relating the nodes Experience, Degree score, Job Interview, Job Offer, and Postgraduate degree admission]


Random variables

The classical notion of a random variable is the one whose value is subject to variations due to chance (Wikipedia). Most programmers have encountered random numbers from standard libraries in programming languages. From a programmer's perspective, unlike normal variables, a random variable returns a new value every time its value is read, where the value of the variable could be the result of a new invocation of a random number generator.

We have seen the concept of events earlier, and how we could consider the probability of a single event occurring out of the set of measurable events. It may be suitable, however, to consider the attributes of an outcome.

In the candidate's job search example, one of the attributes of a candidate is his experience. This attribute can take multiple values such as highly relevant or not relevant. The formal machinery for discussing attributes and their values in different outcomes is called random variables [Koller et al, 2.1.3.1].

Random variables can take on categorical values (such as {Heads, Tails} for the outcomes of a coin flip) or real values (such as the heights of students in a class).
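The programmer's view of a random variable can be sketched with Python's standard `random` module. The function names, the seed, and the height parameters below are arbitrary choices for illustration only:

```python
import random

rng = random.Random(0)  # seeded only so that repeated runs are reproducible

def coin_flip():
    """A categorical random variable with values in {Heads, Tails}."""
    return rng.choice(["Heads", "Tails"])

def student_height_cm():
    """A real-valued random variable; the mean and spread are made-up numbers."""
    return rng.gauss(170.0, 10.0)

# Each read draws a fresh value from the underlying random number generator.
flips = [coin_flip() for _ in range(5)]
heights = [student_height_cm() for _ in range(5)]
print(flips)
print(heights)
```

Calling `coin_flip()` twice may give different results, which is what distinguishes it from reading an ordinary variable.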

Marginal distribution

We have seen that the job hunt example (described in the previous diagram) has five random variables. They are grades, experience, interview, offer, and admission. These random variables have a corresponding set of events.

Now, let us consider a subset X of random variables, where X contains only the Experience random variable. This subset contains the events highly relevant and not relevant.

If we were to list the probabilities of all the events in the subset X, it would be called a marginal distribution, an example of which can be found in the following formula:

$P(\text{Experience} = \text{'Highly relevant'}) = \ldots$, $P(\text{Experience} = \text{'Not relevant'}) = \ldots$

Like all valid distributions, the probabilities should sum up to 1.

The set of random variables (over which the marginal distribution is described) can contain just one variable (as in the previous example), or it could contain several variables, such as {Experience, Grades, Interview}.


Joint distribution

We have seen that the marginal distribution is a distribution that describes a subset of random variables. Next, we will discuss a distribution that describes all the random variables in the set. This is called a joint distribution. Let us look at the joint distribution that involves the Degree score and Experience random variables in the job hunt example.

Once the joint distribution is described, the marginal distribution can be found by summing up individual rows or columns. In the preceding table, if we sum up the columns, the first column gives us the probability for Highly relevant, and the second column for Not relevant. It can be seen that a similar tactic applied on the rows gives us the probabilities for degree scores.
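Summing rows and columns can be sketched as follows. The joint probabilities in this table are hypothetical values invented for illustration and are not the values from the book's table; the layout (degree scores as rows, experience levels as columns) follows the description above:

```python
# Hypothetical joint distribution P(Degree score, Experience).
# Rows are degree scores; columns follow the order in experience_levels.
experience_levels = ["Highly relevant", "Not relevant"]
joint = {
    "Grade A": [0.10, 0.10],
    "Grade B": [0.20, 0.25],
    "Grade C": [0.15, 0.20],
}

# Summing each row gives the marginal distribution over Degree score.
p_degree = {grade: sum(row) for grade, row in joint.items()}

# Summing each column gives the marginal distribution over Experience.
p_experience = {
    level: sum(row[i] for row in joint.values())
    for i, level in enumerate(experience_levels)
}

print(p_degree)
print(p_experience)
```

Both marginals sum to 1, as any valid distribution must, since each is just a different way of grouping the same six joint entries.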

We can define the concept of independence in multiple ways. Assume that we have two events a and b. They are independent if the probability of the conjunction of both events is simply the product of their probabilities, as shown in the following formula:

$P(a \cap b) = P(a)\,P(b)$

If we were to write the probability $P(a \cap b)$ as $P(a)\,P(b \mid a)$ (that is, the product of the probability of a and the probability of b given that the event a has happened), then if the events a and b are independent, it resolves to $P(a)\,P(b)$.

An alternate way of specifying independence is by saying that the probability of a given b is simply the probability of a, that is, the occurrence of b does not affect the probability of a, as shown in the following formula:

$P \models (a \perp b)$ if $P(a \mid b) = P(a)$ or if $P(b) = 0$

It can be seen that independence is symmetric, that is, $(a \perp b)$ implies $(b \perp a)$. Although this definition of independence is in the context of events, the same concept can be generalized to the independence of random variables.
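The product test for independence can be checked on a simple example. The following sketch uses events on a single roll of a fair six-sided die, which is our own choice of example and not one from the text:

```python
from fractions import Fraction

outcomes = range(1, 7)  # a fair six-sided die

def prob(event):
    """Probability of an event, given as a predicate on outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def independent(event_a, event_b):
    """True when P(a and b) equals P(a) * P(b)."""
    p_joint = prob(lambda o: event_a(o) and event_b(o))
    return p_joint == prob(event_a) * prob(event_b)

def is_even(o):
    return o % 2 == 0

def at_most_4(o):
    return o <= 4

def at_most_3(o):
    return o <= 3

print(independent(is_even, at_most_4))  # P(even and <=4) = 1/3 = 1/2 * 2/3
print(independent(is_even, at_most_3))  # P(even and <=3) = 1/6, but 1/2 * 1/2 = 1/4
```

Swapping the arguments of `independent` gives the same answer, which reflects the symmetry noted above.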

Conditional independence

Given two events, it is not always obvious whether they are independent or not. Consider a job applicant who applied for a job at two companies, Facebook and Google. It could result in two events, the first being that a candidate gets an interview call from Google, and another event that he gets an interview call from Facebook. Does knowing the outcome of the first event tell us anything about the probability of the second event? Yes, it does, because we can reason that if a candidate is smart enough to get a call from Google, he is a promising candidate, and that the probability of a call from Facebook is quite high.

What has been established so far is that the two events are not independent. Suppose we learn that the companies decide to send an interview invite based on the candidate's grades, and we learn that the candidate has an A grade, from which we infer that the candidate is fairly intelligent. We can reason that since the candidate is fairly smart, knowing that the candidate got an interview call from Google does not tell us anything more about his perceived intelligence, and that it doesn't change the probability of the interview call from Facebook. This can be formally annotated as the probability of a Facebook interview call, given a Google interview call AND grade A, is equal to the probability of a Facebook interview call given grade A, as shown in the following formula:

$P(\text{Facebook} \mid \text{Google}, \text{Grade A}) = P(\text{Facebook} \mid \text{Grade A})$

In other words, we are saying that the invite from Facebook is conditionally independent of the invite from Google, given the candidate has grade A.
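This reasoning can be checked on a toy model. In the following sketch, all the probabilities are hypothetical, and we assume that each company decides on the grade alone, using the same made-up policy; the helper `prob` is our own:

```python
from itertools import product

# Hypothetical model: each company decides on the grade alone.
p_grade = {"A": 0.3, "B": 0.7}
p_call_given_grade = {"A": 0.8, "B": 0.2}  # same policy for both companies

# Joint distribution P(grade, google, facebook) built from those assumptions.
joint = {}
for grade, google, facebook in product(p_grade, [True, False], [True, False]):
    p_g = p_call_given_grade[grade] if google else 1 - p_call_given_grade[grade]
    p_f = p_call_given_grade[grade] if facebook else 1 - p_call_given_grade[grade]
    joint[(grade, google, facebook)] = p_grade[grade] * p_g * p_f

def prob(grade=None, google=None, facebook=None):
    """Sum the joint over all entries consistent with the given values."""
    return sum(p for (gr, go, fb), p in joint.items()
               if grade in (None, gr) and google in (None, go) and facebook in (None, fb))

# Without knowing the grade, a Google call is informative about Facebook.
p_fb = prob(facebook=True)
p_fb_given_google = prob(google=True, facebook=True) / prob(google=True)

# Once the grade is known, the Google call adds nothing.
p_fb_given_a = prob(grade="A", facebook=True) / prob(grade="A")
p_fb_given_google_and_a = (prob(grade="A", google=True, facebook=True)
                           / prob(grade="A", google=True))

print(p_fb, p_fb_given_google, p_fb_given_a, p_fb_given_google_and_a)
```

The first two numbers differ (the events are dependent), while the last two are equal (the events are conditionally independent given the grade).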


Types of queries

Having learned about joint and conditional probability distributions, let us turn our attention to the types of queries we can pose to these distributions

Probability queries

This is the most common type of query, and it consists of the following two parts:

• The evidence: This is a subset E of random variables that have been observed

• The query: This is a subset Y of random variables

We wish to compute the value of the probability P(Y | E = e), which is the posterior probability, or the marginal probability over Y conditioned on the evidence. Using the job seeker example again, we can compute the marginal distribution over an interview call, conditioned on the fact that Degree score = Grade A.
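A probability query on a small joint table amounts to keeping the rows that agree with the evidence and renormalizing. The following sketch assumes a hypothetical joint distribution over Grade and Interview (the numbers and the function name probability_query are illustrative only).

```python
# Hypothetical joint distribution P(Grade, Interview); illustrative numbers only.
joint = {
    ("A", "call"): 0.24, ("A", "no call"): 0.06,
    ("B", "call"): 0.14, ("B", "no call"): 0.56,
}


def probability_query(joint, evidence_grade):
    """Return P(Interview | Grade = evidence_grade) as a dictionary."""
    # Keep only the rows that are consistent with the evidence E = e.
    consistent = {interview: p
                  for (grade, interview), p in joint.items()
                  if grade == evidence_grade}
    # Divide by P(E = e) so that the posterior sums to 1.
    p_evidence = sum(consistent.values())
    return {interview: p / p_evidence for interview, p in consistent.items()}


posterior = probability_query(joint, "A")
print(posterior)
```

The result is a full distribution over the query variable, not a single number, which is what distinguishes a probability query from the MAP query.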

MAP queries

A maximum a posteriori (MAP) query finds the highest-probability joint assignment to some subset of variables. In the case of the probability query, it is the value of the probability that matters. In the case of MAP, calculating the exact probability value of the joint assignment is secondary to the task of finding the joint assignment to all the random variables.

It is possible to return multiple joint assignments if their probability values are equal.

We shall see from an example that the assignment that picks the most probable value of each variable from its own marginal distribution may not be the most probable joint assignment (the following example is from Koller et al).

Consider two non-independent random variables X and Y, where Y is dependent on X. The following table shows the probability distribution over X:


We can see that the MAP assignment for the random variable X is X1, since it has the higher probability. The following table shows the joint distribution over X and Y:

In the joint distribution shown in the preceding table, the MAP assignment to the random variables (X, Y) is (X0, Y1), whereas the MAP assignment to X alone (X1) is not part of the MAP joint assignment. To sum up, the MAP assignment cannot be obtained by simply taking the maximum probability value in the marginal distribution for each random variable.
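The point can be reproduced in a few lines. The values below are assumptions chosen to be consistent with the conclusions stated in the text (X1 is the most probable value of X, while (X0, Y1) is the most probable joint assignment); they should not be read as the book's exact tables.

```python
# Illustrative values (assumed; consistent with the conclusions in the text).
p_x = {"X0": 0.4, "X1": 0.6}                  # P(X)
p_y_given_x = {                               # P(Y | X)
    "X0": {"Y0": 0.1, "Y1": 0.9},
    "X1": {"Y0": 0.5, "Y1": 0.5},
}

# Joint distribution P(X, Y) = P(X) * P(Y | X).
joint = {(x, y): p_x[x] * p_y_given_x[x][y]
         for x in p_x for y in p_y_given_x[x]}

map_x = max(p_x, key=p_x.get)          # most probable value of X on its own
map_xy = max(joint, key=joint.get)     # most probable joint assignment

print(map_x)    # X1
print(map_xy)   # ('X0', 'Y1')
```

The most probable value of X on its own (X1) does not appear in the most probable joint assignment, because X0 is paired with a very likely value of Y.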

A different type of MAP query is a marginal MAP query, where the query consists of only a subset of the variables, as opposed to all the variables in the joint distribution. In the previous example, a marginal MAP query would be MAP(Y), which is the most probable assignment to the random variable Y. It can be found by taking the joint distribution and summing out the values of X. From the following table, we can read the maximum value and determine that MAP(Y) is Y1:

Assignment Value
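Summing out a variable and then taking the maximum can be sketched as follows. The joint values are assumed for illustration (they agree with the conclusion that MAP(Y) is Y1), and marginal_map is a hypothetical helper name.

```python
from collections import defaultdict

# Illustrative joint distribution P(X, Y); the values are assumptions.
joint = {("X0", "Y0"): 0.04, ("X0", "Y1"): 0.36,
         ("X1", "Y0"): 0.30, ("X1", "Y1"): 0.30}


def marginal_map(joint, keep):
    """Sum out every variable except the one at position `keep`,
    then return the most probable value together with the marginal."""
    marginal = defaultdict(float)
    for assignment, p in joint.items():
        marginal[assignment[keep]] += p
    best = max(marginal, key=marginal.get)
    return best, dict(marginal)


best_y, p_y = marginal_map(joint, keep=1)
print(best_y, p_y)
```

Here the marginal over Y is obtained first, and only then is the maximum taken, which is the order of operations that defines a marginal MAP query.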


The data for the marginal query has been obtained from Querying Joint Probability Distributions by Sargur Srihari. You can find it at http://www.cedar.buffalo.edu/~srihari/CSE574/Chap8/Ch8-PGM-Directed/8.1.2-QueryingProbabilityDistributions.pdf

Summary

In this chapter, we looked at the concepts of basic probability, random variables, and the Bayes theorem. We also learned about the chain rule and joint and marginal distributions with the use of a candidate job search example, which we shall return to in the later chapters. Having obtained a good grasp of these topics, we can now move on to exploring Bayes and Markov networks in the forthcoming chapters, where we will formally describe these networks to answer some of the probability queries we discussed in this chapter. While this chapter was completely theoretical, from the next chapter, we shall implement Python code to seek answers to our questions.


Directed Graphical Models

In this chapter, we shall learn about directed graphical models, which are also known as Bayesian networks. We start with the what (the problem we are trying to solve), the how (graph representation), and the why (factorization and the equivalence of CPD and graph factorization), and then move on to using the Libpgm Python library to play with a small Bayes net.

In an adjacency matrix G, a 1 at position G(i,j) indicates an edge between the vertices i and j. In the case of a directed graph, a value of 1 or -1 indicates the direction of the edge.

In many cases, we are interested in graphs in which all the edges are either directed or undirected, leading to them being called directed graphs or undirected graphs, respectively.

The parents of a node V1 in a directed graph are the set of nodes that have outgoing edges that terminate at V1.

The children of the node V1 are the set of nodes that have incoming edges which leave V1.

The degree of a node is the number of edges it participates in.

A clique is a set of nodes where every pair of nodes is connected by an edge. A maximal clique is one that loses the clique property if any other node is added to it.


If there exists a path from a node that returns to itself after traversing other nodes, it is called a cycle or loop.

A Directed Acyclic Graph (DAG) is a directed graph with no directed cycles.

A Partially Directed Acyclic Graph (PDAG) is a graph that can contain both directed and undirected edges.

A forest is a set of trees.
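The graph terms above can be made concrete with a small sketch. It assumes one particular adjacency-matrix convention (G[i][j] == 1 means an edge from node i to node j, and 0 means no edge), which is a simplification of the 1 or -1 convention mentioned earlier. The function names are illustrative, and the example graph is the job interview network used later in this chapter.

```python
# Adjacency matrix for a small directed graph.
# Assumed convention: G[i][j] == 1 means there is an edge from node i to node j.
nodes = ["Experience", "Grades", "Interview", "Offer"]
G = [
    [0, 0, 1, 0],   # Experience -> Interview
    [0, 0, 1, 0],   # Grades -> Interview
    [0, 0, 0, 1],   # Interview -> Offer
    [0, 0, 0, 0],   # Offer has no outgoing edges
]


def parents(G, v):
    """Nodes with an edge that terminates at v."""
    return [i for i in range(len(G)) if G[i][v] == 1]


def children(G, v):
    """Nodes reached by an edge that leaves v."""
    return [j for j in range(len(G)) if G[v][j] == 1]


def degree(G, v):
    """Number of edges that v participates in."""
    return len(parents(G, v)) + len(children(G, v))


def is_acyclic(G):
    """Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
    The graph is a DAG exactly when every node can be removed this way."""
    indegree = [len(parents(G, v)) for v in range(len(G))]
    ready = [v for v, d in enumerate(indegree) if d == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for c in children(G, v):
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return removed == len(G)


interview = nodes.index("Interview")
print([nodes[i] for i in parents(G, interview)])   # ['Experience', 'Grades']
print([nodes[i] for i in children(G, interview)])  # ['Offer']
print(degree(G, interview), is_acyclic(G))         # 3 True
```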

Python packages can be installed using pip install <packagename> or easy_install <packagename>.

To use the code in this chapter, please install Libpgm (https://pypi.python.org/pypi/libpgm) and scipy (http://scipy.org/).

Independence and independent parameters

One of the key problems that graphical models solve is in defining the joint distribution. Let's take a look at the job interview example, where a candidate with a certain amount of experience and education is looking for a job. The candidate is also applying for admission to a higher education program.

We are trying to fully specify the joint distribution over the job offer, which (according to our intuition) depends on the outcome of the job interview, the candidate's experience, and his grades (we assume that the candidate's admission into a graduate school is not considered relevant for the job offer). The three random variables {Offer, Experience, Grades} take two values each (such as yes and no for the job offer, and highly relevant and not relevant for the job experience), and Interview takes on three values, so the joint distribution will be represented by a table that has 24 rows (that is, 2 x 2 x 2 x 3).


Each row contains a probability for the assignment of the random variables in that row. While different instantiations of the table might have different probability assignments, we will need 24 parameters (one for each row) to encode the information in the table. However, for calculation purposes, we will need only 23 independent parameters. Why do we remove one? Since the probabilities sum to 1, the last parameter can be calculated by subtracting the sum of the 23 parameters already found from 1.
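The row count and the number of independent parameters follow directly from the number of values each variable takes. A minimal sketch (the dictionary name cardinalities is illustrative):

```python
from functools import reduce
from operator import mul

# Number of values each random variable can take, as described in the text.
cardinalities = {"Experience": 2, "Grades": 2, "Offer": 2, "Interview": 3}

# One row per joint assignment of all the variables.
rows = reduce(mul, cardinalities.values(), 1)

# The probabilities sum to 1, so the last one is fixed by the others.
independent_parameters = rows - 1

print(rows, independent_parameters)   # 24 23
```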

Experience Grades Interview Offer Probability

It should be clear on observing the preceding table that acquiring a fully specified joint distribution is difficult due to the following reasons:

• It is too big to store and manipulate from a computational point of view

• We'll need large amounts of data for each assignment of the joint distribution to correctly elicit the probabilities

• The individual probabilities in a large joint distribution become vanishingly small and are no longer meaningful to human comprehension

How can we avoid having to specify the joint distribution? We can do this by using the concept of independent parameters, which we shall explore in the following example.


The joint distribution P over Grades (S) and Admission (A) is as follows (throughout this book, the superscripts 0 and 1 indicate low and high scores):

Grades Admission Probability (S,A)

P(S, A) = P(S) P(A | S)

The number of parameters required in the preceding formula is three: one parameter for P(S) and one each for P(A | S^0) and P(A | S^1). Since this is a simple distribution, the number of parameters required is the same for both the conditional and the joint parameterization, but let's observe the complete network to see if the conditional parameterization makes a difference:

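[Figure: Bayes network for the job interview example, with edges Experience → Interview, Grades → Interview, and Interview → Offer, and the following CPD tables]

Experience
E=0     E=1
0.6     0.4

Grades
G=0     G=1
0.7     0.3

Interview
        E=0, G=0    E=0, G=1    E=1, G=0    E=1, G=1
I=0     0.8         0.3         0.3         0.1
I=1     0.18        0.6         0.4         0.2
I=2     0.02        0.1         0.3         0.7

Offer
        I=0     I=1     I=2
O=0     0.9     0.4     0.01
O=1     0.1     0.6     0.99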


How do we calculate the number of parameters in the Bayes net in the preceding diagram? Let's go through each conditional probability table, one at a time. Experience and Grades take two values each, and therefore need one independent parameter each. The Interview table has 12 (3 x 4) entries. However, the distribution for each assignment of the parents sums up to 1, and therefore, we need only two independent parameters per parent assignment. The whole table needs 8 (2 x 4) independent parameters. Similarly, the Offer table has six entries, but only one independent parameter per value of Interview is required, which makes 3 (1 x 3) independent parameters. Therefore, the total number of independent parameters is 1 (Experience) + 1 (Grades) + 8 (Interview) + 3 (Offer) = 13, which is far fewer than the 23 independent parameters needed to fully specify the joint distribution. Therefore, the independence assumptions in the Bayesian network help us avoid specifying the full joint distribution.
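The per-table count can be written as a rule: each CPD needs (number of values of the variable minus 1) independent parameters for every assignment of its parents. The sketch below applies this rule to the job interview network; the dictionary and function names are illustrative.

```python
# Cardinalities and parent sets for the job interview network.
cardinality = {"Experience": 2, "Grades": 2, "Interview": 3, "Offer": 2}
parents = {
    "Experience": [],
    "Grades": [],
    "Interview": ["Experience", "Grades"],
    "Offer": ["Interview"],
}


def independent_parameters(node):
    """(values of node - 1) for each joint assignment of the node's parents."""
    parent_assignments = 1
    for parent in parents[node]:
        parent_assignments *= cardinality[parent]
    return (cardinality[node] - 1) * parent_assignments


counts = {node: independent_parameters(node) for node in cardinality}
total = sum(counts.values())

print(counts)
print(total)   # 13
```

Counting this way gives 1, 1, 8, and 3 independent parameters for Experience, Grades, Interview, and Offer respectively, compared with the 23 independent parameters of the full joint table.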

The Bayes network

A Bayes network is a structure that can be represented as a directed acyclic graph, and the data it contains can be seen from the following two points of view:

• It allows a compact and modular representation of the joint distribution using the chain rule for Bayes network

• It allows the conditional independence assumptions between vertices to

et al 3.2.1.1). One probability distribution each exists for Experience and Grades, and a conditional probability distribution (CPD) each exists for Interview and Offer. A CPD specifies a distribution over a random variable, given all the combinations of assignments to its parents. Thus, the modular representation for a given Bayes network is the set of CPDs for each random variable.

The conditional independence viewpoint flows from the edges that our intuition draws between different random variables, where we presume that a call for a job interview depends on a candidate's experience as well as the score he received in his degree course, and the probability of a job offer depends solely on the outcome of the job interview.


The chain rule

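The chain rule allows us to define the joint distribution as a product of factors. In the job interview example, using the chain rule for probability, we can write the following formula:

P(E, G, I, O) = P(E) P(G | E) P(I | E, G) P(O | E, G, I)

Using the conditional independence assumptions encoded in the network (Grades does not depend on Experience, and Offer depends only on Interview), this simplifies to:

P(E, G, I, O) = P(E) P(G) P(I | E, G) P(O | I)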
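As a sketch of how the factorization is used, the following code multiplies the four CPDs of the job interview network to obtain any entry of the joint distribution. The CPD values are read from the network figure earlier in this chapter; the variable and function names are illustrative, and plain dictionaries stand in for a library representation.

```python
from itertools import product

# CPD values as read from the job interview network figure.
p_e = {0: 0.6, 1: 0.4}                          # P(Experience)
p_g = {0: 0.7, 1: 0.3}                          # P(Grades)
p_i = {                                         # P(Interview | E, G)
    (0, 0): {0: 0.8, 1: 0.18, 2: 0.02},
    (0, 1): {0: 0.3, 1: 0.6, 2: 0.1},
    (1, 0): {0: 0.3, 1: 0.4, 2: 0.3},
    (1, 1): {0: 0.1, 1: 0.2, 2: 0.7},
}
p_o = {                                         # P(Offer | Interview)
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.4, 1: 0.6},
    2: {0: 0.01, 1: 0.99},
}


def joint(e, g, i, o):
    """P(E, G, I, O) = P(E) P(G) P(I | E, G) P(O | I)."""
    return p_e[e] * p_g[g] * p_i[(e, g)][i] * p_o[i][o]


# Build the full 24-row joint table from the four small CPDs.
table = {a: joint(*a) for a in product((0, 1), (0, 1), (0, 1, 2), (0, 1))}
total = sum(table.values())

print(len(table), total)
print(joint(1, 1, 2, 1))
```

The full table has 24 rows and sums to 1, even though only the much smaller CPD tables were specified.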

In this section, we shall look at different kinds of reasoning used in a Bayes network.

We shall use the Libpgm library to create a Bayes network. Libpgm reads the network information, such as the nodes, the edges, and the CPD probabilities associated with each node, from a JSON-formatted file. This JSON file is read into the NodeData and GraphSkeleton objects to create a discrete Bayesian network (which, as the name suggests, is a Bayes network where the CPDs take discrete values). The TableCPDFactorization object wraps the discrete Bayesian network and allows us to query the CPDs in the network. The JSON file for this example, job_interview.txt, should be placed in the same folder as the IPython Notebook so that it can be loaded automatically.

The following discussion uses integers 0, 1, and 2 for the discrete outcomes of each random variable, where 0 is the worst outcome. For example, Interview = 0 indicates the worst outcome of the interview and Interview = 2 is the best outcome.


Causal reasoning

The first kind of reasoning we shall explore is called causal reasoning. Initially, we observe the prior probability of an event unconditioned by any evidence (for this example, we shall focus on the Offer random variable). We then introduce observations of one of the parent variables. Consistent with our logical reasoning, we note that if one of the parents (equivalent to causes) of an event is observed, then we have stronger beliefs about the child random variable (Offer).

We start by defining a function that reads the JSON data file and creates an object we can use to run probability queries. The following code is from the Bayes net-Causal Reasoning.ipynb IPython Notebook:

from libpgm.graphskeleton import GraphSkeleton
from libpgm.nodedata import NodeData
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.tablecpdfactorization import TableCPDFactorization
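The body of the getTableCPD function used in the queries that follow is not shown here. A minimal sketch of what it might look like is given below, following the loading steps described earlier (NodeData and GraphSkeleton read from job_interview.txt, wrapped in a DiscreteBayesianNetwork and then in a TableCPDFactorization). The jsonpath parameter and the call to toporder are assumptions of this sketch, not necessarily the notebook's exact code.

```python
def getTableCPD(jsonpath="job_interview.txt"):
    """Load the network from a Libpgm JSON file and return a queryable object.

    Sketch only: the default path and the toporder() call are assumptions.
    """
    # Imported inside the function so that defining it does not require Libpgm.
    from libpgm.graphskeleton import GraphSkeleton
    from libpgm.nodedata import NodeData
    from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
    from libpgm.tablecpdfactorization import TableCPDFactorization

    nd = NodeData()
    skel = GraphSkeleton()
    nd.load(jsonpath)      # CPDs associated with each node
    skel.load(jsonpath)    # nodes and edges
    skel.toporder()        # order the nodes so that parents precede children
    bn = DiscreteBayesianNetwork(skel, nd)
    return TableCPDFactorization(bn)
```

A fresh object is created before each query in the text that follows, which this sketch supports by rebuilding the network on every call.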

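We can now use the specificquery function to run inference queries on the network we have defined. What is the prior probability of getting an offer, P(Offer = 1)? Note that the probability query takes two dictionary arguments: the first one being the query and the second being the evidence set, which is specified by an empty dictionary when there is no evidence, as shown in the following code:

tcpd=getTableCPD()
tcpd.specificquery(dict(Offer='1'),dict())

With the CPDs given earlier, this prior works out to about 0.43. We now introduce the evidence that the candidate's grades are poor and evaluate P(Offer = 1 | Grades = 0), as shown in the following code:

tcpd=getTableCPD()
tcpd.specificquery(dict(Offer='1'),dict(Grades='0'))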


The following is the output of the preceding code:

0.35148

As expected, it decreases the probability of getting an offer, since we reason that students with poor grades are unlikely to get an offer. Adding further evidence that the candidate's experience is low as well, we evaluate P(Offer = 1 | Grades = 0, Experience = 0), as shown in the following code:

tcpd=getTableCPD()
tcpd.specificquery(dict(Offer='1'),dict(Grades='0',Experience='0'))

With the CPDs given earlier, this evaluates to 0.2078 (0.8 x 0.1 + 0.18 x 0.6 + 0.02 x 0.99), which is lower still.

[Figure: two panels, Causal Reasoning and Evidential Reasoning, each drawn over the job interview network nodes (Experience, Degree score, Job Interview, Job Offer)]
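The causal queries can be cross-checked without Libpgm by summing the joint distribution directly. The following is a sketch under the assumption that the CPD values are those in the network figure earlier in this chapter; the query function is an illustrative brute-force enumeration, not the algorithm Libpgm uses internally.

```python
from itertools import product

# CPD values as read from the job interview network figure.
p_e = {0: 0.6, 1: 0.4}
p_g = {0: 0.7, 1: 0.3}
p_i = {
    (0, 0): {0: 0.8, 1: 0.18, 2: 0.02},
    (0, 1): {0: 0.3, 1: 0.6, 2: 0.1},
    (1, 0): {0: 0.3, 1: 0.4, 2: 0.3},
    (1, 1): {0: 0.1, 1: 0.2, 2: 0.7},
}
p_o = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}, 2: {0: 0.01, 1: 0.99}}

NAMES = ("Experience", "Grades", "Interview", "Offer")


def joint(e, g, i, o):
    """Chain rule for the network: P(E) P(G) P(I | E, G) P(O | I)."""
    return p_e[e] * p_g[g] * p_i[(e, g)][i] * p_o[i][o]


def query(target, evidence):
    """P(target | evidence), by summing the joint over consistent rows."""
    numerator = denominator = 0.0
    for values in product((0, 1), (0, 1), (0, 1, 2), (0, 1)):
        row = dict(zip(NAMES, values))
        if any(row[k] != v for k, v in evidence.items()):
            continue                      # row contradicts the evidence
        p = joint(*values)
        denominator += p
        if all(row[k] == v for k, v in target.items()):
            numerator += p
    return numerator / denominator


prior = query({"Offer": 1}, {})
low_grades = query({"Offer": 1}, {"Grades": 0})
low_both = query({"Offer": 1}, {"Grades": 0, "Experience": 0})

print(round(prior, 6), round(low_grades, 5), round(low_both, 4))
```

The probability of an offer falls as each piece of unfavorable evidence about a parent variable is added, which is the pattern causal reasoning predicts.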


We now introduce evidence that the candidate's interview was good and evaluate the value for P(Experience=1|Interview=2), as shown in the following code:

tcpd=getTableCPD()
print(tcpd.specificquery(dict(Experience='1'),dict(Interview='2')))

The output of the preceding code is as follows:

0.864197530864

We see that if the candidate scores well on the interview, the probability that the candidate was highly experienced increases from the prior of 0.4 to about 0.86, which follows the reasoning that the candidate must have good experience or education, or both. In evidential reasoning, we reason from effect to cause.
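Reasoning from effect to cause is an application of the Bayes theorem. The sketch below recomputes P(Experience=1|Interview=2) by hand, assuming the CPD values from the network figure earlier in this chapter (only the Interview = 2 row of the Interview table is needed); the names are illustrative.

```python
# CPD values as read from the job interview network figure.
p_e = {0: 0.6, 1: 0.4}                                          # P(Experience)
p_g = {0: 0.7, 1: 0.3}                                          # P(Grades)
p_i2 = {(0, 0): 0.02, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7}    # P(I=2 | E, G)


def likelihood(e):
    """P(Interview = 2 | Experience = e), obtained by summing out Grades."""
    return sum(p_g[g] * p_i2[(e, g)] for g in p_g)


# P(Interview = 2), the normalizing constant.
evidence = sum(p_e[e] * likelihood(e) for e in p_e)

# Bayes theorem: P(E=1 | I=2) = P(I=2 | E=1) P(E=1) / P(I=2).
posterior = p_e[1] * likelihood(1) / evidence

print(posterior)
```

The result agrees with the Libpgm output shown above, up to floating-point rounding.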
