Ebook Probability for machine learning: Discover how to harness uncertainty with Python

12 3 Why Learn Probability for Machine Learning 13 3.1 Tutorial Overview.. Discover that probability distributions summarize the likelihood of events and common distribution functions fo

Tutorial Overview

This tutorial is divided into four parts; they are:

Uncertainty is Normal

Uncertainty signifies the presence of incomplete or imperfect information, a concept that contrasts with the focus on certainty and logic prevalent in mathematics and programming While software is often developed under the assumption of deterministic execution, the underlying computer hardware frequently encounters noise and errors that require constant checking and correction Achieving certainty with perfect information is rare, typically found only in controlled scenarios or theoretical examples In reality, most aspects of our lives exist on a spectrum between uncertainty and inaccuracy, reflecting the messy and imperfect nature of the world, which necessitates decision-making amidst these challenges.

Probability of an Event

In navigating uncertainty, we frequently use terms such as luck, chance, odds, likelihood, and risk to interpret our experiences To effectively reason and make inferences in an unpredictable environment, it is essential to employ principled and formal methods for problem-solving Probability serves as the fundamental language and toolkit for managing uncertainty.

Probability quantifies the likelihood of an event occurring, such as a fire in a neighborhood or a flood in a region It can also be applied to consumer behavior, like the purchase of a product To calculate the probability of an event, one can count the number of occurrences and divide it by the total number of possible outcomes, expressed as:\$$\text{probability} = \frac{\text{occurrences}}{\text{non-occurrences} + \text{occurrences}}\$$

Probability is a fractional value ranging from 0 to 1, where 0 signifies no chance of occurrence and 1 indicates certainty The total probability of all possible events equals one When all outcomes are equally likely, the probability of each event is calculated as 1 divided by the total number of outcomes For instance, when rolling a fair die, each number from 1 to 6 has an equal probability of \$\frac{1}{6}\$ or approximately 0.166 of being rolled.

Probability, commonly represented as a lowercase $ p $, can be expressed as a percentage by multiplying the value by 100; for instance, a probability of 0.3 translates to 30% (0.3 × 100) A 50% probability, often referred to as a 50-50 chance, indicates that the event is expected to occur half of the time Additionally, the probability of an event, such as a flood, is typically denoted by an uppercase $ P $ in the context of a probability function.

It is also sometimes written as a function of lowercase p or P r For example: p(f lood) or

P r(f lood) The complement of the probability can be stated as one minus the probability of the event For example:

1−P(f lood) = probability of no flood (1.3)

Probability, often called the odds or chance of an event, quantifies the likelihood of that event occurring While these terms are generally synonymous, "odds" specifically denotes the ratio of wins to losses, expressed as w:l; for example, 1:3 indicates 1 win for every 3 losses, translating to a 25% probability of winning Although we have discussed naive probability, probability theory encompasses a broader and more general framework.

Probability Theory

Probability extends logic to quantify and manage uncertainty This field, known as probability theory, distinguishes itself from the likelihood of individual events.

Two Schools of Probability

Probability extends logic to address uncertainty, offering formal rules to assess the likelihood of a proposition being true based on the probabilities of other related propositions.

— Page 56,Deep Learning, 2016. Probability theory has three important concepts:

Event (A) An outcome to which a probability is assigned.

Sample Space (S) The set of possible outcomes or events.

Probability Function (P) The function used to assign a probability to an event.

The probability of an event (A) occurring within a sample space (S) is defined by a probability function (P), which shapes the probability distribution of all events Common distributions include uniform, where all events are equally likely, and Gaussian, characterized by a normal bell shape Probability serves as a fundamental concept in various applied mathematics fields, particularly statistics, and underpins advanced disciplines such as physics, biology, and computer science.

Probability can be interpreted in two primary ways: Frequentist probability, which views it as the actual likelihood of an event occurring, and Bayesian probability, which reflects the strength of belief in the occurrence of an event Both approaches are not mutually exclusive; rather, they are complementary, offering distinct and valuable techniques for understanding probability.

The frequentist approach to probability is grounded in objectivity, relying on the observation and counting of events This method uses the frequencies of these events to directly calculate probabilities, which is reflected in its name Originally, probability theory was established to analyze event frequencies.

— Page 55,Deep Learning, 2016.Methods from frequentist probability include p-values and confidence intervals used in statistical inference and maximum likelihood estimation for parameter estimation.

Further Reading

The Bayesian approach to probability is inherently subjective, as it assigns probabilities to events based on both evidence and personal belief, grounded in Bayes’ theorem This methodology enables the assignment of probabilities to rare events and those that have not yet been observed, distinguishing it from frequentist probability.

One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about events that do not have long term frequencies.

— Page 27,Machine Learning: A Probabilistic Perspective, 2012.

Methods from Bayesian probability include Bayes factors and credible interval for inference and Bayes estimator and maximum a posteriori estimation for parameter estimation.

This section provides more resources on the topic if you are looking to go deeper.

Probability Theory: The Logic of Science, 2003. https://amzn.to/2lnW2pp

Introduction to Probability, 2nd edition, 2019. https://amzn.to/2xPvobK

Introduction to Probability, 2nd edition, 2008. https://amzn.to/2llA3PR

Uncertainty, Wikipedia. https://en.wikipedia.org/wiki/Uncertainty

Probability, Wikipedia. https://en.wikipedia.org/wiki/Probability

Odds, Wikipedia. https://en.wikipedia.org/wiki/Odds

Probability theory, Wikipedia. https://en.wikipedia.org/wiki/Probability_theory

Summary

In this tutorial, you discovered a gentle introduction to probability Specifically, you learned:

Certainty is unusual and the world is messy, requiring operating under uncertainty.

Probability quantifies the likelihood or belief that an event will occur.

Probability theory is the mathematics of uncertainty.

In the next tutorial, you will discover the source of uncertainty in machine learning.

Applied machine learning involves navigating various sources of uncertainty, such as data value variance, sample collection discrepancies, and the limitations of developed models Effectively managing this uncertainty is crucial for predictive modeling and can be accomplished using tools and techniques from probability, which is specifically designed to address uncertainty This tutorial will guide you through the challenges of uncertainty in machine learning, equipping you with the knowledge to tackle these issues effectively.

Uncertainty is the biggest source of difficulty for beginners in machine learning, especially developers.

Noise in data, incomplete coverage of the domain, and imperfect models provide the three main sources of uncertainty in machine learning.

Probability provides the foundation and tools for quantifying, handling, and harnessing uncertainty in applied machine learning.

Tutorial Overview

This tutorial is divided into five parts; they are:

3 Incomplete Coverage of the Domain

4 Imperfect Model of the Problem

Uncertainty in Machine Learning

Applied machine learning involves navigating uncertainty, which refers to working with incomplete or imperfect information This uncertainty is a core challenge for beginners, particularly those transitioning from a software engineering background, where computers operate deterministically In traditional programming, developers create algorithms that are evaluated based on their time or space complexity to optimize performance In contrast, predictive modeling in machine learning requires fitting models to correlate input examples with outputs, such as numerical values in regression or class labels in classification tasks.

What are the best features that I should use?

What is the best algorithm for my dataset?

The answers to these questions are unknown and might even be unknowable, at least exactly.

In the field of computer science, many disciplines focus on deterministic and certain entities Despite the relatively clean and predictable environments in which computer scientists and software engineers operate, it is intriguing that machine learning relies significantly on probability theory.

Beginners often face significant challenges due to uncertainty in machine learning, which leads to unknown answers The key to overcoming this difficulty lies in systematically evaluating various solutions to identify a suitable set of features or algorithms for specific prediction problems In the following sections, we will explore the three primary sources of uncertainty in machine learning.

Noise in Observations

Observations in a given domain are often noisy rather than precise, with each observation, or instance, representing a single row of data that reflects what was measured or collected This data describes the object or subject and serves as both the input to a model and the expected output For instance, a set of measurements of an iris flower along with its species exemplifies training data.

Listing 2.1: Example of measurements and expected class label.

In the case of new data for which a prediction is to be made, it is just the measurements without the species of flower.

Incomplete Coverage of the Domain

Listing 2.2: Example of measurements and missing class label.

Noise in data refers to variability in observations, which can arise from natural differences, such as the size of a flower, or from errors like measurement slips or typos This variability affects both inputs and outputs, potentially leading to incorrect classifications Given that real-world data is often messy and imperfect, practitioners must approach data with skepticism and develop systems to manage and leverage this uncertainty Consequently, significant effort is dedicated to reviewing data statistics and creating visualizations to identify and address unusual cases, a process known as data cleaning.

2.4 Incomplete Coverage of the Domain

Observations used to train a model are inherently a sample and incomplete In statistics, a random sample is defined as a collection of observations selected from a domain without systematic bias However, limitations will always introduce some degree of bias For instance, measuring the size of randomly selected flowers from a single garden may yield random selections, but the scope is restricted to that one garden Expanding the scope to include gardens within a city, across a country, or even across a continent can help mitigate this bias.

To ensure a machine learning model is effective, it is crucial to achieve an appropriate balance of variance and bias in the sample, making it representative of the intended task We strive to gather a random sample of observations for training and evaluation, but often have limited control over the sampling process, relying instead on existing databases or CSV files Consequently, we will never have access to all observations, as the need for a predictive model arises from the existence of unobserved cases within the problem domain Despite our efforts to enhance model generalization, we can only hope to cover the training dataset and the most significant unobserved cases.

To address the uncertainty in dataset representativeness and to evaluate the performance of a modeling procedure on unseen data, we divide a dataset into training and testing sets or employ resampling techniques such as k-fold cross-validation.

Imperfect Model of the Problem

A machine learning model will always have some error This is often summarized as all models are wrong, or more completely in an aphorism by George Box:

All models are wrong but some are useful

How to Manage Uncertainty

The entire process of model preparation, including data selection, training hyperparameters, and interpretation of predictions, is crucial Model errors can manifest as inaccurate predictions in regression tasks or mismatched class labels Such prediction errors are anticipated due to the inherent uncertainty in the data, which includes observational noise and incomplete domain coverage.

An error of omission occurs when we exclude details or generalize information to apply to new situations This is often done by opting for simpler, more robust models rather than complex ones that are tailored to the training data Consequently, we may select a model that is known to perform poorly on the training dataset, anticipating that it will generalize more effectively to new cases and ultimately deliver superior overall performance.

In many situations, opting for a straightforward yet uncertain rule is often more practical than relying on a complex but certain one, even when the actual rule is deterministic and our modeling system can support a more intricate approach.

Predictions are essential, and since models are prone to errors, we address this uncertainty by choosing a sufficiently effective model This choice is typically based on the model's relative performance, demonstrating skillfulness compared to naive methods or other established learning models.

Uncertainty in applied machine learning is managed using probability Probability is the field of mathematics designed to handle, manipulate, and harness uncertainty.

Uncertainty is a fundamental concept in pattern recognition, stemming from measurement noise and the limited size of data sets Probability theory offers a robust framework for quantifying and managing this uncertainty, serving as a cornerstone for effective pattern recognition.

— Page 12,Pattern Recognition and Machine Learning, 2006.

In fact, probability theory is central to the broader field of artificial intelligence.

Agents can handle uncertainty by using the methods of probability and decision theory, but first they must learn their probabilistic theories of the world from experience.

— Page 802, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

Probability methods and tools form the essential foundation for understanding the random or stochastic aspects of predictive modeling in machine learning.

Further Reading

In terms of noisy observations, probability and statistics help us to understand and quantify the expected value and variability of variables in our observations from the domain.

In terms of the incomplete coverage of the domain, probability helps to understand and quantify the expected distribution and density of observations in the domain.

In terms of model error, probability helps to understand and quantify the expected capability and variance in performance of our predictive models when applied to new data.

Probability is essential for the iterative training of various machine learning models, including maximum likelihood estimation, which underpins techniques like linear regression, logistic regression, and artificial neural networks It also serves as the foundation for specific algorithms, such as Naive Bayes, and contributes to entire subfields in machine learning, including graphical models like the Bayesian Belief Network.

Probabilistic methods form the basis of a plethora of techniques for data mining and machine learning.

— Page 336, Data Mining: Practical Machine Learning Tools and Techniques 4th edition, 2016.

In applied machine learning, the selection of procedures is meticulously designed to tackle the identified sources of uncertainty To fully grasp the rationale behind these choices, a fundamental comprehension of probability and probability theory is essential.

Chapter 3: Probability Theory, Deep Learning, 2016. https://amzn.to/2lnc3vL

Chapter 2: Probability, Machine Learning: A Probabilistic Perspective, 2012. https://amzn.to/2xKSTCP

Chapter 2: Probability Distributions, Pattern Recognition and Machine Learning, 2006. https://amzn.to/2JwHE7I

Uncertainty, Wikipedia. https://en.wikipedia.org/wiki/Uncertainty

Stochastic, Wikipedia. https://en.wikipedia.org/wiki/Stochastic

All models are wrong, Wikipedia. https://en.wikipedia.org/wiki/All_models_are_wrong

Summary

In this tutorial, you discovered the challenge of uncertainty in machine learning Specifically, you learned:

Uncertainty is the biggest source of difficulty for beginners in machine learning, especially developers.

Noise in data, incomplete coverage of the domain, and imperfect models provide the three main sources of uncertainty in machine learning.

Probability provides the foundation and tools for quantifying, handling, and harnessing uncertainty in applied machine learning.

In the next tutorial, you will discover why probability is so important in machine learning.

Why Learn Probability for Machine

Probability quantifies uncertainty and is a fundamental aspect of machine learning While often recommended as a prerequisite, understanding probability is more beneficial when contextualized within the applied machine learning process This tutorial will highlight the importance of studying probability for machine learning practitioners, enhancing their skills and capabilities After reading, you will gain insights into the practical applications of probability in machine learning.

Not everyone should learn probability; it depends where you are in your journey of learning machine learning.

Many algorithms are designed using the tools and techniques from probability, such as Naive Bayes and Probabilistic Graphical Models.

The maximum likelihood framework that underlies the training of many machine learning algorithms comes from the field of probability.

Tutorial Overview

This tutorial is divided into seven parts; they are:

1 Reasons to NOT Learn Probability

2 Class Membership Requires Predicting a Probability

3 Some Algorithms Are Designed Using Probability

4 Models Are Trained Using a Probabilistic Framework

5 Models Can Be Tuned With a Probabilistic Framework

6 Probabilistic Measures Are Used to Evaluate Model Skill

Reasons to NOT Learn Probability

3.2 Reasons to NOT Learn Probability

While there are compelling reasons to learn probability, beginners in applied machine learning may find it unnecessary to focus on this topic initially.

Understanding the abstract theory behind machine learning algorithms is not necessary to effectively utilize machine learning as a problem-solving tool.

Delaying your entry into machine learning by spending months or even years studying an entire related field can hinder your ability to tackle predictive modeling problems effectively.

It’s a huge field Not all of probability is relevant to theoretical machine learning, let alone applied machine learning.

I recommend starting with a breadth-first approach in applied machine learning, which I refer to as the results-first approach This method involves learning and practicing the complete steps of a predictive modeling problem using tools like scikit-learn and Pandas in Python By doing so, you establish a foundational understanding that allows you to progressively deepen your knowledge of algorithms and the underlying mathematics Once you are familiar with the predictive modeling process, it is essential to enhance your understanding of probability.

Class Membership Requires Predicting a Probability

Classification predictive modeling involves assigning a specific label to an example A well-known instance of this is the iris flowers dataset, which includes four measurements of a flower The objective is to classify each observation into one of three recognized species of iris flowers, effectively modeling the problem by directly assigning a class label to each observation.

A more common approach is to frame the problem as a probabilistic class membership, where the probability of an observation belonging to each known class is predicted.

Output: Probability of membership to each iris species.

Framing the problem as a prediction of class membership simplifies modeling and enhances the model's learning capability This approach enables the model to capture data ambiguity, allowing users to interpret probabilities within their specific context By selecting the class with the highest probability, these probabilities can be converted into definitive class labels Additionally, a probability calibration process can scale or transform these probabilities Understanding this class membership framing is essential for interpreting the model's predictions effectively.

Some Algorithms Are Designed Using Probability

3.4 Some Algorithms Are Designed Using Probability

Algorithms designed to leverage probability tools and methods include individual algorithms such as the Naive Bayes algorithm, which is based on Bayes Theorem and incorporates simplifying assumptions.

It also extends to whole fields of study, such as probabilistic graphical models, often called graphical models or PGM for short, and designed around Bayes Theorem.

A notable graphical model is Bayesian Belief Networks, or Bayes Nets, which are capable of capturing the conditional dependencies between variables.

Models Are Trained Using a Probabilistic Framework

Many machine learning models utilize an iterative algorithm based on a probabilistic framework, with maximum likelihood estimation (MLE) being one of the most prevalent MLE is used to estimate model parameters, such as weights, from observed data and serves as the foundation for the ordinary least squares estimate in linear regression The expectation-maximization algorithm is also a key component in this context.

EM for short, is an approach for maximum likelihood estimation often used for unsupervised data clustering, e.g estimating k means for k clusters, also known as the k-Means clustering algorithm.

Maximum likelihood estimation is essential for models predicting class membership, as it minimizes the divergence between observed and predicted probability distributions This approach is integral to classification algorithms such as logistic regression and deep learning neural networks During training, the difference in probability distributions is often measured using entropy, particularly through cross-entropy These concepts, including entropy and KL divergence, stem from information theory and are grounded in probability theory, with entropy defined as the negative log of the probability.

Models Can Be Tuned With a Probabilistic Framework

Tuning hyperparameters is essential in machine learning models, like adjusting k in kNN or the learning rate in neural networks Common methods include grid searching and random sampling of hyperparameter combinations However, Bayesian optimization offers a more efficient alternative by conducting a directed search through the configuration space, focusing on those configurations that are likely to enhance performance This method is grounded in Bayes Theorem, which guides the sampling of potential configurations.

Probabilistic Measures Are Used to Evaluate Model Skill

3.7 Probabilistic Measures Are Used to Evaluate Model

To evaluate algorithms that predict probabilities, various performance measures are essential Common aggregate measures include log loss and Brier score, which summarize model performance In binary classification tasks, Receiver Operating Characteristic (ROC) curves help analyze different cut-off points, leading to various trade-offs in interpretation Additionally, the area under the ROC curve (ROC AUC) serves as another aggregate measure Understanding and interpreting these scoring methods necessitates a solid grasp of probability theory.

One More Reason

Learning probability can be enjoyable, especially when taught through practical examples and executable code Engaging with real data helps develop an intuition for a subject that is often perceived as abstract, making the learning experience both fun and insightful.

Further Reading

Pattern Recognition and Machine Learning, 2006. https://amzn.to/2JwHE7I

Machine Learning: A Probabilistic Perspective, 2012. https://amzn.to/2xKSTCP

Machine Learning, 1997. https://amzn.to/2jWd51p

Graphical model, Wikipedia. https://en.wikipedia.org/wiki/Graphical_model

Maximum likelihood estimation, Wikipedia. https://en.wikipedia.org/wiki/Maximum_likelihood_estimation

Expectation-maximization algorithm, Wikipedia. https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm

Summary

Cross entropy, Wikipedia. https://en.wikipedia.org/wiki/Cross_entropy

Kullback-Leibler divergence, Wikipedia. https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

Bayesian optimization, Wikipedia. https://en.wikipedia.org/wiki/Bayesian_optimization

In this tutorial, you discovered why, as a machine learning practitioner, you should deepen your understanding of probability Specifically, you learned:

Not everyone should learn probability; it depends where you are in your journey of learning machine learning.

Many algorithms are designed using the tools and techniques from probability, such as Naive Bayes and Probabilistic Graphical Models.

The maximum likelihood framework that underlies the training of many machine learning algorithms comes from the field of probability.

This was the final tutorial in this Part In the next Part, you will discover the different types of probability.

Probability measures the uncertainty of outcomes for random variables, and while calculating the probability for a single variable is straightforward, machine learning often involves multiple interacting random variables Techniques such as joint, marginal, and conditional probability are essential for quantifying the relationships between these variables, forming the foundation for probabilistic modeling in data analysis This tutorial offers a gentle introduction to these concepts, ensuring that by the end, you will have a clear understanding of joint, marginal, and conditional probability for multiple random variables.

Joint probability is the probability of two or more events occurring simultaneously.

Marginal probability is the probability of an event irrespective of the outcome of other variables.

Conditional probability is the probability of one event occurring in the presence of one or more other events.

Tutorial Overview

This tutorial is divided into three parts; they are:

1 Probability for One Random Variable

2 Probability for Multiple Random Variables

3 Probability for Independence and Exclusivity

Probability for One Random Variable

4.2 Probability for One Random Variable

Probability measures the chance of an event occurring, determining the likelihood of specific outcomes for random variables like coin flips, dice rolls, or drawing cards from a deck.

Probability gives a measure of how likely it is for something to happen.

— Page 57, Probability: For the Enthusiastic Beginner, 2016.

For a random variable x, P(x) is a function that assigns a probability to all possible values of x.

The probability of a specific event A for a random variable x is denoted as P(x =A), or simply as P(A).

Probability is calculated as the number of desired outcomes divided by the total possible outcomes, in the case where all outcomes are equally likely.

Probability = number of desired outcomes total number of possible outcomes (4.3)

The probability of rolling a specific number, such as a 5, on a die is determined by dividing the number of favorable outcomes (1) by the total possible outcomes (6), resulting in a probability of \$\frac{1}{6}\$ or approximately 16.666% It is essential that the sum of the probabilities of all possible outcomes equals one to ensure the validity of the probability distribution.

Sum of the Probabilities for All Outcomes = 1.0 (4.4)

The probability of an impossible outcome is zero For example, it is impossible to roll a 7 with a standard six-sided die.

The probability of a certain outcome is one For example, it is certain that a value between

1 and 6 will occur when rolling a six-sided die.

The complement of an event represents the probability of that event not occurring, which can be determined by subtracting the event's probability from one, expressed as $1 - P(A)$ For instance, the likelihood of not rolling a 5 is calculated as $1 - P(5)$, resulting in approximately 0.833 or 83.333%.

Now that we are familiar with the probability of one random variable, let’s consider probability for multiple random variables.

Probability for Multiple Random Variables

4.3 Probability for Multiple Random Variables

In machine learning, we frequently encounter numerous random variables For instance, in a data table like Excel, each row signifies a distinct observation or event, while each column corresponds to a different random variable These variables can be classified as discrete, which have a finite number of possible values, or continuous, which can assume any real or numerical value.

As such, we are interested in the probability across two or more random variables.

Understanding the interaction of random variables is complex, as their relationships significantly affect their probabilities To simplify this, we can focus on two random variables, X and Y, while recognizing that these principles can extend to multiple variables Additionally, we can examine the probability of two specific events, one for each variable (X = A, Y = B), although it is equally valid to consider groups of events for each variable.

In this article, we will explore the probability of multiple random variables, specifically focusing on the relationship between event A and event B, denoted as X = A and Y = B We will assume that these two variables are interconnected or dependent Consequently, we will examine three primary types of probability that are relevant to this discussion.

Joint Probability: Probability of events A and B.

Marginal Probability: Probability of eventA given variable Y.

Conditional Probability: Probability of event A given event B.

These types of probability form the basis of much of predictive modeling with problems such as classification and regression For example:

The probability of a row of data is the joint probability across each input variable.

The probability of a specific value of one input variable is the marginal probability across the values of the other input variables.

The predictive model itself is an estimate of the conditional probability of an output given an input example.

Joint, marginal, and conditional probability are foundational in machine learning Let’s take a closer look at each in turn.

4.3.1 Joint Probability for Two Variables

The joint probability refers to the likelihood of two or more simultaneous events, such as the outcomes of different random variables This concept is encapsulated in the joint probability distribution, which formally expresses the joint probability of events A and B.

Theandor conjunction is denoted using the upside down capitalU operator (∩) or sometimes a comma (,).

4.3 Probability for Multiple Random Variables 22

The joint probability for events A andB is calculated as the probability of event A given event B multiplied by the probability of event B This can be stated formally as follows:

The joint probability, often referred to as the fundamental or product rule of probability, is defined as P(A given B), which represents the conditional probability of event A occurring given that event B has taken place Notably, joint probability is symmetrical, indicating that P(A∩B) equals P(B∩A).

We can analyze the probability of a specific event occurring for one random variable, regardless of the outcomes of another random variable For instance, we can determine the probability of $X = A$ across all possible outcomes.

Marginal probability, also known as marginal distribution, refers to the likelihood of a single event occurring while considering all or a subset of outcomes from another random variable It specifically denotes the probability distribution of one random variable when additional random variables are present.

Marginal probability refers to the sum of probabilities for one variable, represented in a table format with variable X as columns and variable Y as rows In this context, the marginal probability of variable X is calculated by summing the probabilities of all outcomes for variable Y, which are displayed along the margin of the table There is no specific notation for marginal probability; it simply involves aggregating the probabilities of all events related to the second variable while holding the first variable constant.

The sum rule is a fundamental principle in probability that distinguishes marginal probability from conditional probability Unlike conditional probability, which focuses on a single event, marginal probability accounts for the union of all events related to the second variable.

Conditional probability refers to the likelihood of an event occurring based on the occurrence of another event It is expressed as the conditional probability distribution when considering one or more random variables For instance, the conditional probability of event A given event B is formally denoted as P(A|B).

The given is denoted using the pipe (|) operator; for example:

Probability for Independence and Exclusivity

This calculation presumes that the probability of event B is greater than zero, indicating that it is possible The concept of event A occurring given event B does not imply that event B has definitely happened; rather, it refers to the likelihood of event A occurring in the context of event B for a specific trial.

4.4 Probability for Independence and Exclusivity

In the analysis of multiple random variables, it is essential to recognize that they may be independent, meaning their outcomes do not influence one another Alternatively, these variables might interact, yet their events can be exclusive, occurring at different times This section will explore the probabilities associated with multiple random variables under these conditions.

Statistical independence occurs when one variable does not depend on another, affecting the calculation of their probabilities For independent events A and B, the joint probability is equivalent to the product of their individual probabilities Thus, the joint probability of independent events can be expressed as the probability of event A multiplied by the probability of event B.

The marginal probability of an event for an independent random variable is essentially the probability of that event itself This concept reflects the familiar notion of probability associated with a single random variable.

The marginal probability of an independent event is referred to as the probability itself Likewise, when two variables are independent, the conditional probability of A given B is simply the probability of A, since the occurrence of B does not influence it.

Statistical independence is a key concept in sampling, where each sample is considered unaffected by previous samples and does not influence future ones Many machine learning algorithms rely on the assumption that samples are independent and identically distributed (i.i.d.), meaning they come from the same probability distribution.

Further Reading

Mutually exclusive events are those where the occurrence of one event prevents the occurrence of others These events are disjoint, indicating that they cannot interact and are strictly independent If event A is mutually exclusive with event B, the joint probability of both events occurring simultaneously is zero.

Instead, the probability of an outcome can be described as event A or event B, stated formally as follows:

The or is also called a union and is denoted as a capital U letter (∪); for example:

When dealing with non-mutually exclusive events, we focus on the likelihood of either event occurring The probability of such events is determined by adding the probability of event A to the probability of event B, then subtracting the probability of both events happening at the same time.

Probability: For the Enthusiastic Beginner, 2016. https://amzn.to/2jULJsu

Notation in probability and statistics, Wikipedia. https://en.wikipedia.org/wiki/Notation_in_probability_and_statistics

Independence (probability theory), Wikipedia. https://en.wikipedia.org/wiki/Independence_(probability_theory)

Summary

Independent and identically distributed random variables, Wikipedia. https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_ variables

Mutual exclusivity, Wikipedia. https://en.wikipedia.org/wiki/Mutual_exclusivity

Marginal distribution, Wikipedia. https://en.wikipedia.org/wiki/Marginal_distribution

Joint probability distribution, Wikipedia. https://en.wikipedia.org/wiki/Joint_probability_distribution

Conditional probability, Wikipedia. https://en.wikipedia.org/wiki/Conditional_probability

In this tutorial, you discovered a gentle introduction to joint, marginal, and conditional probability for multiple random variables Specifically, you learned:

Joint probability is the probability of two or more events occurring simultaneously.

Marginal probability is the probability of an event irrespective of the outcome of other variables.

Conditional probability is the probability of one event occurring in the presence of one or more other events.

In the next tutorial, you will discover how to develop an intuition for the different types of probability with worked examples.

Intuition for Joint, Marginal, and

Understanding probability for a single random variable is simple, but it becomes more complex with two or more variables Key concepts include joint probability, which assesses the likelihood of two events occurring simultaneously; conditional probability, which evaluates the probability of one event given that another has occurred; and marginal probability, which considers the probability of an event independently of other variables While these definitions are clear, grasping their implications may require time and practical examples This tutorial will help you develop a deeper understanding of joint, marginal, and conditional probability.

How to calculate joint, marginal, and conditional probability for independent random variables.

How to collect observations from joint random variables and construct a joint probability table.

How to calculate joint, marginal, and conditional probability from a joint probability table.

Tutorial Overview

1 Joint, Marginal, and Conditional Probabilities

2 Probabilities of Rolling Two Dice

3 Probabilities of Weather in Two Cities

Joint, Marginal, and Conditional Probabilities

5.2 Joint, Marginal, and Conditional Probabilities

Calculating probability is relatively straightforward when working with a single random variable.

When dealing with two or more random variables, the analysis becomes more intriguing, reflecting many real-world situations There are three primary types of probabilities that are essential to calculate in the context of multiple random variables.

Joint Probability The probability of simultaneous events.

Marginal Probability The probability of an event irrespective of the other variables.

Conditional Probability The probability of events given the presence of other events.

The meaning and calculation of these different types of probabilities vary depending on whether the two random variables are independent (simpler) or dependent (more complicated).

This article delves into the calculation and interpretation of three types of probability, featuring worked examples for clarity We will first examine the independent rolls of two dice, followed by an analysis of weather events in two geographically close cities.

Probabilities of Rolling Two Dice

To understand joint and marginal probabilities, it's beneficial to begin with independent random variables, as their calculations are straightforward For instance, rolling a fair die results in a probability of \$\frac{1}{6}\$ or approximately 16.67% for any specific number between 1 and 6 to appear.

When rolling a second die, each value on that die has an equal probability The outcomes of the first die and the second die are independent events, meaning that the result of one die does not influence the other.

P(dice2∈ {1,2,3,4,5,6}) = 1.0 (5.2) First, using exclusivity, we can calculate the probability of rolling an even number for dice1 as the sum of the probabilities of rolling a 2, 4, or 6, for example:

5.3 Probabilities of Rolling Two Dice 28

The joint probability of rolling an even number with both dice simultaneously is calculated by considering the independent probabilities of each die This results in a joint probability of 0.5, or 50%, which aligns with our intuitive expectations.

The probability of rolling an even number on two dice is determined by multiplying the individual probabilities for each die The outcome of the first die influences the likelihood of the second die's result.

The probability of rolling an even number on a single die is 0.5, leading to a combined probability of 0.25 (or 25%) when rolling two dice Each die has 6 possible outcomes, resulting in 36 total combinations when rolling two dice Out of these, 9 combinations yield an even number on both dice, confirming that the probability of rolling an even number on each die is indeed 25%.

If you are ever in doubt of your probability calculations when working with independent variables with discrete events, think in terms of combinations and things will make sense again.

We can create a joint probability table using our understanding of the domain, with dice1 represented on the x-axis and dice2 on the y-axis Each cell's joint probability is determined using the joint probability formula, such as 0.166 × 0.166, resulting in a value of approximately 0.027 or 2.777%.

Listing 5.1: Example of the joint probability table for rolling two die.

This table illustrates the joint probability distribution of two random variables, dice1 and dice2, providing insights into joint and marginal probabilities of independent variables For instance, the joint probability of rolling a 2 on both dice is 2.777%, which can be directly obtained from the table Additionally, we can analyze more complex scenarios, such as rolling a 2 on dice1 while rolling an odd number on dice2, by summing the probabilities from the relevant rows and columns.

The marginal probability of rolling a specific number on one die, such as a 2 on dice1 or a 6 on dice2, can be determined by summing the probabilities from the corresponding column or row in a probability table For instance, the marginal probability of rolling a 6 with dice2 can be calculated by adding the probabilities from the entire column associated with that outcome, resulting in approximately 0.083 or 8.333%.

Probabilities of Weather in Two Cities

across the final row of the table This comes out to be about 0.166 or 16.666% as we may intuitively expect.

In a table of joint probabilities, the total sum of probabilities for all cells must equal 1.0 Additionally, the sum of probabilities for each row and each column should also equal 1.0 This requirement ensures the integrity of the probability distribution Since the events are independent, calculating conditional probabilities does not require any special considerations.

For example, the probability of rolling a 2 with dice1 is the same regardless of what was rolled with dice2.

Conditional probability lacks significance for independent random variables Creating a joint probability table is an effective method for understanding the calculation and exploration of joint and marginal probabilities In the following section, we will examine a more complex example involving dependent random variables.

5.4 Probabilities of Weather in Two Cities

Understanding joint and marginal probabilities can be enhanced by examining a table of joint probabilities for two dependent random variables, such as the weather in two cities, city1 and city2 These cities experience similar weather patterns due to their proximity, yet they do not have identical conditions For instance, when city1 is sunny, city2 is often sunny as well, indicating a dependency in their weather This scenario provides a practical context to explore various types of probability.

First, we can record the observed weather in each city over twenty days For example, on day 1, what was the weather in each, day 2, and so on.

Listing 5.2: Example of collected data for the weather in two cities.

For brevity, the complete results table is omitted, but we will summarize the totals later We will calculate the total number of paired weather events observed, such as the frequency of sunny days in city1 and city2, as well as the occurrences of sunny days in city1 paired with cloudy days in city2, among other combinations.

City 1 | City 2 | Total sunny sunny 6/20

5.4 Probabilities of Weather in Two Cities 30 sunny cloudy 1/20 sunny rainy 0/20

Listing 5.3: Example of the frequency of weather in two cities.

The complete table is not included for conciseness, but we will calculate the totals later This information serves as a foundation for analyzing the likelihood of weather events in the two cities.

We can analyze the probability of weather events in two cities by creating a table that summarizes the joint probabilities of discrete weather occurrences This table features city1 along the top (columns) and city2 along the side (rows), providing a clear overview of the likelihood of various paired weather events.

Listing 5.4: Example of the joint probabilities for weather in two cities.

The table presents the joint probability of weather events occurring in two cities, with the total of all joint probabilities equaling 1.0 By analyzing this table, we can determine the likelihood of simultaneous weather conditions, such as the high probability of both cities experiencing sunny weather at the same time.

The probability of rain in the first city is 20%, while the likelihood of rain in the second city is 30% Additionally, we can analyze the scenario where it does not rain in the first city but does rain in the second city.

P(city1∈ {sunny∪cloudy} ∩city2=rainy) (5.11)

Again, we can calculate this directly from the table Firstly, P(sunny,rainy) is 20 0 and

The probability of experiencing both cloudy and rainy weather is 0.05, indicating that while it is possible, it is not very likely Additionally, the table provides insights into the marginal distribution of weather events For instance, to determine the probability of a sunny day in city1, one can simply sum the probabilities listed in the first column for sunny weather, regardless of the conditions in city2.

P(city1=sunny) =P(city1=sunny∩city2=sunny)+

5.4 Probabilities of Weather in Two Cities 31 Or:

The marginal probability of a sunny day in city1 is 35% Similarly, for city2, the marginal probability can be determined by summing the probabilities across the relevant row, such as calculating the probability of a rainy day by adding the probabilities from the bottom row of the table.

The marginal probability of experiencing a rainy day in city2 is 20% Marginal probabilities are valuable and informative, making it beneficial to update the joint probability table to incorporate these figures.

Listing 5.5: Example of the joint and marginal probabilities for weather in two cities.

Conditional probability allows us to determine the likelihood of a weather event occurring in one city based on the occurrence of a similar event in another city This probability can be computed using joint and marginal probabilities.

For example, we might be interested in the probability of it being sunny in city1, given that it is sunny in city2 This can be stated formally as:

P(city1=sunny|city2=sunny) = P(city1=sunny∩city2=sunny)

We can fill in the joint and marginal probabilities from the table in the previous section; for example:

Further Reading

The probability of it being sunny in city1 when it is sunny in city2 is 75%, indicating a strong correlation between the two cities' weather In contrast, the joint probability of both cities being sunny on the same day is only 30% This difference can be better understood by examining the various combinations of weather patterns.

In this conditional scenario, we have more information, allowing us to focus on 8 days instead of calculating probabilities over all 20 days Knowing that it is sunny in city2, we find that 6 out of those 8 days were also sunny in city1, resulting in a probability of 75% This information can be derived from the table of joint probabilities It's crucial to understand that conditional probability is not reversible, a common misconception.

The probability of sunny weather in city1, given that it is sunny in city2, differs from the probability of sunny weather in city2, given that it is sunny in city1.

P(city1=sunny|city2=sunny)6=P(city2=sunny|city1=sunny) (5.19)

In this case, the probability of it being sunny in city2 given that it is sunny in city1 is calculated as follows:

P(city2=sunny|city1=sunny) = P(city2=sunny∩city1=sunny)

In this case, it is higher, at about 85.714% We can also use the conditional probability to calculate the joint probability.

To calculate the joint probability of sunny weather in city2, given that it is sunny in city1, we can use the conditional probability of sunny in city2 given sunny in city1 along with the marginal probability of sunny in city2.

P(city1=sunny∩city2=sunny) =P(city2=sunny|city1=sunny)×P(city1=sunny)

This gives 0.3 or 30% as we expected.