Supervised Machine Learning Lecture notes for the Statistical Machine Learning course Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, Thomas B Schön Version March 12, 2019 Department of Informat.
What is machine learning all about?
Machine learning gives computers the ability to learn from data without being explicitly programmed. Learning happens when data is combined with mathematical models to estimate the values of unknown parameters, so that the model fits the observed patterns. While the simplest case is fitting a straight line, machine learning relies on far more flexible models to capture complex relationships, with the aim of generalization: the ability to make accurate predictions on new, unseen data. For instance, a model trained on a data set of 1,000 puppy images can, if well chosen, determine whether another image depicts a puppy even though it was not part of the training set.
The science of machine learning is about learning models that generalize well.
These notes cover supervised learning, the setting where data come as labeled pairs {(x_i, y_i)} for i = 1 to n, with x_i representing the inputs and y_i the corresponding outputs. Labeled means that each data point has an input x_i and an output y_i that explains what we see in the data. A medical example illustrates this: an electrocardiogram (ECG) measures the heart’s electrical activity, where the ECG readings are the inputs x and the doctor’s diagnosis is the output y. With a sufficiently large labeled dataset of ECGs and diagnoses, supervised learning trains a model to learn the relationship between x and y. Once trained, the model can diagnose new ECG readings, producing a prediction ŷ. If the model’s predictions on new ECGs align closely with the true diagnoses, the model generalizes well.
Supervised learning relies on labeled data, pairs (x_i, y_i), which is often expensive and sometimes impossible to obtain because humans must interpret the inputs to provide the correct outputs. This labeling bottleneck is compounded by the fact that many state-of-the-art methods require large data volumes to achieve strong performance. Unsupervised learning instead uses only the input data x_i (unlabeled data). A central subproblem here is clustering: automatically organizing data into groups based on similarity. A growing middle ground, semi-supervised learning, blends labeled and unlabeled data to improve learning when labeled data are scarce.
Typically we have access to large volumes of unlabeled data, while labeled data remains scarce; that small but valuable set of labeled examples can significantly boost learning when used together with the much larger unlabeled dataset.
Reinforcement learning, a branch of machine learning, goes beyond using measured data to merely predict outcomes or interpret a situation. Instead, it trains an agent to learn optimal behavior by interacting with its environment through trial and error and receiving feedback in the form of rewards. The objective is to maximize cumulative rewards over time by choosing actions that lead to favorable long-term consequences, resulting in robust policies that adapt to changing conditions. Core methods include model-free approaches such as Q-learning and policy gradient techniques, as well as model-based methods that simulate environmental dynamics. Deep reinforcement learning extends these ideas with neural networks to handle high-dimensional states, unlocking capabilities in robotics, autonomous systems, game playing, and complex optimization tasks. By emphasizing sequential decision-making, exploration versus exploitation, and continuous learning, reinforcement learning enables systems to improve their performance through experience and generalize to new tasks.
1 Some common synonyms used for the input variable include feature, predictor, regressor, covariate, explanatory variable, controlled variable and independent variable.
Common synonyms for the output variable include response, regressand, label, explained variable, predicted variable, and dependent variable.
To build systems that act in the real world, we typically train them to take actions by maximizing a reward that promotes the desired state of the environment. This reinforcement learning approach has strong ties to control theory. A newer area, causal learning, addresses the harder task of learning cause-and-effect relationships, moving beyond the correlations that characterize much of traditional machine learning. In causal learning the goal is to identify causal relations rather than merely learning associations.
Regression and classification
Supervised machine learning algorithms are most usefully categorized by the type of output variable they predict. In general, the output can be quantitative (numerical values, either continuous or discrete) or qualitative (categorical or ordinal). Distinguishing these data types helps determine the appropriate modeling approach, with regression methods applied to quantitative outputs and classification methods used for qualitative outputs. See Table 1.1 for a few examples.
Table 1.1: Examples of quantitative and qualitative variables.
Variable type | Example | Handle as
Numeric (continuous) | 32.23 km/h, 12.50 km/h, 42.85 km/h | Quantitative
Numeric (discrete) with natural ordering | 0 children, 1 child, 2 children | Quantitative
Numeric (discrete) without natural ordering | 1 = Sweden, 2 = Denmark, 3 = Norway | Qualitative
Text (not numeric) | Uppsala University, KTH, Lund University | Qualitative
Depending on whether the output of a problem is quantitative or qualitative, we refer to the problem as either regression or classification.
Regression means the output is quantitative, and classification means the output is qualitative.
This means that whether a problem is about regression or classification depends only on its output. The input can be either quantitative or qualitative in both cases.
The line between quantitative and qualitative, and therefore between regression and classification, is somewhat arbitrary, and there is not always a clear answer. For example, one could argue that having no children is qualitatively different from having children, and encode the output as 'children: yes/no' rather than as a numeric count, which turns a regression problem into a classification problem. The choice of output representation, such as 0, 1, or 2 children versus a binary yes/no, thus directly shapes the modeling task. In short, the distinction depends on how the output variable is defined and represented, not on an absolute difference between quantities and qualities.
Overview of these lecture notes
The following sketch gives an idea of how the chapters are connected.
Chapter 2: The regression problem and linear regression
Chapter 3: The classification problem and three parametric classifiers
Chapter 4: Non-parametric methods for regression and classification: k-NN and trees
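Chapter 5: How well does a method perform?
Chapter 6: Ensemble methods
Chapter 7: Neural networks and deep learning
(In the original sketch, arrows between the chapters mark which earlier chapters are needed and which are recommended.)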
Further reading
Numerous comprehensive textbooks now cover machine learning, often outlining the field in ways that differ from these notes. The canonical text by Hastie, Tibshirani, and Friedman (2009) presents statistical machine learning with rigorous mathematics and accessibility, and a later, lighter version (James et al., 2013) preserves the core ideas while simplifying the exposition. These works emphasize the main concepts but do not delve deeply into Bayesian methods. For Bayesian perspectives, several complementary books offer thorough treatments, including Barber (2012), Bishop (2006), and Murphy (2012); MacKay (2003) provides an early, insightful connection between learning and information theory. Efron and Hastie (2016) present a constructive historical view of the data-analytic revolution driven by computers. A contemporary introduction to the mathematics of machine learning is given by Deisenroth, Faisal, and Ong (2019), and two relatively recent papers that helped introduce the area are Ghahramani (2015) and Jordan & Mitchell (2015).
Machine learning is a highly dynamic field with flagship venues like the International Conference on Machine Learning (ICML) and the Conference on Neural Information Processing Systems (NeurIPS), both held annually and offering free access to the latest research on icml.cc and neurips.cc. Other important conferences include the International Conference on Artificial Intelligence and Statistics (AISTATS) and the International Conference on Learning Representations (ICLR). The Journal of Machine Learning Research (JMLR) stands as a leading publication in the area, alongside other top journals such as IEEE Transactions on Pattern Analysis and Machine Intelligence. There is also quite a lot of relevant work published within statistical journals, in particular within the area of computational statistics.
2 The regression problem and linear regression
In these notes, regression is introduced as one of the two central problems in supervised learning, the other being classification, with linear regression presented as the first method to tackle the regression problem Although simple, linear regression is surprisingly useful and serves as a foundational building block for more advanced techniques, such as deep learning, which is discussed in Chapter 7.
The regression problem
Regression is the task of learning how a set of input variables x = [x1, x2, …, xp]^T relates to a quantitative output y. Mathematically, it aims to estimate a model y = f(x) + ε, where ε is the noise or error term that captures everything the model cannot explain. From a statistical perspective, ε is treated as a random variable that is independent of x and has an expected value of zero.
In this chapter, we use the car stopping distance dataset introduced in Example 2.1 to illustrate regression. The goal is to learn a regression model that can predict the distance a car needs to come to a full stop, based on its current speed.
Ezekiel and Fox (1959) present a dataset comprising 62 observations that document the stopping distance required for different cars at varying initial speeds The dataset centers on two variables: initial speed and the corresponding stopping distance, enabling analysis of how braking distance grows with speed and supporting the development of predictive models for vehicle stopping behavior.
-Speed: The speed of the car when the brake signal is given.
-Distance: The distance traveled after the signal is given until the car has reached a full stop.
We decide to interpret Speed as the input variable x, and Distance as the output variable y.
Using linear regression, we estimate the stopping distance for initial speeds of 33 mph and 45 mph, two speeds for which no data were recorded in the available dataset. The model relies on a linear relationship between speed and stopping distance, which allows predictions at speeds that were never observed. Because the dataset is somewhat dated, the resulting estimates may not reflect how modern cars perform, but the method itself remains informative. Readers can treat the data as if it came from their own preferred example, which does not change the interpretation of the predictions for these unobserved speeds.
1 We will start with quantitative input variables, and discuss qualitative input variables later in 2.5.
The linear regression model
Describe relationships — classical statistics
Scientists often ask whether a relationship exists between variables, for instance whether consuming seafood influences lifespan. Such questions can be addressed by examining the linear regression parameter β1 after fitting the model to data. If β1 equals zero, it suggests no direct association between the output y and the input x1, unless other inputs also depend on x1. Estimating β1 together with a confidence interval quantifies the uncertainty of the estimate; if zero falls outside this interval at a chosen significance level, we can conclude that x1 and y are correlated. This approach is the core of hypothesis testing, a foundational concept in classical statistics. Beyond testing, linear regression is also a powerful tool for prediction, allowing us to forecast outputs for new values of x1 based on the learned relationship.
Predicting future outputs — machine learning
In machine learning, the goal is to predict the output for unseen inputs x* = [x*1, x*2, …, x*p]^T. To make a prediction for a test input x*, we insert it into the model (2.2). Since ε has zero mean, the prediction is ŷ* = β0 + β1 x*1 + β2 x*2 + … + βp x*p. (2.3)
We use the hat symbol on ŷ* to indicate that it is a prediction, our best guess. If we were able to somehow observe the actual output from x*, we would denote it by y* (without a hat).
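As a minimal sketch of (2.3), the prediction is the offset plus an inner product between the remaining parameters and the test input. The function name `predict` and all numbers below are made up for illustration; they are not taken from the notes.

```python
import numpy as np

def predict(beta, x_star):
    """Return y_hat = beta_0 + beta_1 x_1 + ... + beta_p x_p, as in (2.3)."""
    beta = np.asarray(beta, dtype=float)
    x_star = np.asarray(x_star, dtype=float)
    # beta[0] is the offset; beta[1:] are the weights of the p inputs.
    return float(beta[0] + beta[1:] @ x_star)

# Hypothetical parameters (p = 2) and test input, chosen only for illustration.
beta_example = [1.0, 2.0, -0.5]
y_hat_example = predict(beta_example, [3.0, 4.0])  # 1 + 2*3 - 0.5*4 = 5.0
```

The noise term ε does not appear in the function, since its expected value is zero.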
Learning the model from training data
Maximum likelihood
Our strategy to learn the unknown parameters β from the training data T will be the maximum likelihood method. The word ‘likelihood’ refers to the statistical concept of the likelihood function, and maximizing the likelihood function amounts to finding the value of β that makes observing y as likely as possible. That is, we want to solve
$$\operatorname*{maximize}_{\beta} \; p(y \mid X, \beta), \tag{2.8}$$
where p(y|X,β) is the probability density of the data y given a certain value of the parameters β. We denote the solution to this problem, the learned parameters, with β̂ = [β̂0 β̂1 ··· β̂p]^T. More compactly, we write this as
$$\hat{\beta} = \arg\max_{\beta} \; p(y \mid X, \beta). \tag{2.9}$$
To define what 'likely' means and thus express p(y|X,β) mathematically, we must specify the noise term ε. A common choice is that ε follows a Gaussian (normal) distribution with zero mean and variance σ_ε^2, ε ∼ N(0, σ_ε^2). (2.10)
This Gaussian noise assumption directly shapes the likelihood function p(y|X,β) and enables tractable statistical estimation and inference.
This implies that the conditional probability density function of the output y for a given value of the input x is given by
$$p(y \mid x, \beta) = \mathcal{N}\left(y \mid \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p, \; \sigma_\varepsilon^2\right). \tag{2.11}$$
Furthermore, the n observed training data points are assumed to be independent realizations from this statistical model. This implies that the likelihood of the training data factorizes as
$$p(y \mid X, \beta) = \prod_{i=1}^{n} p(y_i \mid x_i, \beta). \tag{2.12}$$
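Putting (2.11) and (2.12) together we get
$$p(y \mid X, \beta) = \frac{1}{(2\pi\sigma_\varepsilon^2)^{n/2}} \exp\left(-\frac{1}{2\sigma_\varepsilon^2} \sum_{i=1}^{n} \left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} - y_i\right)^2\right). \tag{2.13}$$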
From equation (2.8), we aim to maximize the likelihood with respect to β. Because (2.13) depends on β only through the sum in the exponent, which enters with a negative sign, and the exponential function is monotonically increasing, maximizing (2.13) is equivalent to minimizing that sum.
Least squares evaluates the fit of a linear model by summing the squared differences between each observed output y_i and its prediction ŷ_i = β0 + β1 x_i1 + … + βp x_ip, where β0 is the intercept, β1, …, βp are the coefficients, and x_i1, …, x_ip are the input values of observation i. Minimizing this sum of squared residuals over all observations,
$$\operatorname*{minimize}_{\beta_0, \beta_1, \dots, \beta_p} \; \sum_{i=1}^{n} \left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} - y_i\right)^2, \tag{2.14}$$
yields the ordinary least squares (OLS) estimates of the coefficients, which define the best-fit linear relationship in the least squares sense.
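The link between the Gaussian likelihood and the sum of squares can be checked numerically. The sketch below uses made-up data and hypothetical function names; it only illustrates that, for a fixed noise variance, a parameter vector with a smaller sum of squares also has a larger log-likelihood.

```python
import numpy as np

def sum_of_squares(beta, X, y):
    """Sum of squared differences between the predictions X @ beta and y."""
    r = X @ beta - y
    return float(r @ r)

def gaussian_log_likelihood(beta, X, y, sigma2):
    """Logarithm of the Gaussian likelihood with noise variance sigma2."""
    n = len(y)
    return float(-0.5 * n * np.log(2 * np.pi * sigma2)
                 - sum_of_squares(beta, X, y) / (2 * sigma2))

# Made-up data: a column of ones (for the offset) and one input.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=20)

beta_a = np.array([1.0, 2.0])  # close to the parameters used to generate y
beta_b = np.array([0.0, 0.0])  # a much worse candidate

cost_a, cost_b = sum_of_squares(beta_a, X, y), sum_of_squares(beta_b, X, y)
ll_a = gaussian_log_likelihood(beta_a, X, y, sigma2=0.01)
ll_b = gaussian_log_likelihood(beta_b, X, y, sigma2=0.01)
```

The log-likelihood is a decreasing affine function of the sum of squares, so ranking candidates by either quantity gives the same order.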
We will return to how the values β̂0, β̂1, …, β̂p can be computed. First, note that it is possible, and sometimes advantageous, to assume that the distribution of ε is not Gaussian. For instance, ε could be modeled with a Laplace distribution, which would yield a different cost function.
The key feature of that cost function is that it sums the absolute values of the differences between predictions and observed outputs rather than their squares, so the loss is built on an L1 norm. The major advantage of the Gaussian assumption (2.10) is that a closed-form solution exists for β̂0, β̂1, …, β̂p, whereas other assumptions on ε usually require more computationally intensive methods.
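To see where the absolute values come from, here is a short derivation sketch. It assumes the Laplace density is written with a scale parameter b_ε > 0, a symbol introduced here for illustration, so that p(ε) = exp(−|ε|/b_ε) / (2b_ε). With independent data points, the likelihood becomes

```latex
\begin{align*}
p(y \mid X, \beta)
  &= \prod_{i=1}^{n} \frac{1}{2 b_\varepsilon}
     \exp\left(-\frac{\left|\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} - y_i\right|}{b_\varepsilon}\right) \\
  &= \frac{1}{(2 b_\varepsilon)^{n}}
     \exp\left(-\frac{1}{b_\varepsilon} \sum_{i=1}^{n}
     \left|\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} - y_i\right|\right).
\end{align*}
```

By the same monotonicity argument as in the Gaussian case, maximizing this likelihood over β amounts to minimizing the sum of absolute differences between predictions and observed outputs.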
Remark 2.2 With the terminology we will introduce in the next chapter, we could refer to (2.13) as the likelihood function, which we will denote by ℓ(β).
Remark 2.3 It is not uncommon in the literature to skip the maximum likelihood motivation, and just state (2.14) as a (somewhat arbitrary) cost function for optimization.
Least squares and the normal equations
Assuming Gaussian noise ε as stated in (2.10), the maximum likelihood estimate β̂ solves the optimization problem (2.14). In the compact matrix notation of (2.6), this becomes the least squares problem
$$\operatorname*{minimize}_{\beta} \; \|X\beta - y\|_2^2, \tag{2.16}$$
where ‖·‖₂ denotes the Euclidean norm and ‖·‖₂² its square. From a linear algebra perspective, this is the task of finding the vector in the subspace of R^n spanned by the columns of X that is closest to y in Euclidean distance. The solution is the orthogonal projection of y onto that subspace, and the corresponding β̂ can be shown (Section 2.A) to satisfy
$$X^T X \hat{\beta} = X^T y. \tag{2.17}$$
Equation (2.17) is often referred to as the normal equations, and gives the solution to the least squares problem (2.14, 2.16). If X^T X is invertible, which often is the case, β̂ has the closed form
$$\hat{\beta} = (X^T X)^{-1} X^T y. \tag{2.18}$$
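In code, the normal equations are a single linear solve. The following NumPy sketch uses made-up toy data and a hypothetical helper name `fit_least_squares`; it solves the linear system directly instead of forming the inverse explicitly, which is a common numerical choice rather than something prescribed by the notes.

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equations X^T X beta = X^T y for beta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up toy data that follow y = 1 + 2x without noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # first column of ones gives the offset
y = 1.0 + 2.0 * x

beta_hat = fit_least_squares(X, y)  # approximately [1, 2]

# np.linalg.lstsq minimizes ||X beta - y||_2^2 directly and also returns a
# solution when X^T X is not invertible.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```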
The existence of a closed-form solution is significant and perhaps explains why the least squares method has become so popular and widely used. As discussed, assumptions on ε other than Gaussianity lead to problems other than least squares, such as (2.15), for which no closed-form solution exists.
Time to reflect 2.1: What does it mean in practice that X^T X is not invertible?
Figure 2.2 provides a graphical explanation of the least squares criterion, with the input x on the horizontal axis and the output y on the vertical axis. The optimal model, shown as the blue line, is the one that minimizes the sum of squared prediction errors (the orange areas) across all data points. In other words, the line is chosen so that the total orange area, the squared deviations between observed values and model predictions, is as small as possible. This minimization principle is what gives the method its name, least squares.
When the columns of X are linearly independent and p = n − 1, the columns of X span the entire R^n, which guarantees a unique β such that y = Xβ, so the model fits the training data perfectly. In this case X is square and invertible, and (2.17) simplifies to β̂ = X^{-1}y. While a perfect fit may seem appealing, it usually signals overfitting: the model captures every training point at the expense of generalization to unseen data.
By inserting the matrices (2.7) from Example 2.2 into the normal equations (2.17), we obtain β̂0 = −20.1 and β̂1 = 3.1. If we plot the resulting model, it looks like this:
With this model, the predicted stopping distance for x* = 33 mph is ŷ* = 84 feet, and for x* = 45 mph it is ŷ* = 121 feet.
Nonlinear transformations of the inputs – creating more features
The term “linear” in linear regression refers to predicting an output as a linear combination of the inputs. However, what counts as an input is not fixed: beyond the speed itself, we can include nonlinear transformations of the original variable, such as its square, as additional inputs. In fact, arbitrary nonlinear transformations of the original inputs can be used as features in a linear regression model. With a single input variable, for example, adding its square (or other nonlinear transforms) lets the model capture nonlinear relationships while remaining linear in the coefficients.
2 And also the constant 1, corresponding to the offset β0. For this reason, affine would perhaps be a better term than linear.
(a) The maximum likelihood solution for a linear regression model with a 2nd-order polynomial, i.e., with the features x and x^2. The line in the two-dimensional plot is no longer straight (cf. Figure 2.1). This is, however, merely an artifact of the plot: in a three-dimensional plot with x and x^2 on separate axes, the model would still be affine.
(b) The maximum likelihood solution for a linear regression model with a 4th-order polynomial. A 4th-order polynomial has five unknown coefficients, so the model has enough degrees of freedom to fit five data points exactly, as noted in Remark 2.2 (p = n − 1).
Figure 2.3: A linear regression model with 2nd and 4th order polynomials in the input x, as shown in (2.20).
With only one input x, the vanilla linear regression model is
$$y = \beta_0 + \beta_1 x + \varepsilon. \tag{2.19}$$
However, we can also extend the model with, for instance, x^2, x^3, …, x^p as inputs, and thus obtain a linear regression model which is a polynomial in x,
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_p x^p + \varepsilon. \tag{2.20}$$
Even though the inputs are transformed (for example using x, x^2, …, x^p), the model remains a linear regression because the unknown parameters enter linearly. The parameters β are learned in the same way as in standard linear regression, but the design matrix X differs between the two forms (2.19) and (2.20). We refer to the transformed inputs as features. In more complex settings the boundary between the original inputs and the transformed features can become blurred, and the terms feature and input are sometimes used interchangeably.
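A small NumPy sketch of the polynomial model (2.20): only the design matrix changes, while the fitting step stays the same. The data and the helper name `polynomial_design_matrix` are made up for illustration. With five data points and a 4th-order polynomial (five parameters), the design matrix is square and the fit interpolates the data.

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Return a matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(np.asarray(x, dtype=float), N=degree + 1, increasing=True)

# Made-up data: five inputs and outputs.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 0.5, 2.0, 1.5, 3.0])

# 2nd-order polynomial: three parameters, fitted by least squares.
X2 = polynomial_design_matrix(x, 2)
beta2, *_ = np.linalg.lstsq(X2, y, rcond=None)

# 4th-order polynomial: five parameters for five data points, so X4 is square
# and the model reproduces the training outputs up to rounding errors.
X4 = polynomial_design_matrix(x, 4)
beta4 = np.linalg.solve(X4, y)
residual4 = X4 @ beta4 - y
```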
Figure 2.3 shows two linear regression models that use polynomial features, illustrating how a model can remain linear in its parameters while producing a curved-looking relationship in the original x–y space. Whether linear regression can yield a curved line depends on the plot: in a two-dimensional view of x and y the fit can appear nonlinear, but in the expanded space with x and x² (and y) on separate axes, the model is affine. The same idea applies to Figure 2.3(b), where seeing the linearity would require a five-dimensional plot. In brief, nonlinear-looking fits arise from linear models when polynomial features are used, because the straightness is preserved only in the space of the transformed features.
Even though the model in Figure 2.3(b) can fit every data point exactly, higher-order polynomials often fail to generalize in a meaningful way: their behavior between and outside the data points is peculiar and not well motivated by the data. For this reason, higher-order polynomials are rarely used in practical machine learning. A more common and effective alternative is the radial basis function (RBF) kernel K_c(x), defined in (2.21), i.e., a Gauss bell centered around c. It can be used, instead of polynomials, in the linear regression model as
$$y = \beta_0 + \beta_1 K_{c_1}(x) + \beta_2 K_{c_2}(x) + \dots + \beta_p K_{c_p}(x) + \varepsilon. \tag{2.22}$$
This model uses a set of radial basis functions centered at c1, c2, …, cp, with the centers and the length scale ℓ chosen by the user. In linear regression, only the coefficients β0, β1, …, βp are learned from data, while the centers and the length scale remain fixed, as illustrated in Figure 2.4. RBF kernels are generally preferred over polynomials because of their local influence: a small change in one parameter mainly affects the model in the neighborhood of the corresponding center, whereas a similar change in a polynomial model affects the model globally.
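The RBF model (2.22) can be sketched in the same way as the polynomial one. The exact kernel expression is an assumption here: the code uses K_c(x) = exp(−(x − c)²/ℓ), one common way to write a Gauss bell with length scale ℓ, which may differ from (2.21) by a constant in the exponent. The data, centers, and function name are made up.

```python
import numpy as np

def rbf_design_matrix(x, centers, length_scale):
    """Return a matrix with columns 1, K_{c_1}(x), ..., K_{c_p}(x).

    ASSUMPTION: K_c(x) = exp(-(x - c)^2 / length_scale); the notes may scale
    the exponent differently.
    """
    x = np.asarray(x, dtype=float)[:, None]          # shape (n, 1)
    c = np.asarray(centers, dtype=float)[None, :]    # shape (1, p)
    K = np.exp(-(x - c) ** 2 / length_scale)         # shape (n, p)
    return np.column_stack([np.ones(len(x)), K])

# Made-up data and user-chosen centers and length scale.
x = np.linspace(0.0, 10.0, 30)
y = np.sin(x)
centers = np.array([2.0, 4.0, 6.0, 8.0])

Phi = rbf_design_matrix(x, centers, length_scale=2.0)
# Only beta is learned; the centers and the length scale stay fixed.
beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```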
We continue with Example 2.1, but this time we also add the squared speed as a feature, i.e., the features are now x and x^2. This gives the new matrices (cf. (2.7)) in (2.23), and when we insert them into the normal equations (2.17), the new parameter estimates are β̂0 = 1.58, β̂1 = 0.42 and β̂2 = 0.07. (Note that β̂0 and β̂1 change, compared to Example 2.3.) If we plot this new model, it looks like this:
With this model, the predicted stopping distance is 87 feet at 33 mph and 153 feet at 45 mph. This can be compared to Example 2.3, which gives different predictions. Based on the data alone we cannot claim that this is the true model, but a visual comparison with Example 2.3 suggests that the model with more features follows the data slightly better. A systematic method to select between models with different features, beyond visually comparing plots, is cross-validation (see Chapter 5).
Figure 2.4 presents a linear regression model that uses radial basis function (RBF) kernels as features. The four kernels are centered at c1, c2, c3, and c4, and are depicted as dashed gray lines. During training, the parameters β0, β1, …, βp are learned so that the weighted sum of the kernels (the solid blue line) provides the best fit to the data, typically in a least squares sense. This setup enables nonlinear relationships to be modeled within a linear framework by combining multiple RBF kernels.
Polynomials and RBF kernels are among the most well-known nonlinear transformations, but any nonlinear mapping of the inputs is possible. To distinguish them from the original inputs, the transformed inputs are typically called features. When deciding which features to use, one common approach is to compare how models perform with different nonlinear transformations, balancing predictive accuracy, complexity, and practicality.
Qualitative input variables
In regression analysis, the target variable y is numeric, while the input features x can be of any type So far, our discussion has focused on quantitative inputs x, but qualitative (categorical) inputs are also perfectly possible and can be accommodated in regression models.
Consider a qualitative input variable that takes only two values, type A and type B. It can be encoded as a binary dummy variable
$$x = \begin{cases} 0 & \text{if type A} \\ 1 & \text{if type B} \end{cases} \tag{2.24}$$
and used in the linear regression model. This effectively gives us a linear regression model which looks like
$$y = \beta_0 + \beta_1 x + \varepsilon = \begin{cases} \beta_0 + \varepsilon & \text{if type A} \\ \beta_0 + \beta_1 + \varepsilon & \text{if type B} \end{cases} \tag{2.25}$$
The choice of baseline category is arbitrary, so A and B can be swapped without changing the meaning of the model. Other coding schemes are also possible, such as x = 1 or x = −1 instead of the zero/one indicator. The approach generalizes to qualitative input variables with more than two values, for example A, B, C, and D. With four categories, we create three dummy variables (one fewer than the number of categories), with A serving as the reference category:
$$x_1 = \begin{cases} 1 & \text{if type B} \\ 0 & \text{if not type B} \end{cases}, \quad x_2 = \begin{cases} 1 & \text{if type C} \\ 0 & \text{if not type C} \end{cases}, \quad x_3 = \begin{cases} 1 & \text{if type D} \\ 0 & \text{if not type D} \end{cases} \tag{2.26}$$
which, altogether, gives the linear regression model
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon = \begin{cases} \beta_0 + \varepsilon & \text{if type A} \\ \beta_0 + \beta_1 + \varepsilon & \text{if type B} \\ \beta_0 + \beta_2 + \varepsilon & \text{if type C} \\ \beta_0 + \beta_3 + \varepsilon & \text{if type D} \end{cases}$$
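The dummy variable coding with a reference category can be sketched in a few lines of NumPy. The helper name `dummy_encode` and the example observations are hypothetical; the first entry of `categories` plays the role of the reference category A.

```python
import numpy as np

def dummy_encode(values, categories):
    """Return one 0/1 column per category, except the first (reference) one."""
    values = np.asarray(values)
    return np.column_stack([(values == c).astype(float) for c in categories[1:]])

# Made-up observations of a qualitative input with four possible values.
types = ["A", "B", "C", "D", "B"]
Z = dummy_encode(types, ["A", "B", "C", "D"])   # columns: x1 (B), x2 (C), x3 (D)

# Adding a column of ones for the offset gives a design matrix that can be
# used in the normal equations like any other.
X = np.column_stack([np.ones(len(types)), Z])
```

An observation of type A gets zeros in all three dummy columns, so its prediction is the offset β0 alone.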
Qualitative inputs can be handled similarly in other problems and methods as well, such as logistic regression, k-NN, deep learning, etc.
Regularization
Ridge regression
In ridge regression (also known as Tikhonov regularization, L2 regularization, or weight decay) the least squares criterion (2.16) is replaced with the modified minimization problem
$$\operatorname*{minimize}_{\beta_0, \beta_1, \dots, \beta_p} \; \|X\beta - y\|_2^2 + \gamma \|\beta\|_2^2. \tag{2.28}$$
The regularization parameter γ ≥ 0 must be chosen by the user. When γ = 0, we recover the original least squares problem (2.16); as γ → ∞, the penalty drives all coefficients β_j toward zero. In practice, a good γ lies somewhere in between and depends on the specific problem. Its value can be found by manual tuning or, more systematically, using cross-validation.
It is actually possible to derive a version of the normal equations (2.17) for (2.28), namely
$$(X^T X + \gamma I_{p+1}) \hat{\beta} = X^T y, \tag{2.29}$$
where I_{p+1} is the identity matrix of size (p+1)×(p+1). If γ > 0, the matrix X^T X + γI_{p+1} is always invertible, and we have the closed form solution
$$\hat{\beta} = (X^T X + \gamma I_{p+1})^{-1} X^T y. \tag{2.30}$$
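The closed form for ridge regression differs from ordinary least squares only by the added diagonal term. Here is a minimal NumPy sketch with made-up data and a hypothetical function name; as in the criterion above, the penalty is applied to all entries of β, including the offset.

```python
import numpy as np

def fit_ridge(X, y, gamma):
    """Solve (X^T X + gamma I) beta = X^T y for beta."""
    n_params = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(n_params), X.T @ y)

# Made-up data: a column of ones and one input.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.1, 2.9, 5.2, 6.8])

beta_ols = fit_ridge(X, y, gamma=0.0)     # gamma = 0 gives ordinary least squares
beta_ridge = fit_ridge(X, y, gamma=10.0)  # gamma > 0 shrinks the parameters
```

Comparing the Euclidean norms of `beta_ols` and `beta_ridge` shows the shrinkage caused by the penalty.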
LASSO
LASSO, short for Least Absolute Shrinkage and Selection Operator, or equivalently L1 regularization, replaces the least squares criterion (2.16) with
$$\operatorname*{minimize}_{\beta_0, \beta_1, \dots, \beta_p} \; \|X\beta - y\|_2^2 + \gamma \|\beta\|_1, \tag{2.31}$$
where ‖·‖₁ is the Manhattan norm. Unlike ridge regression, there is no closed-form solution for this problem. It is, however, a convex optimization problem that can be solved efficiently by numerical optimization.
As with ridge regression, the regularization parameter γ must be chosen by the user; γ = 0 yields the ordinary least squares problem, and γ → ∞ drives all coefficients β to zero. Between these extremes, ridge regression and LASSO give different solutions: ridge regression shrinks all coefficients β0, β1, …, βp toward small values, while LASSO tends to produce sparse solutions where only a subset of the coefficients are nonzero and the rest are exactly zero. The LASSO solution can thereby ‘switch some of the inputs off’ by setting the corresponding parameters to zero, and it can therefore be used as an input (or feature) selection method.
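One numerical method that can be used for the LASSO problem is cyclic coordinate descent with soft thresholding. The sketch below is only an illustration under that choice of method (the notes do not prescribe a particular solver), with hypothetical function names and a made-up design matrix whose columns are orthogonal, so that the result is easy to check by hand. It penalizes all entries of β, as in the criterion above, and assumes no column of X is identically zero.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly zero if |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def fit_lasso(X, y, gamma, n_iter=200):
    """Cyclic coordinate descent for ||X beta - y||_2^2 + gamma * ||beta||_1."""
    n_params = X.shape[1]
    beta = np.zeros(n_params)
    col_sq = np.sum(X ** 2, axis=0)  # squared norm of each column
    for _ in range(n_iter):
        for j in range(n_params):
            # Residual with the contribution of feature j removed.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # Exact minimizer over beta_j with all other entries held fixed.
            beta[j] = soft_threshold(rho, gamma / 2.0) / col_sq[j]
    return beta

# Made-up data: y depends strongly on the second column, weakly on the third.
X = np.array([[1.0,  1.0,  1.0],
              [1.0, -1.0,  1.0],
              [1.0,  1.0, -1.0],
              [1.0, -1.0, -1.0]])
y = 3.0 * X[:, 1] + 0.1 * X[:, 2]

beta_lasso = fit_lasso(X, y, gamma=2.0)
```

In this toy problem the weak feature is switched off completely, while the strong one is kept but shrunk (from 3 to 2.75).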
Example 2.5: Regularization in a linear regression RBF model
Consider learning a linear regression model (blue line) using p = 8 radial basis function (RBF) kernels as features from n = 9 data points (black dots). Since p = n − 1, the model is expected to fit the data perfectly. Yet, as illustrated in panel (a), this setup overfits: the model adapts too tightly to the training samples and exhibits strange, oscillatory behavior between the data points, despite interpolating the data perfectly at the sample locations.
To address this, ridge regression (b) or LASSO (c) can be used. Although the final models produced by ridge regression and LASSO look similar, their parameters β̂ differ: the LASSO solution effectively uses only 5 of the 8 radial basis functions, i.e., it is a sparse solution. Which approach to prefer depends on the specific problem.
(a) The model learned with least squares (2.16) fits the training data exactly, but that perfect fit is not necessarily desirable. Although it follows the data, the model’s behavior between the data points and beyond the observed range is not plausible, which indicates overfitting: the model is tuned too closely to the data. The estimated parameters β̂ take values around 30 and −30.
(b) The same model, this time learned with ridge regression (2.28) using a fixed value of γ. While not perfectly adapted to the training data, this model provides a more sensible trade-off between fitting the data and avoiding overfitting than (a), and is likely more useful in most situations. The parameter values β̂ are now roughly evenly distributed in the range from −0.5 to 0.5.
(c) The same model again, this time learned with LASSO (2.31) with a certain value of γ. Again, this model is not perfectly adapted to the training data, but it presents a more sensible trade-off between fitting the data and avoiding overfitting than (a), and is likely more useful in most situations. In contrast to (b), the solution is sparse: three of the nine parameters are exactly zero, while the remaining ones lie between −1 and 1.
General cost function regularization
Ridge regression and LASSO are two popular regularization methods for linear regression. Both modify the cost function used to fit the model by adding a penalty that balances data fit against model complexity, and both can be seen as instances of a general regularization scheme,
minimize over β: V(β, X, y) + γ R(β). (2.32)
Different choices of the penalty R(β) are thereby unified in a single framework for improving generalization and reducing overfitting.
In equation (2.32), three key components are present: a data-fit term that measures how well the model matches the observed data, a regularization term that penalizes large parameter values to control model complexity, and a trade-off parameter γ that balances these competing objectives.
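The three components can be made concrete with a small numerical sketch. It assumes a squared-error data-fit term and writes the penalty as a function passed in by the caller. The function names, the synthetic data, and the value γ = 5 are illustrative assumptions, and the closed-form ridge solution used here is the standard one for the cost ||Xβ − y||² + γ||β||², which may be scaled differently from (2.28).

```python
import numpy as np

def regularized_cost(beta, X, y, gamma, penalty):
    """Data-fit term plus gamma times a penalty term."""
    data_fit = np.sum((X @ beta - y) ** 2)      # squared-error data fit
    return data_fit + gamma * penalty(beta)     # gamma trades fit against penalty

def ridge_penalty(beta):
    return np.sum(beta ** 2)                    # squared 2-norm

def lasso_penalty(beta):
    return np.sum(np.abs(beta))                 # 1-norm

def fit_ridge(X, y, gamma):
    """Closed-form minimizer of ||X beta - y||^2 + gamma * ||beta||^2."""
    n_params = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(n_params), X.T @ y)

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=20)

gamma = 5.0
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]  # gamma = 0: plain least squares
beta_ridge = fit_ridge(X, y, gamma)

print("least squares:", np.round(beta_ls, 3))
print("ridge:        ", np.round(beta_ridge, 3))
```

The ridge parameters are shrunk towards zero compared to least squares. The LASSO penalty has no closed-form minimizer in general, so it is only evaluated, not minimized, in this sketch.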
Further reading
Linear regression has a history spanning more than two centuries. It was introduced independently by Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss in 1809, when they discovered the method of least squares. The topic remains central to statistics and machine learning and is covered extensively in major textbooks such as Bishop (2006), Gelman et al. (2013), and Hastie, Tibshirani, and Friedman (2009).
While ordinary least squares has a long history, its regularized variants are comparatively recent. Ridge regression was introduced in statistics by Hoerl and Kennard (1970) and, independently, in numerical analysis under the name Tikhonov regularization; it shrinks the estimate to improve stability in ill-posed settings. The LASSO, proposed by Tibshirani (1996), gives sparse solutions and thereby performs automatic variable selection. For a broad treatment of sparse models and the LASSO, see the monograph by Hastie, Tibshirani, and Wainwright (2015).
2.A Derivation of the normal equations
The normal equations (2.17),
X^T X β̂ = X^T y,
can be derived from (2.16),
β̂ = arg min_β ||Xβ − y||_2^2,
in different ways. We will present one derivation based on (matrix) calculus and one based on geometry and linear algebra.
No matter how (2.17) is derived, if X^T X is invertible, it (uniquely) gives
β̂ = (X^T X)^{−1} X^T y.
If X^T X is not invertible, then (2.17) has infinitely many solutions β̂, all of which are equally good solutions to the problem (2.16).
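For the calculus approach, we write the cost function as
V(β) = ||Xβ − y||_2^2 = (Xβ − y)^T (Xβ − y) = y^T y − 2 y^T Xβ + β^T X^T Xβ, (2.33)
and differentiate V(β) with respect to the vector β,
∂V(β)/∂β = −2 X^T y + 2 X^T Xβ.
Since V(β) is a positive quadratic form, its minimum must be attained where ∂V(β)/∂β = 0, which characterizes the solution β̂ as
X^T X β̂ = X^T y.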
For the linear algebra approach, let X have columns c_1, ..., c_{p+1}. The vector β that minimizes the squared residual ||Xβ − y||_2^2 is precisely the choice for which Xβ is the orthogonal projection of y onto the subspace spanned by {c_1, ..., c_{p+1}}, and this projection is obtained by solving the normal equations X^T X β = X^T y. We show this as follows.
Let us decompose y as y = y⊥ + y∥, where y⊥ is orthogonal to the subspace spanned by all columns c_i and y∥ lies in that subspace. Since y⊥ is orthogonal to both y∥ and Xβ, the Pythagorean theorem gives ||Xβ − y||_2^2 = ||(Xβ − y∥) − y⊥||_2^2 = ||Xβ − y∥||_2^2 + ||y⊥||_2^2 ≥ ||y⊥||_2^2, with equality exactly when Xβ = y∥.
Hence, the squared residual ||Xβ − y||_2^2 is minimized when β is chosen such that Xβ = y∥, that is, when Xβ is the orthogonal projection of y onto the subspace spanned by the columns of X. The least-squares solution β̂ is therefore characterized by the residual y − Xβ̂ = y⊥ being orthogonal to every column c_j, i.e., (y − Xβ̂)^T c_j = 0 for j = 1, ..., p + 1
(remember that two vectors u, v are, by definition, orthogonal if their scalar product, u^T v, is 0). Since the columns c_j together form the matrix X, we can write this compactly as
(y − Xβ̂)^T X = 0, (2.39)
where the right-hand side is the (p + 1)-dimensional zero vector. This can equivalently be written as
X^T X β̂ = X^T y.
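Both derivations can be checked numerically. The following sketch uses synthetic data (an assumption for illustration), solves the normal equations, and then verifies that the gradient of V(β) vanishes at β̂ and that the residual is orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
# n x (p+1) matrix; the first column of ones corresponds to the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

# Solve the normal equations X^T X beta = X^T y (assumes X^T X is invertible)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Calculus view: the gradient -2 X^T y + 2 X^T X beta vanishes at beta_hat
gradient = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

# Geometric view: the residual y - X beta_hat is orthogonal to all columns of X
residual = y - X @ beta_hat
orthogonality = X.T @ residual

print(np.round(gradient, 10), np.round(orthogonality, 10))
```

In practice a least-squares routine such as `np.linalg.lstsq` is usually preferred over forming X^T X explicitly, since it is numerically more robust. It returns the same β̂ when X^T X is invertible.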
3 The classification problem and three parametric classifiers
Classification differs from regression in that it produces qualitative outputs, while regression produces quantitative outputs. A classifier is any method that performs classification, and our first classifier will be logistic regression. This chapter also introduces linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). More advanced classifiers, such as classification trees, boosting, and deep learning, will be covered in later chapters.
The classification problem
Classification is the task of predicting a qualitative output from inputs of arbitrary types (quantitative and/or qualitative). Since the output is qualitative, it can only take values from a finite set, whose size we denote by K, the number of output classes. The set of possible values can be as simple as {false, true} (K = 2) or as diverse as {Sweden, Norway, Finland, Denmark} (K = 4). These values are called classes or labels. We assume throughout that K is known. For compact notation we label the classes with the integers 1, 2, ..., K; the choice of integers is arbitrary and implies no ordering of the classes.
Binary classification is the two-class case (K = 2). In this setting, we typically use the labels 0 and 1 (instead of 1 and 2), and we refer to the two classes as the positive class (k = 1) and the negative class (k = 0). This choice of labels is made purely for mathematical convenience, since it simplifies formulas and algorithms.
Classification is the task of predicting the output from the input, framed here as predicting class probabilities p(y|x) for y in {1, 2, ..., K}. We use p(y|x) to denote probability masses (when y is a discrete class label) as well as probability densities (when y is a continuous outcome). In words, p(y|x) is the probability that the output y, a class label, takes a particular value given the input x, and this probability will be a central concept in our discussion. We model y as a random variable, reflecting the inherent randomness of the real-world data-generating process, much like the noise term ε in regression.
1 In Chapter 6 we will use k = 1 and k = −1 instead.
Example 3.1: Modeling voting behavior (randomness in the class label y)
To describe voting preferences y across different population groups x, we must acknowledge that not every individual in a given group votes for the same party. A probabilistic approach treats y as a random variable with a distribution that depends on x, i.e., a conditional probability model p(y | x). For instance, among 45-year-old women, the probability of voting cerise is 0.13, turquoise 0.39, and purple 0.48, which can be written as p(y = cerise | x = 45-year-old women) = 0.13, p(y = turquoise | x = 45-year-old women) = 0.39, p(y = purple | x = 45-year-old women) = 0.48.
In this way, we use probabilities p(y|x) to describe the non-trivial fact that
(a) not all 45-year-old women vote for the same party, but
(b) the choice of party among 45-year-old women is not completely random either: the purple party is the most popular choice, while the cerise party is the least popular.
The number of output classes in this example is K = 3.
Logistic regression
Learning the logistic regression model from training data
By applying the logistic function, linear regression is transformed into logistic regression, which turns a regression method into a classification method. The price of this shift is that we cannot use the convenient normal equations to learn β, as we could in linear regression. As in linear regression, we learn the parameter vector β from the training data T = {(x_i, y_i)}_{i=1}^n using a maximum likelihood approach, β̂ = arg max_β ℓ(β), where ℓ(β) denotes the likelihood of the training data as a function of β.
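Let us now work out a detailed expression for the likelihood function,²
ℓ(β) = ∏_{i=1}^n p(y_i | x_i; β) = ∏_{i: y_i = 1} e^{β^T x_i} / (1 + e^{β^T x_i}) · ∏_{i: y_i = 0} 1 / (1 + e^{β^T x_i}).
This is the function which we would like to optimize with respect to β, cf. (3.5). For numerical reasons, it is often better to optimize the logarithm of ℓ(β) (since the logarithm is a monotone function, the maximizing argument is the same),
log ℓ(β) = Σ_{i: y_i = 1} [β^T x_i − log(1 + e^{β^T x_i})] − Σ_{i: y_i = 0} log(1 + e^{β^T x_i}) = Σ_{i=1}^n [y_i β^T x_i − log(1 + e^{β^T x_i})].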
The simplification in the second equality relies on the chosen labeling, that y_i = 0 or y_i = 1, which is indeed the reason why this labeling is convenient.
A necessary condition for the maximum of log ℓ(β) is that its gradient is zero,
∇_β log ℓ(β) = Σ_{i=1}^n x_i (y_i − e^{β^T x_i} / (1 + e^{β^T x_i})) = 0.
2 We now add β to the expression p(y | x), to explicitly show its dependence also on β.
Note that this equation is vector-valued, i.e., we have a system of p + 1 equations to solve (with p + 1 unknown elements of the vector β). Contrary to the linear regression model (with Gaussian noise) in Section 2.3.1, this maximum likelihood problem results in a nonlinear system of equations, lacking a general closed-form solution. Instead, we are forced to use a numerical solver, as discussed in Appendix B. The standard choice is to use the Newton–Raphson algorithm (equivalent to the so-called iteratively reweighted least squares algorithm), see e.g. Hastie, Tibshirani, and Friedman 2009, Chapter 4.4.
Algorithm 1:Logistic regression for binary classification
Data: Training data {x_i, y_i}_{i=1}^n (with output classes y = 0, 1) and test input x*
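As a rough illustration of the learning and prediction steps, the sketch below maximizes the log-likelihood with plain Newton–Raphson iterations and then predicts the class of a test input by thresholding p(y = 1 | x*) at 0.5. The function names, the fixed number of iterations, and the small one-dimensional data set are assumptions made for the example; the classes in the data overlap, since the maximum likelihood estimate does not exist for perfectly separable data.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))          # same as e^z / (1 + e^z)

def fit_logistic_regression(X, y, n_iter=25):
    """Newton-Raphson for the log-likelihood.
    X: n x (p+1) with a leading column of ones, y: labels in {0, 1}."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        prob = logistic(X @ beta)             # p(y = 1 | x_i) for the current beta
        gradient = X.T @ (y - prob)           # gradient of the log-likelihood
        hessian = -(X.T * (prob * (1.0 - prob))) @ X
        beta = beta - np.linalg.solve(hessian, gradient)   # Newton-Raphson step
    return beta

def predict(beta, x_star):
    """Return the predicted label and p(y = 1 | x_star)."""
    prob_1 = logistic(x_star @ beta)
    return (1 if prob_1 > 0.5 else 0), prob_1

# Small illustrative data set with overlapping classes (hypothetical)
x = np.array([-3.0, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0])
y_train = np.array([0, 0, 1, 0, 0, 1, 0, 1, 1], dtype=float)
X_train = np.column_stack([np.ones_like(x), x])

beta_hat = fit_logistic_regression(X_train, y_train)
label, prob = predict(beta_hat, np.array([1.0, 3.0]))
print(beta_hat, label, prob)
```

A production implementation would add a convergence check and some safeguard (such as a step-size rule or regularization) for nearly separable data.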
Decision boundaries for logistic regression
Logistic regression models the class probabilities p(y=0|x) and p(y=1|x). To use it for predicting the label of a test input x*, we first learn the parameter vector β from the training data, then compute p(y=0|x*) and p(y=1|x*), and finally take the most probable label as the prediction, ŷ* = arg max_{k ∈ {0,1}} p(y=k|x*).
This is illustrated in Figure 3.2 for a one-dimensional input x.
A classifier maps every possible input x to a predicted label, and thereby partitions the input space into regions that share the same prediction. The boundary separating these regions, where the predicted class changes, is called the decision boundary. For logistic regression, the decision boundary is illustrated for a one-dimensional input in Figure 3.2 and for a two-dimensional input in Figure 3.3.
We can find the decision boundary by solving the equation
p(y = 1 | x) = p(y = 0 | x), (3.11)
which with logistic regression gives
e^{β^T x} / (1 + e^{β^T x}) = 1 / (1 + e^{β^T x}) ⇔ e^{β^T x} = 1 ⇔ β^T x = 0.
The equation β^T x = 0 parameterizes a (linear) hyperplane. Hence, the decision boundaries in logistic regression always have the shape of a (linear) hyperplane.
We distinguish between different types of classifiers by the shape of their decision boundary: since logistic regression only has linear decision boundaries, it is called a linear classifier.
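For a two-dimensional input, the hyperplane β^T x = 0 is a straight line that can be solved for x_2. The sketch below uses a made-up parameter vector (an assumption, not a value from these notes) and checks that p(y = 1 | x) equals 0.5 everywhere on that line.

```python
import numpy as np

beta = np.array([-1.0, 2.0, 0.5])   # hypothetical [beta_0, beta_1, beta_2]

def prob_class_1(x1, x2):
    z = beta[0] + beta[1] * x1 + beta[2] * x2     # beta^T x with x = [1, x1, x2]
    return np.exp(z) / (1.0 + np.exp(z))

def boundary_x2(x1):
    """Solve beta^T x = 0 for x2 (requires beta_2 != 0)."""
    return -(beta[0] + beta[1] * x1) / beta[2]

x1_grid = np.linspace(-2.0, 2.0, 5)
probs_on_boundary = prob_class_1(x1_grid, boundary_x2(x1_grid))
print(probs_on_boundary)   # both classes are equally probable on the boundary
```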
Figure 3.2 shows binary classification with a single scalar input x on the horizontal axis. After learning the parameter β from the training data (not shown), logistic regression provides a model for p(y = 1 | x) (blue) and p(y = 0 | x) (red) for any test input x. To convert these probabilities into an actual class prediction ŷ ∈ {0, 1}, we select the class with the higher probability. The location where the predicted class switches from one label to the other is the decision boundary, shown as a dashed vertical line.
(a) Logistic regression for K = 2 classes always gives a linear decision boundary. The red dots and green circles are training data from the two classes, and the border between the red and green regions is the decision boundary learned by the logistic regression classifier from the training data.
(b) Logistic regression for K = 3 classes. We have now introduced training data from a third class, marked with blue crosses. The decision boundary between any pair of classes is still linear.
Figure 3.3: Examples of decision boundaries for logistic regression.
Logistic regression for more than two classes
Logistic regression can be extended to multi-class classification with more than two classes (K > 2). There are several ways to generalize logistic regression to the multi-class problem, and we will follow one approach that also proves useful in deep learning (Chapter 7). This generalization proceeds in two steps: first, we introduce one-hot encoding for the class labels, and second, we replace the logistic function with the softmax function to produce multi-class probabilities.
One-hot encoding is a simple, widely used way to represent a categorical output as a binary vector. Instead of encoding the output y_i as an integer in {1, ..., K} (vanilla encoding), we replace it with a K-dimensional vector y_i whose k-th element is 1 if the original class is k and whose other elements are all 0. For example, with K = 3:
Vanilla encoding    One-hot encoding
y_i = 1             y_i = [1 0 0]^T
y_i = 2             y_i = [0 1 0]^T
y_i = 3             y_i = [0 0 1]^T
Since we now have a vector-valued output y, we also need a vector-valued alternative to the logistic function. To this end, we introduce the vector-valued softmax function,
softmax(z) = (1 / Σ_{j=1}^K e^{z_j}) [e^{z_1}, e^{z_2}, ..., e^{z_K}]^T.
Here z = [z_1, z_2, ..., z_K]^T is a K-dimensional input vector. The softmax function maps its elements (often called logits) to a probability distribution over the K classes: each output lies in [0, 1] and the elements of the output vector sum to 1. Just as binary classification combines linear regression with the logistic function to model a single probability, multi-class classification combines linear regression with the softmax function to model the probabilities of all classes. The probability of class i is p_i = exp(z_i) / Σ_{j=1}^K exp(z_j).
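A minimal sketch of the softmax function follows. Subtracting the largest element of z before exponentiating is a common numerical precaution (an addition of this example, not something discussed in these notes); it does not change the result, since the common factor cancels in the ratio.

```python
import numpy as np

def softmax(z):
    """Map a K-dimensional vector z to K probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)          # avoids overflow; the result is unchanged
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

probs = softmax([2.0, 1.0, -1.0])
print(probs, probs.sum())
```

For K = 2 and z = [a, 0], the first element of softmax(z) equals e^a / (1 + e^a), so the logistic function is recovered as a special case.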
Multi-class logistic regression uses a separate parameter vector β_k for each class k = 1, ..., K, so the total number of parameters grows with the number of classes. As in binary logistic regression, these parameters are learned by maximum likelihood. Let θ denote the entire collection of class-specific parameters β_1, ..., β_K. The log-likelihood can be written as log p(y | X; θ) = Σ_{i=1}^n log p(y_i | x_i; θ) = Σ_{i: y_i = 1} log p(1 | x_i; θ) + Σ_{i: y_i = 2} log p(2 | x_i; θ) + ... + Σ_{i: y_i = K} log p(K | x_i; θ).
With one-hot encoding, equation (3.16) expresses the log-likelihood compactly as a double sum over data points and classes, Σ_{i=1}^n Σ_{k=1}^K y_ik log p(k | x_i; θ), with y_ik denoting the k-th element of the one-hot vector y_i. We will not dwell on further details here, but, as in the binary case, this likelihood can serve as the objective function in numerical optimization. The form (3.16) arises whenever one-hot encoding is used and (with a minus sign, as a cost to be minimized) is commonly referred to as cross-entropy.
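The double sum is easy to evaluate once the labels are one-hot encoded. In the sketch below, the matrix P of predicted class probabilities is made up for illustration (in practice each row would come from the softmax model), and the helper names are assumptions of this example.

```python
import numpy as np

def one_hot(labels, K):
    """Encode labels in {1, ..., K} as rows of an n x K binary matrix."""
    labels = np.asarray(labels)
    Y = np.zeros((len(labels), K))
    Y[np.arange(len(labels)), labels - 1] = 1.0
    return Y

def log_likelihood(Y, P):
    """sum_i sum_k y_ik * log p(k | x_i); P must be strictly positive."""
    return np.sum(Y * np.log(P))

labels = [1, 3, 2]
P = np.array([[0.7, 0.2, 0.1],    # row i holds p(k | x_i) for k = 1, 2, 3
              [0.1, 0.3, 0.6],
              [0.2, 0.5, 0.3]])
Y = one_hot(labels, K=3)
ll = log_likelihood(Y, P)          # log 0.7 + log 0.6 + log 0.5
cross_entropy = -ll                # the quantity minimized in practice
print(ll, cross_entropy)
```

Because each row of Y has a single 1, only the log-probability of the observed class of each data point contributes to the sum.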
Linear and quadratic discriminant analysis (LDA & QDA)
Using Gaussian approximations in Bayes’ theorem
From probability theory, Bayes’ theorem might be familiar, which says that
p(y | x) = p(x | y) p(y) / Σ_{k=1}^K p(x | k) p(k).
In classification, what we care about is p(y|x), the left-hand side. In a practical machine learning problem, neither p(y|x) nor p(x|y) is known; we only have training data and no explicit equations to guide us. Logistic regression tackles this by modeling p(y|x) directly, as in (3.4). By contrast, LDA and QDA work with the right-hand side by assuming that p(x|y) is Gaussian, regardless of what the data actually look like. Since p(x|y) is a distribution over the input x, and x is typically a vector, it is modeled as a multivariate Gaussian with a mean vector and a covariance matrix.
To build a classifier, we learn the parameters of the Gaussians from the training data: the mean vectors μ_k and the covariance matrices Σ_k. Since these parameters are learned from data, we write the estimates with hats, μ̂_k and Σ̂_k. In LDA the means can differ between classes but the covariance matrix is assumed to be the same for all classes (so we write Σ̂ without a class index), whereas in QDA both the means and the covariance matrices can differ between classes. Bayes’ theorem also includes the prior p(y), which is usually unknown, so we approximate p(y = k) by the observed class frequency π̂_k in the training data, p(y = k) ≈ π̂_k. For example, if 22% of the training data have label 1, then π̂_1 = 0.22.
Thus, in LDA p(y|x) is modeled as
p(y = k | x*) = π̂_k N(x* | μ̂_k, Σ̂) / Σ_{j=1}^K π̂_j N(x* | μ̂_j, Σ̂), (3.18)
for k = 1, 2, ..., K. This is what is shown in Figure 3.4.
3 Not to be confused with Latent Dirichlet Allocation, which is a completely different machine learning method.
4 TODO: This actually assumes that x is quantitative. How are qualitative inputs handled in LDA/QDA?
Legend (axes x_1 and x_2): training data points x_i with label y_i = 0; training data points x_i with label y_i = 1.
(a) Linear discriminant analysis (LDA) assumes that the input vector x, conditioned on the class label y, follows a Gaussian distribution. Each class y has its own mean μ_y, while all classes share the same covariance matrix Σ. This gives Gaussian class-conditional densities with different centers but identical spread, which shapes the decision boundaries of LDA. The plot illustrates, in the two-dimensional input space (x_1, x_2), the assumption made about the training data when LDA is derived.
Legend: level curves for p(x | y = 0); level curves for p(x | y = 1); training data points x_i with label y_i = 0; training data points x_i with label y_i = 1.
(b) Quadratic discriminant analysis (QDA) assumes that the input vector x, conditioned on the class y, follows a Gaussian distribution. Unlike LDA, QDA allows both the mean vector and the covariance matrix to vary across classes, giving each class its own Gaussian shape. This flexibility enables QDA to model class-specific spread and orientation of the data. The plot illustrates the assumption made about the training data when QDA is derived.
Figure 3.4: LDA and QDA are derived by assuming that p(x|y) is Gaussian, which means that the inputs are treated as random variables with a certain distribution. In LDA, the covariance matrix is the same for all classes, so the class-conditional distributions have identical shape and differ only in their means. In QDA, each class can have its own covariance matrix, allowing different shapes for the class-conditional distributions, which are still Gaussian.
In fact, when using LDA and QDA in practice, these assumptions on how the inputs x are distributed are rarely satisfied, but this is nevertheless the way we motivate the methods.
In full analogy, for QDA we have (the only difference is the covariance matrix Σ̂_k)
p(y = k | x*) = π̂_k N(x* | μ̂_k, Σ̂_k) / Σ_{j=1}^K π̂_j N(x* | μ̂_j, Σ̂_j), (3.19)
for k = 1, 2, ..., K.
There is no restriction on K, and both LDA and QDA can be applied to binary as well as multi-class classification. We will now discuss several aspects in more detail to clarify how these methods work and when to use each of them in practice.
Using LDA and QDA in practice
We derived LDA and QDA by studying Bayes’ theorem. We are ultimately interested in the left-hand side, p(y|x), which we reach via the right-hand side by assuming that the class-conditional distribution p(x|y) is Gaussian. In practice, this Gaussian assumption often does not hold or is hard to verify, yet LDA and QDA remain useful classifiers even when the assumption is violated.
To learn an LDA or QDA classifier from labeled training data {(x_i, y_i)} without knowing the true p(x|y), we estimate the necessary parameters from the data and use those estimates to make predictions. Compute the class means μ_y and, for LDA, a pooled covariance matrix Σ (or, for QDA, a separate Σ_y per class), and estimate the class priors π_y from the training labels. Then form the discriminant functions, δ_y(x) = x^T Σ^{−1} μ_y − 0.5 μ_y^T Σ^{−1} μ_y + log π_y for LDA and δ_y(x) = −0.5 log|Σ_y| − 0.5 (x − μ_y)^T Σ_y^{−1} (x − μ_y) + log π_y for QDA, and predict ŷ = arg max_y δ_y(x). This plug-in approach gives predictions without knowing p(x|y) a priori. In practice, care is needed with covariance estimation, with regularization in high dimensions, and with validation (e.g., cross-validation) to ensure good generalization.
In LDA and QDA, the parameters π̂_k, μ̂_k, and Σ̂ (for LDA) or Σ̂_k (for QDA) must be learned from the training data for each class k = 1, ..., K. The most straightforward parameter to estimate is the class prior π̂_k, which is the relative frequency of class k in the training data,
π̂_k = n_k / n, (3.20a)
where n_k is the number of training samples from class k and n is the total number of training samples.
Consequently, all n_k must sum to n, and thereby Σ_k π̂_k = 1. Further, the mean vector μ_k of each class is learned as
μ̂_k = (1 / n_k) Σ_{i: y_i = k} x_i, (3.20b)
the empirical mean among all training samples of class k. For LDA, the common covariance matrix Σ for all classes is usually learned as
Σ̂ = (1 / (n − K)) Σ_{k=1}^K Σ_{i: y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T, (3.20c)
which can be shown to be an unbiased estimate of the covariance matrix.⁵ For QDA, one covariance matrix Σ_k has to be learned for each class k = 1, ..., K, usually as
Σ̂_k = (1 / (n_k − 1)) Σ_{i: y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T, (3.20d)
which similarly can be shown to be an unbiased estimate.
Remark 3.3 To derive the learning of LDA and QDA, we did not make use of the maximum likelihood idea, in contrast to linear and logistic regression. Furthermore, learning LDA and QDA amounts to inserting the training data into the closed-form expressions (3.20), similar to linear regression (the normal equations), but different from logistic regression (which requires numerical optimization).
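The closed-form expressions (3.20) translate almost line by line into code. The sketch below assumes labels in {1, ..., K} and at least two training samples per class; the function name and the tiny data set are made up for illustration.

```python
import numpy as np

def fit_discriminant(X, y, K):
    """Learn priors, class means, the pooled (LDA) covariance and the
    per-class (QDA) covariances from X (n x p) and labels y in {1, ..., K}."""
    n, p = X.shape
    priors = np.zeros(K)
    means = np.zeros((K, p))
    class_covs = np.zeros((K, p, p))
    pooled_cov = np.zeros((p, p))
    for k in range(1, K + 1):
        X_k = X[y == k]                            # training inputs of class k
        n_k = X_k.shape[0]
        priors[k - 1] = n_k / n                    # (3.20a)
        means[k - 1] = X_k.mean(axis=0)            # (3.20b)
        centered = X_k - means[k - 1]
        scatter = centered.T @ centered            # sum of outer products
        pooled_cov += scatter
        class_covs[k - 1] = scatter / (n_k - 1)    # (3.20d)
    pooled_cov /= (n - K)                          # (3.20c)
    return priors, means, pooled_cov, class_covs

# Tiny illustrative data set (hypothetical)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0], [5.0, 5.0]])
y = np.array([1, 1, 1, 2, 2, 2, 2])
priors, means, pooled_cov, class_covs = fit_discriminant(X, y, K=2)
print(priors, means, pooled_cov, sep="\n")
```

The per-class estimate (3.20d) coincides with what `np.cov` computes with its default normalization, which gives a convenient cross-check.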
After learning the parameters π̂_k, μ̂_k, and Σ̂ (LDA) or Σ̂_k (QDA) for every class k = 1, ..., K, we have a complete model for p(y|x), as specified in (3.18) and (3.19). This model can be used to predict the label of a new input x* by turning the posterior into a concrete prediction, just as in logistic regression. The predicted class is ŷ* = arg max_k p(y = k | x*), i.e., the class with the highest posterior probability.
We summarize this in Algorithms 2 and 3, and illustrate it in Figures 3.5 and 3.6.
Algorithm 2:Linear Discriminant Analysis, LDA
Data: Training data {x_i, y_i}_{i=1}^n (with output classes k = 1, ..., K) and test input x*
8 Find largest p(y = k | x*) and set ŷ* to that k
5 This means that if we estimate Σ̂ like this for new training data over and over again, the average would be the true covariance matrix of p(x).
Algorithm 3:Quadratic Discriminant Analysis, QDA
Data: Training data {x_i, y_i}_{i=1}^n (with output classes k = 1, ..., K) and test input x*
7 Find largest p(y = k | x*) and set ŷ* to that k
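The prediction step shared by the two algorithms can be sketched as follows: evaluate the posterior (3.18) or (3.19) for every class and pick the largest. The parameter values below are hypothetical stand-ins for learned estimates, and the Gaussian density is written out by hand to keep the example self-contained. For LDA the same covariance matrix is simply passed for every class.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian N(x | mean, cov)."""
    x = np.atleast_1d(x).astype(float)
    mean = np.atleast_1d(mean).astype(float)
    cov = np.atleast_2d(cov).astype(float)
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** len(mean) * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def posterior(x_star, priors, means, covs):
    """p(y = k | x_star) for all classes; covs holds one matrix per class."""
    joint = np.array([pi * gaussian_pdf(x_star, m, S)
                      for pi, m, S in zip(priors, means, covs)])
    return joint / joint.sum()

def classify(x_star, priors, means, covs):
    post = posterior(x_star, priors, means, covs)
    return int(np.argmax(post)) + 1, post          # classes are numbered 1..K

# Hypothetical learned parameters, K = 2
priors = [0.5, 0.5]
lda_means = [[0.0, 0.0], [2.0, 2.0]]
lda_covs = [np.eye(2), np.eye(2)]                  # LDA: shared covariance
qda_means = [[0.0, 0.0], [0.0, 0.0]]
qda_covs = [np.eye(2), 9.0 * np.eye(2)]            # QDA: class-specific covariance

print(classify([0.2, 0.1], priors, lda_means, lda_covs))
print(classify([5.0, 5.0], priors, qda_means, qda_covs))
```

In the QDA example the two classes have the same mean, so they can only be told apart through their different spread, something LDA cannot represent.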
Figure 3.5 presents LDA for three classes (K = 3) with a one-dimensional input x. Each class is modeled by a Gaussian p(x|k) with mean μ̂_k and a common variance Σ̂, and these parameters are learned from training data. The prior p(k) is approximated by π̂_k, shown in the upper-right panel, and these components are combined through Bayes’ theorem to compute the posterior p(k|x), as displayed in the bottom panel. The predicted class is the one with the highest posterior probability, i.e., the topmost solid colored curve in the bottom plot; for example, x ≈ 0.7 gives the predicted class k = 2 (green). The decision boundaries are the vertical dotted lines in the bottom plot, located where the topmost solid colored curves intersect.
Figure 3.6 illustrates QDA for three classes, laid out in the same way as Figure 3.5. Unlike in LDA, the learned variances Σ̂_k of p(x|k) differ between the classes, as shown in the upper-left panel. These class-dependent variances make the decision boundaries in the bottom panel more intricate than those produced by LDA, as exemplified by a small slice of the blue class 3 lying between the red class 1 and the green class 2 around x = −0.5.
Decision boundaries for LDA and QDA
Once the parameters have been learned from training data, we classify a new input x by computing (3.18) for each class k and selecting the class with the highest posterior probability p(y|x). Moreover, (3.18) and (3.19) are simple enough to allow a pencil-and-paper analysis of the decision boundary, the boundary in the input space where the predicted class changes from one label to another.
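Note that neither the logarithm nor terms independent of k change the location of the maximizing argument (arg max_k), so for LDA we have
ŷ_LDA = arg max_k p(y = k | x) = arg max_k log p(y = k | x)
= arg max_k [log π̂_k + log N(x | μ̂_k, Σ̂)]
= arg max_k [log π̂_k − (1/2) log det(2πΣ̂) − (1/2) (x − μ̂_k)^T Σ̂^{−1} (x − μ̂_k)]
= arg max_k [log π̂_k − (1/2) μ̂_k^T Σ̂^{−1} μ̂_k + x^T Σ̂^{−1} μ̂_k]
= arg max_k δ_k^LDA(x), where δ_k^LDA(x) = log π̂_k − (1/2) μ̂_k^T Σ̂^{−1} μ̂_k + x^T Σ̂^{−1} μ̂_k.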
In LDA, the function in the last row, δ_k^LDA(x), is the discriminant function used to classify x. The decision boundary between two predictions, say k = 0 and k = 1, is the set of points x for which δ_0^LDA(x) = δ_1^LDA(x), that is,
log π̂_0 − (1/2) μ̂_0^T Σ̂^{−1} μ̂_0 + x^T Σ̂^{−1} μ̂_0 = log π̂_1 − (1/2) μ̂_1^T Σ̂^{−1} μ̂_1 + x^T Σ̂^{−1} μ̂_1
⇔ x^T Σ̂^{−1} (μ̂_0 − μ̂_1) = log π̂_1 − log π̂_0 − (1/2) (μ̂_1^T Σ̂^{−1} μ̂_1 − μ̂_0^T Σ̂^{−1} μ̂_0).
From linear algebra, we know that the set {x : x^T A = c} defines a hyperplane in the x-space. Consequently, the decision boundary for LDA is always linear, which explains the name linear discriminant analysis.
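For QDA we can do a similar derivation,
ŷ_QDA = arg max_k [log π̂_k − (1/2) log det Σ̂_k − (1/2) μ̂_k^T Σ̂_k^{−1} μ̂_k + x^T Σ̂_k^{−1} μ̂_k − (1/2) x^T Σ̂_k^{−1} x] = arg max_k δ_k^QDA(x), (3.25)
and set δ_0^QDA(x) = δ_1^QDA(x) to find the decision boundary as the set of points x for which
log π̂_0 − (1/2) log det Σ̂_0 − (1/2) μ̂_0^T Σ̂_0^{−1} μ̂_0 + x^T Σ̂_0^{−1} μ̂_0 − (1/2) x^T Σ̂_0^{−1} x = log π̂_1 − (1/2) log det Σ̂_1 − (1/2) μ̂_1^T Σ̂_1^{−1} μ̂_1 + x^T Σ̂_1^{−1} μ̂_1 − (1/2) x^T Σ̂_1^{−1} x.
This is now of the form {x : x^T A + x^T B x = c}, a quadratic form, and the decision boundary for QDA is thus always quadratic (and thereby also nonlinear!), which is the reason for its name, quadratic discriminant analysis.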
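The linear LDA boundary can be verified numerically. With made-up parameter values for two classes (assumptions for illustration), the sketch below forms the hyperplane x^T a = c with a = Σ̂^{−1}(μ̂_0 − μ̂_1), constructs two points on it, and checks that the two discriminant functions agree there.

```python
import numpy as np

# Hypothetical learned LDA parameters for classes 0 and 1
pi_0, pi_1 = 0.4, 0.6
mu_0 = np.array([0.0, 0.0])
mu_1 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)

def delta_lda(x, pi, mu):
    """LDA discriminant: log pi - 0.5 mu^T Sigma^-1 mu + x^T Sigma^-1 mu."""
    return np.log(pi) - 0.5 * mu @ Sigma_inv @ mu + x @ Sigma_inv @ mu

# Decision boundary written as the hyperplane x^T a = c
a = Sigma_inv @ (mu_0 - mu_1)
c = (np.log(pi_1) - np.log(pi_0)
     - 0.5 * (mu_1 @ Sigma_inv @ mu_1 - mu_0 @ Sigma_inv @ mu_0))

# Two points on the hyperplane: one along a, one shifted orthogonally to a
x_on = c * a / (a @ a)
x_on_shifted = x_on + 3.0 * np.array([-a[1], a[0]])

for x in (x_on, x_on_shifted):
    print(delta_lda(x, pi_0, mu_0) - delta_lda(x, pi_1, mu_1))   # close to 0
```

Moving along the boundary (orthogonally to a) leaves both discriminants equal, which is exactly what it means for the boundary to be a straight line in two dimensions.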
(a) LDA for K = 2 classes always gives a linear decision boundary. The red dots and green circles are training data from the two classes, and the line where the red and green regions meet is the decision boundary learned by the LDA classifier from the training data. The axes are x_1 and x_2.
(b) LDA for K = 3 classes. We have now introduced training data from a third class, marked with blue crosses. The decision boundary between any pair of classes is still linear.
(c) QDA has quadratic (i.e., nonlinear) decision boundaries, as in this example where a QDA classifier is learned from the shown training data.
(d) With K = 3 classes, the decision boundaries for QDA are possibly more complex than with LDA, as in this case (cf. (b)).
Figure 3.7: Decision boundaries for LDA and QDA, to be compared with Figure 3.3, which shows the decision boundaries for logistic regression learned from the same training data. LDA and logistic regression both have linear decision boundaries, but the boundaries are not identical.
(a) With K = 2 classes, Bayes’ classifier tells us to take the class which has probability > 0.5 as the prediction ŷ. Here, the prediction would therefore be ŷ = 1.
(b) With K = 4 classes, Bayes’ classifier tells us to take the class with the highest probability as the prediction, which in this case gives ŷ = 4. In contrast to the case K = 2, it can happen that no class has a probability greater than 0.5.
Bayes’ classifier: a theoretical justification for turning p(y | x) into ŷ
Bayes’ classifier
To design a classifier that, on average, makes as few misclassifications as possible, we seek a decision rule for which the predicted label ŷ matches the true label y for as many test samples as possible. If we knew the probabilities p(y|x) exactly, the optimal classifier would predict
ŷ = arg max_k p(y = k | x) (3.27)
for each input x. In practice, however, we only have a model, an estimate, of p(y|x); the true probabilities are never known exactly. This is the situation for logistic regression, LDA, QDA, and other classifiers, where the goal is to approximate p(y|x) as well as possible and predict the class with the highest estimated probability.
In other words, an optimal classifier selects the label with the highest probability given the input x. The decision rule (3.27) is called Bayes’ classifier and is illustrated in Figure 3.8. We first establish why this rule is optimal, and then discuss how it connects to other classifiers.
Optimality of Bayes’ classifier
To maximize the average number of correct predictions, we want the prediction ŷ (which is not random, since we choose it) to be as likely as possible to equal the true output y, which is random with distribution p(y|x). The probability that ŷ equals y can be written as the expected value, over y ∼ p(y|x), of the indicator of the event ŷ = y, E_{y∼p(y|x)}[I{ŷ = y}], where the indicator function I{·} is one if its argument is true and zero otherwise.
Using the definition of the expected value, and then discarding all terms that are zero, we get
E_{y∼p(y|x)}[I{ŷ = y}] = Σ_{k=1}^K I{ŷ = k} p(y = k | x) = p(y = ŷ | x). (3.28)
To maximize this quantity, we should choose ŷ such that p(y = ŷ | x) is as large as possible, which is exactly the claim made in (3.27).
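The argument can be illustrated with the probabilities from Example 3.1, treated here as if they were the true p(y | x) (which is an assumption; in practice they would be unknown). Always predicting class k is correct with probability p(y = k | x), so the largest probability gives the best prediction. A simple simulation, added for illustration, agrees with this.

```python
import numpy as np

classes = ["cerise", "turquoise", "purple"]
p_y_given_x = np.array([0.13, 0.39, 0.48])      # from Example 3.1

# If we always predict class k, we are correct with probability p(y = k | x)
prob_correct = dict(zip(classes, p_y_given_x))
y_hat = classes[int(np.argmax(p_y_given_x))]    # Bayes' classifier (3.27)

# Simulation: draw many true labels and count how often each prediction is right
rng = np.random.default_rng(0)
samples = rng.choice(len(classes), size=100_000, p=p_y_given_x)
empirical = [float(np.mean(samples == k)) for k in range(len(classes))]

print(y_hat, prob_correct, empirical)
```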
Bayes’ classifier in practice: useless, but a source of inspiration
Bayes’ classifier relies on the conditional probability p(y|x) and assumes that we know it. When p(y|x) is known, this classifier is optimal and there is no need to look for alternatives. In most machine learning problems, however, we do not know p(y|x); the whole point of learning is that the dependence of y on x is unknown beyond what the training data can reveal.
Bayes’ classifier is nevertheless not a dead end. In fact, most classifiers discussed in these notes can be understood as approximations of Bayes’ classifier, or, put differently, as methods to learn or estimate p(y|x) from training data.⁶ Even though not every method has been introduced yet, we give a concise overview of how several classifiers relate to (3.27).
• In binary logistic regression, p(y = 1 | x) is modeled as e^{β^T x} / (1 + e^{β^T x}).
• In linear and quadratic discriminant analysis (LDA and QDA), p(y|x) is obtained through Bayes’ theorem, where the class-conditional density p(x|y) is modeled as a Gaussian whose parameters (mean and covariance) are estimated from the training data, and the prior p(y) is taken as the empirical distribution of the training labels.
• In k-nearest neighbor (k-NN), p(y|x) is modeled as the empirical distribution among the k nearest samples in the training data.
• In tree-based methods, p(y|x) is modeled as the empirical distribution among the training data samples in the same leaf node.
• In deep learning, p(y|x) is modeled using a deep neural network and a softmax function.
Different classifiers model or approximate p(y|x) in different ways, but the default approach is to use Bayes’ classifier (3.27) to predict y, i.e., to select the class with the highest probability p(y|x). In a binary (two-class) setting, this means predicting the class whose probability is greater than 0.5.
Is it always good to predict according to Bayes’ classifier?
Bayes’ classifier is usually not available in practice, since we only have data and never observe p(y|x) itself. It nevertheless provides a useful reference when converting a learned approximation of p(y|x) into a concrete prediction, and we have treated it as our guiding principle throughout this chapter without questioning it further. Does Bayes’ classifier dictate that we must always predict the class with the highest probability assigned by our model? Not necessarily. Predicting the most probable class, as in (3.27), is a sensible default, but there are two important caveats.
Bayes’ classifier is optimal only when the aim is to minimize the total number of misclassifications. This objective is not universal: many real-world tasks involve asymmetric error costs. For instance, in predicting a patient’s health status, falsely predicting ‘well’ may carry far higher stakes than falsely predicting ‘bad’ (or vice versa, depending on the scenario). When such asymmetry matters, Bayes’ classifier (3.27) is not necessarily optimal.
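One way to make the asymmetry concrete is to attach a cost to each kind of error and predict the class with the smallest expected cost instead of the largest probability. The sketch below is an illustration under assumed numbers: the cost values, the posterior, and the function name are hypothetical and do not come from the notes.

```python
def min_expected_cost_predict(posterior, cost):
    """Return the prediction with the smallest expected cost.

    posterior: dict mapping each class k to (an estimate of) p(y = k | x)
    cost: dict mapping (predicted, actual) pairs to the cost of that outcome
    """
    def expected_cost(pred):
        return sum(cost[(pred, actual)] * p for actual, p in posterior.items())
    return min(posterior, key=expected_cost)

# Hypothetical costs: predicting 'well' for a patient who is 'bad' is ten times
# worse than predicting 'bad' for a patient who is 'well'; correct predictions
# cost nothing.
cost = {
    ("well", "well"): 0.0, ("well", "bad"): 10.0,
    ("bad", "well"): 1.0,  ("bad", "bad"): 0.0,
}
posterior = {"well": 0.8, "bad": 0.2}

most_probable = max(posterior, key=posterior.get)            # rule (3.27)
cost_sensitive = min_expected_cost_predict(posterior, cost)  # cost-aware rule
print(most_probable, cost_sensitive)
```

With these numbers the expected cost of predicting ‘well’ is 10 · 0.2 = 2.0 and that of predicting ‘bad’ is 1 · 0.8 = 0.8, so the cost-aware rule predicts ‘bad’ even though ‘well’ is the more probable class. With equal costs for both errors the two rules coincide.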
Bayes’ classifier achieves optimal performance only when the posterior probability p(y|x) is known exactly. When we rely on an approximate p(y|x), the standard Bayes decision rule in equation (3.27) is no longer guaranteed to be the best approach, since errors in the posterior estimate can lead to suboptimal classifications.
6 Sometimes this is not very explicit in the method, but if you look carefully, you will find it.
3.5 More on classification and classifiers
Linear and nonlinear classifiers
A regression model which is linear in its parameters is called linear regression (Chapter 2). For the classification problem, the term “linear” is used differently; a linear classifier is a classifier whose decision boundary (for the problem with K = 2 classes) is linear, and a nonlinear classifier is a classifier which can have a nonlinear decision boundary. Among the classifiers introduced in this chapter, logistic regression and LDA are linear classifiers, whereas QDA is a nonlinear classifier, cf. Figures 3.3 and 3.7. Note that even though logistic regression and LDA both are linear classifiers, their decision boundaries are not identical. All classifiers that will follow in the subsequent chapters, except for decision stumps (Chapter
As with linear regression (Section 2.4), nonlinear transformations of the inputs can be added as extra features, so that a seemingly rigid linear classifier can form complex decision boundaries. This approach requires crafting and selecting the nonlinear transformations by hand. A more widely used and automatic way to build a powerful classifier from a simple one is boosting, which is discussed in Chapter 6.
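The effect of a nonlinear input transformation can be seen in a toy case with one scalar input x. The sketch below is an illustration with hand-picked, hypothetical coefficients (nothing is learned here): the classifier is linear in the features (1, x, x²), but in terms of x itself it predicts class 1 on two separate intervals, which no single threshold on x could achieve.

```python
def features(x):
    """Augment the scalar input x with the nonlinear transform x**2
    (plus a constant 1 for the intercept)."""
    return [1.0, x, x * x]

def linear_classifier(beta, phi):
    """Predict 1 when the linear score beta^T phi is positive, else 0."""
    score = sum(b * f for b, f in zip(beta, phi))
    return 1 if score > 0 else 0

# Hypothetical, hand-picked coefficients: score = x**2 - 1,
# so the decision boundary in x is at x = -1 and x = 1.
beta = [-1.0, 0.0, 1.0]

xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
predictions = [linear_classifier(beta, features(x)) for x in xs]
print(predictions)
```

In practice the coefficients would be learned, for example by logistic regression on the transformed features; the choice of x² as the extra feature is exactly the manual step the text refers to.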
Regularization
As in linear regression, overfitting can occur in classification when the number of training samples n is not much larger than the number of input features p. A more detailed discussion of overfitting is provided in Chapter 5. Regularization helps reduce overfitting in classification as well. In logistic regression, a Ridge-like penalty on the coefficient vector β is a common regularization approach (see equation 2.28). For LDA and QDA, regularizing the covariance matrix estimation can also be beneficial, with the relevant formulations given in equations 3.20c and 3.20d.
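As a sketch of what a Ridge-like penalty could look like for binary logistic regression with labels y_i ∈ {0, 1}, the squared norm of β is added to the negative log-likelihood. The symbol γ for the regularization strength is an assumption made here, and the exact form intended in the notes may differ in details such as whether the intercept is penalized.

```latex
\widehat{\beta} = \arg\min_{\beta} \;
  -\sum_{i=1}^{n} \Big( y_i \, \beta^{\mathsf{T}} x_i
  - \log\big( 1 + e^{\beta^{\mathsf{T}} x_i} \big) \Big)
  + \gamma \, \|\beta\|_2^2
```

Setting γ = 0 recovers the ordinary maximum likelihood problem, while larger γ shrinks the coefficients toward zero.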
Evaluating binary classifiers
Binary classification (K = 2) is widely used to detect the presence of something, such as a disease, a radar target, or another object. In this framework, y = 1 (positive) denotes presence and y = 0 (negative) denotes absence. Such detection problems share common characteristics, including the need to distinguish true signals from background noise, to produce scores or probabilities for decision making, and to evaluate performance with metrics such as accuracy, precision, recall, ROC-AUC, and F1 score. Models are trained to maximize correct detections (true positives) while minimizing false alarms (false positives), under varying data distributions and noise levels.
(i) In many datasets the majority class is 0, so a classifier that always predicts 0 achieves high accuracy simply by predicting the most frequent label. A model can therefore look effective if success is measured by accuracy alone. For example, a medical support system that always predicts "healthy" would be correct most of the time, but it would be useless because it never detects illness when it matters.
(ii) A missed detection (predicting ŷ = 0 when in fact y = 1) might have much more severe consequences than a false detection (predicting ŷ = 1 when in fact y = 0).
For such classification problems, there is a set of analysis tools and terminology which we will introduce now.
Ratio    Name
FP/N     False positive rate, Fall-out, Probability of false alarm
TN/N     True negative rate, Specificity, Selectivity
TP/P     True positive rate, Sensitivity, Power, Recall, Probability of detection
FN/P     False negative rate, Miss rate
TP/P*    Positive predictive value, Precision
Table 3.1: Common terminology related to the quantities (TN, FN, FP, TP) in the confusion matrix.
To visualize a binary classifier’s performance on a test dataset, a confusion matrix groups the test samples into four cells defined by the actual label y (0 or 1) and the predicted label ŷ (0 or 1). The four cells are true negatives (TN) for y = 0 and ŷ = 0, false negatives (FN) for y = 1 and ŷ = 0, false positives (FP) for y = 0 and ŷ = 1, and true positives (TP) for y = 1 and ŷ = 1. The matrix is typically arranged with the predicted class as rows (ŷ = 0 and ŷ = 1, with row totals N* and P*) and the actual class as columns (y = 0 and y = 1, with column totals N and P), and the overall total is n = N + P. This compact summary underpins metrics such as accuracy, precision, recall, and F1 score.
In practice, TN, FN, FP, TP (and also N*, P*, N, P, and n) should be replaced by their actual counts, as shown in the next example. A wide body of terminology surrounds the confusion matrix, and Table 3.1 summarizes the most common terms.
The confusion matrix serves as a quick, informative snapshot of a classifier’s performance. Depending on the application, distinguishing false positives (FP, or Type I errors) from false negatives (FN, or Type II errors) can be crucial. Ideally, both FP and FN would be zero, but in practice this is rarely achievable, and the trade-off between these errors must be considered when evaluating a model.
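The counts in the confusion matrix and the ratios of Table 3.1 are simple to compute from a list of true labels and a list of predictions. The following is a minimal sketch with hypothetical function names and made-up labels; it assumes both classes occur in the data and that at least one positive prediction is made, so that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FN, FP, TP for binary labels coded as 0 (negative) and 1 (positive)."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    return tn, fn, fp, tp

def rates(tn, fn, fp, tp):
    """Ratios from Table 3.1, using the column totals N, P and the row total P*."""
    n_neg = tn + fp    # N: number of actual negatives
    n_pos = fn + tp    # P: number of actual positives
    p_star = fp + tp   # P*: number of predicted positives
    return {
        "false_positive_rate": fp / n_neg,
        "true_negative_rate": tn / n_neg,
        "true_positive_rate": tp / n_pos,
        "false_negative_rate": fn / n_pos,
        "precision": tp / p_star,
    }

# Hypothetical test labels and predictions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

tn, fn, fp, tp = confusion_counts(y_true, y_pred)
summary = rates(tn, fn, fp, tp)
print((tn, fn, fp, tp), summary)
```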
Motivated by the Bayes classifier, our default approach is to convert the estimated probability p(y = 1 | x) into a binary prediction by comparing it to a threshold: ŷ = 1 if p(y = 1 | x) ≥ t, and ŷ = 0 if p(y = 1 | x) < t, with t = 0.5 as the standard threshold. If the goal is to reduce the false positive rate, we can raise the threshold, which comes at the cost of a higher false negative rate; conversely, lowering the threshold decreases false negatives but increases false positives.
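The threshold rule and its trade-off can be illustrated on a handful of made-up probabilities. This is a sketch under assumed data (the probabilities, labels, and function names are hypothetical), counting false positives and false negatives at three thresholds.

```python
def classify(probabilities, t=0.5):
    """Predict 1 where the estimated p(y = 1 | x) is at least t, else 0."""
    return [1 if p >= t else 0 for p in probabilities]

def count_errors(y_true, y_pred):
    """Return (number of false positives, number of false negatives)."""
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    return fp, fn

# Hypothetical estimated probabilities p(y = 1 | x) and the true labels.
probs  = [0.05, 0.20, 0.35, 0.45, 0.60, 0.30, 0.55, 0.80, 0.95]
y_true = [0,    0,    0,    0,    0,    1,    1,    1,    1]

errors = {t: count_errors(y_true, classify(probs, t)) for t in (0.25, 0.5, 0.75)}
for t, (fp, fn) in errors.items():
    print(f"t = {t}: {fp} false positives, {fn} false negatives")
```

On this toy data, raising t from 0.25 to 0.75 removes the false positives one by one while the false negatives accumulate, which is the trade-off described above.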
Threshold tuning is often crucial for binary classification performance. To compare classifiers such as logistic regression and QDA on a problem without committing to a single threshold, the ROC curve is a useful tool: it summarizes performance across all possible thresholds. ROC stands for receiver operating characteristics, a name with origins in communications theory, and the curve traces the trade-off between the true positive rate and the false positive rate.
To plot an ROC curve, the true positive rate (TPR) is plotted against the false positive rate (FPR) for all decision thresholds t in [0, 1]. The resulting curve typically resembles Figure 3.9. A perfect classifier, which always predicts the correct value with full certainty, yields an ROC curve that touches the upper-left corner, while a classifier that makes random guesses yields the straight diagonal from bottom-left to top-right.
A concise summary of the ROC curve is the area under the curve, the AUC. As shown in Figure 3.9, a perfect classifier achieves an AUC of 1, while a classifier that makes only random guesses yields an AUC of 0.5.
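The construction of the ROC curve and its AUC can be sketched as follows. This is an illustration rather than a reference implementation: the data and function names are hypothetical, the thresholds are taken at the distinct estimated probabilities (plus one threshold above all of them, so that the curve starts at the origin), and the area is approximated with the trapezoidal rule.

```python
def roc_points(y_true, probs):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold t downward."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Each distinct score used as a threshold, plus one threshold above all
    # scores, covers every set of predictions the rule p >= t can produce.
    thresholds = [float("inf")] + sorted(set(probs), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical estimated probabilities p(y = 1 | x) and true labels.
probs  = [0.05, 0.20, 0.35, 0.45, 0.60, 0.30, 0.55, 0.80, 0.95]
y_true = [0,    0,    0,    0,    0,    1,    1,    1,    1]

points = roc_points(y_true, probs)
area = auc(points)
print(points)
print(area)
```

If every positive sample receives a higher probability than every negative sample, the curve passes through the upper-left corner and the area is 1.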
Example 3.2: Confusion matrix in thyroid disease detection
The thyroid is an endocrine gland that produces hormones regulating the metabolic rate and protein synthesis, and thyroid disorders can have serious consequences. We consider thyroid disease detection using a dataset from the UCI Machine Learning Repository (Dheeru and Karra Taniskidou, 2017). It comprises 7200 data points, each with 21 medical indicators (both qualitative and quantitative) and a qualitative diagnosis (normal, hyperthyroid, hypothyroid), which we convert to a binary output: normal versus not normal. The dataset is split into a training set (3772 samples) and a test set (3428 samples). A logistic regression classifier was trained on the training data and evaluated on the test data using the threshold t = 0.5, yielding the following confusion matrix:
                  y = normal    y = not normal
ŷ = normal           3177            237
ŷ = not normal          1             13
Most test data points are correctly predicted as normal, but a large part of the not normal data is also falsely predicted as normal. This might indeed be undesired in the application.
To change the picture, we change the threshold to t = 0.15, and obtain new predictions with the following confusion matrix instead:
                  y = normal    y = not normal
ŷ = normal           3067            165
ŷ = not normal        111             85
Lowering the threshold yields a significantly higher true positive rate (85 patients correctly predicted as not normal, up from 13), but at the cost of a markedly worse false positive rate (111 patients incorrectly predicted as not normal, up from 1). Whether this trade-off is desirable depends on the application, in particular on which type of error carries the heavier real-world consequences and on the tolerance for false alarms versus missed detections.
Total accuracy alone is not very informative for this problem. A baseline predictor that always labels instances as normal would achieve roughly 93% accuracy, which shows how misleading a high accuracy can be. In comparison, the second confusion matrix corresponds to about 92% accuracy but would likely be far more useful in practice, because the classifier actually detects a substantial share of the not normal cases.
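The figures quoted in this example can be checked directly from the two confusion matrices. The sketch below treats not normal as the positive class and reads the t = 0.5 matrix as 237 not normal patients predicted as normal and 1 normal patient predicted as not normal, consistent with the discussion above; the function name is hypothetical.

```python
def summarize(tn, fn, fp, tp):
    """Accuracy, true positive rate and false positive rate from confusion
    matrix counts. 'Positive' means not normal, 'negative' means normal."""
    n = tn + fn + fp + tp
    return {
        "accuracy": (tn + tp) / n,
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }

# Counts from Example 3.2 (test set of 3428 samples).
at_050 = summarize(tn=3177, fn=237, fp=1, tp=13)     # threshold t = 0.5
at_015 = summarize(tn=3067, fn=165, fp=111, tp=85)   # threshold t = 0.15
# Baseline that always predicts normal: all 3178 normal samples are true
# negatives and all 250 not normal samples are false negatives.
always_normal = summarize(tn=3178, fn=250, fp=0, tp=0)

for name, s in [("t = 0.5", at_050), ("t = 0.15", at_015), ("always normal", always_normal)]:
    print(f"{name}: accuracy {s['accuracy']:.3f}, TPR {s['tpr']:.3f}, FPR {s['fpr']:.4f}")
```

The accuracies of the three predictors lie within about one percentage point of each other, while the true positive rate ranges from 0 for the baseline to 0.34 at t = 0.15, which is why accuracy alone hides the difference that matters here.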
[Figure 3.9: ROC curves (true positive rate against false positive rate) for a typical example, a perfect classifier, and random guessing.]
4 Non-parametric methods for regression and classification: k-NN and trees
Machine learning models like linear regression, logistic regression, LDA, and QDA rely on a fixed set of parameters learned from training data. Once these parameters are estimated and stored, the training data can be discarded, since it is no longer needed for prediction. These methods share a fixed model structure: increasing the amount of training data can reduce parameter variance and improve accuracy, but it does not enhance the model's expressive power. In particular, logistic regression is limited to linear decision boundaries, regardless of how much training data is available.
Beyond such fixed-structure approaches, there exists a class of methods that adapts to the training data. Two representative techniques in this class are k-nearest neighbors (k-NN) and tree-based methods. Both can be used for classification and regression, but this discussion concentrates on the classification problem.