
Lecture on deep learning


DOCUMENT INFORMATION

Basic information

Title: Lecture Notes Selected Theoretical Aspects of Machine Learning and Deep Learning
Author: François Bachoc
Institution: University Paul Sabatier
Subject: Machine Learning and Deep Learning
Document type: Lecture Notes
Year: 2024
City: Toulouse
Number of pages: 46
File size: 1.34 MB


Structure

  • 1.1 Regression
  • 1.2 Classification
  • 1.3 Neural networks
  • 1.4 Backpropagation for neural networks
  • 2.1 Statement of the theorem
  • 2.2 Sketch of the proof
  • 2.3 Complete proof
  • 3.1 Shattering coefficients
  • 3.2 Bounding the generalization error from the shattering coefficients
  • 3.3 VC-dimension
  • 3.4 Bounding the shattering coefficients from the VC-dimension
  • 4.1 Neural networks as directed acyclic graphs
  • 4.2 Bounding the VC-dimension
  • 4.3 Proof of the theorem

Content

Lecture notes: Selected theoretical aspects of machine learning and deep learning
François Bachoc, University Paul Sabatier
January 22, 2024

Regression

We consider a law $\mathcal{L}$ on $[0,1]^d \times \mathbb{R}$. We aim at finding a function $f : [0,1]^d \to \mathbb{R}$ such that, for $(X, Y) \sim \mathcal{L}$, the mean square error $\mathbb{E}\big[(f(X) - Y)^2\big]$ is small.

The optimal function $f$ is then the conditional expectation, as shown in the following proposition.

Proposition 1. Let $f^\star : [0,1]^d \to \mathbb{R}$ be defined by $f^\star(x) = \mathbb{E}(Y \mid X = x)$, for $x \in [0,1]^d$. Then, for any $f : [0,1]^d \to \mathbb{R}$,
$$\mathbb{E}\big[(f(X) - Y)^2\big] = \mathbb{E}\big[(f^\star(X) - Y)^2\big] + \mathbb{E}\big[(f(X) - f^\star(X))^2\big].$$

From the previous proposition, $f^\star$ minimizes the mean square error among all possible functions, and the closer a function $f$ is to $f^\star$, the smaller its mean square error.

Proof of Proposition 1. Let us use the law of total expectation:
$$\mathbb{E}\big[(f(X) - Y)^2\big] = \mathbb{E}\Big[\mathbb{E}\big[(f(X) - Y)^2 \mid X\big]\Big].$$
Conditionally on $X$, we can use the equation
$$\mathbb{E}\big[(Z - a(X))^2 \mid X\big] = \mathrm{Var}(Z \mid X) + \big(\mathbb{E}(Z \mid X) - a(X)\big)^2$$
for a random variable $Z$ and a function $a(X)$ (bias-variance decomposition). This gives
$$\mathbb{E}\big[(f(X) - Y)^2 \mid X\big] = \mathrm{Var}(Y \mid X) + \big(f^\star(X) - f(X)\big)^2.$$
Applying the same identity with $f^\star$ in place of $f$ and taking expectations yields the claimed decomposition.
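As a quick sanity check of Proposition 1, the following minimal sketch (not part of the notes) simulates an assumed law with $f^\star(x) = \sin(2\pi x)$ and verifies the mean square error decomposition by Monte Carlo; numpy and the chosen law are assumptions made only for illustration.

    import numpy as np

    # Monte Carlo illustration of Proposition 1 on a simulated law:
    # E[(f(X) - Y)^2] = E[(f*(X) - Y)^2] + E[(f(X) - f*(X))^2].
    rng = np.random.default_rng(0)
    n = 200_000
    X = rng.uniform(0.0, 1.0, size=n)            # d = 1 here
    Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=n)

    f_star = lambda x: np.sin(2 * np.pi * x)     # conditional expectation E[Y | X = x]
    f_other = lambda x: 2 * x - 1                # an arbitrary competitor

    mse_star = np.mean((f_star(X) - Y) ** 2)
    mse_other = np.mean((f_other(X) - Y) ** 2)
    gap = np.mean((f_other(X) - f_star(X)) ** 2)
    print(mse_star, mse_other, mse_star + gap)   # mse_other is close to mse_star + gap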

We now consider a dataset of independent pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ with common law $\mathcal{L}$, and a set $\mathcal{F}$ of functions from $[0,1]^d$ to $\mathbb{R}$. The estimator $\hat{f}_n$ is obtained by empirical risk minimization over $\mathcal{F}$:
$$\hat{f}_n \in \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \big(f(X_i) - Y_i\big)^2.$$
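The following sketch illustrates empirical risk minimization when $\mathcal{F}$ is taken, purely for illustration, to be the set of affine functions $x \mapsto \langle w, x \rangle + b$, in which case the minimizer has a least-squares closed form. The function name and the simulated data are assumptions, not part of the notes.

    import numpy as np

    # Empirical risk minimization over the illustrative class F = {x -> <w, x> + b}.
    def erm_regression(X, Y):
        """Return (w, b) minimizing (1/n) * sum_i (<w, X_i> + b - Y_i)^2."""
        n, d = X.shape
        design = np.hstack([X, np.ones((n, 1))])           # extra column for the intercept b
        theta, *_ = np.linalg.lstsq(design, Y, rcond=None)
        return theta[:d], theta[d]

    # Example with simulated data (X_i, Y_i) drawn i.i.d. from some law L.
    rng = np.random.default_rng(1)
    X = rng.uniform(size=(500, 3))
    Y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + rng.normal(scale=0.1, size=500)
    w_hat, b_hat = erm_regression(X, Y)
    emp_risk = np.mean((X @ w_hat + b_hat - Y) ** 2)
    print(w_hat, b_hat, emp_risk)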

The next proposition enables us to bound the mean square error of $\hat{f}_n$.

Proposition 2. Let $(X, Y) \sim \mathcal{L}$, independently from $(X_1, Y_1), \ldots, (X_n, Y_n)$. Then we have
$$\mathbb{E}\big[(\hat{f}_n(X) - Y)^2\big] - \mathbb{E}\big[(f^\star(X) - Y)^2\big] \le 2\,\mathbb{E}\bigg[\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^{n} \big(f(X_i) - Y_i\big)^2 - \mathbb{E}\big[(f(X) - Y)^2\big] \Big|\bigg] + \inf_{f \in \mathcal{F}} \mathbb{E}\big[(f(X) - f^\star(X))^2\big],$$
where the expectation of $(\hat{f}_n(X) - Y)^2$ is taken with respect to both $(X_1, Y_1), \ldots, (X_n, Y_n)$ and $(X, Y)$.

• The quantity $\mathbb{E}\big[(\hat{f}_n(X) - Y)^2\big] - \mathbb{E}\big[(f^\star(X) - Y)^2\big]$ is always non-negative and is called the excess of risk.

• The first component of the bound is
$$2\,\mathbb{E}\bigg[\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^{n} \big(f(X_i) - Y_i\big)^2 - \mathbb{E}\big[(f(X) - Y)^2\big] \Big|\bigg],$$
which is called the generalization error. The larger the set $\mathcal{F}$ is, the larger this error is, because the supremum is taken over a larger set.

• The second component of the bound is
$$\inf_{f \in \mathcal{F}} \mathbb{E}\big[(f(X) - f^\star(X))^2\big],$$
which is called the approximation error. The smaller $\mathcal{F}$ is, the larger this error is, because the infimum is taken over fewer functions.

• Hence, we see that $\mathcal{F}$ should be neither too small nor too large, which can be interpreted as a bias-variance trade-off; this is illustrated by the sketch below.
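The trade-off can be observed numerically. In the sketch below (an illustration, not from the notes), $\mathcal{F}$ is the set of polynomials of a given degree: small degrees give a large approximation error, while large degrees give a small training error but a large gap between training and test error, a rough proxy for the generalization error.

    import numpy as np

    # Richer classes F (higher polynomial degree) reduce the approximation error
    # but widen the train/test gap; the numbers are illustrative only.
    rng = np.random.default_rng(2)

    def sample(n):
        X = rng.uniform(size=n)
        Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=n)
        return X, Y

    X_train, Y_train = sample(30)
    X_test, Y_test = sample(10_000)

    for degree in (1, 3, 9):
        coeffs = np.polyfit(X_train, Y_train, degree)        # ERM over polynomials of that degree
        train_err = np.mean((np.polyval(coeffs, X_train) - Y_train) ** 2)
        test_err = np.mean((np.polyval(coeffs, X_test) - Y_test) ** 2)
        print(degree, round(train_err, 3), round(test_err, 3))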

Proof of Proposition 2. Write $R(f) = \mathbb{E}\big[(f(X) - Y)^2\big]$ for the risk and $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} (f(X_i) - Y_i)^2$ for the empirical risk. We let, for $\epsilon > 0$, $f_\epsilon \in \mathcal{F}$ be such that
$$R(f_\epsilon) \le \inf_{f \in \mathcal{F}} R(f) + \epsilon.$$
Then we have
$$R(\hat{f}_n) - R(f^\star) = \big(R(\hat{f}_n) - \hat{R}_n(\hat{f}_n)\big) + \big(\hat{R}_n(\hat{f}_n) - \hat{R}_n(f_\epsilon)\big) + \big(\hat{R}_n(f_\epsilon) - R(f_\epsilon)\big) + \big(R(f_\epsilon) - R(f^\star)\big)$$
$$\le 2 \sup_{f \in \mathcal{F}} \big|\hat{R}_n(f) - R(f)\big| + \Big(\inf_{f \in \mathcal{F}} R(f) - R(f^\star)\Big) + \epsilon,$$
using $\hat{R}_n(\hat{f}_n) \le \hat{R}_n(f_\epsilon)$. Here, in $R(\hat{f}_n)$, the function $\hat{f}_n$ is fixed, as the expectation is taken only with respect to $X$ and $Y$; this is valid since $(X, Y)$ is independent from $X_1, Y_1, \ldots, X_n, Y_n$. Then we take expectations with respect to $(X_1, Y_1), \ldots, (X_n, Y_n)$ and note that, by Proposition 1, $\inf_{f \in \mathcal{F}} R(f) - R(f^\star) = \inf_{f \in \mathcal{F}} \mathbb{E}\big[(f(X) - f^\star(X))^2\big]$. Since the resulting inequality holds for any $\epsilon > 0$, we also obtain the inequality with $\epsilon = 0$, which concludes the proof.

Classification

The general principle is quite similar to regression. We consider a law $\mathcal{L}$ on $[0,1]^d \times \{0,1\}$. We are looking for a function $f : [0,1]^d \to \{0,1\}$ (a classifier) such that, with $(X, Y) \sim \mathcal{L}$, the probability $\mathbb{P}\big(f(X) \neq Y\big)$ is small. The next proposition provides the optimal function $f$ for this.

Proposition 3. Let $p^\star : [0,1]^d \to [0,1]$ be defined by $p^\star(x) = \mathbb{P}(Y = 1 \mid X = x)$ for $x \in [0,1]^d$. We let $T^\star : [0,1]^d \to \{0,1\}$ be defined by $T^\star(x) = \mathbb{1}\{p^\star(x) \ge 1/2\}$. Then, for any $f : [0,1]^d \to \{0,1\}$,
$$\mathbb{P}\big(f(X) \neq Y\big) - \mathbb{P}\big(T^\star(X) \neq Y\big) = \mathbb{E}\Big[\big|1 - 2 p^\star(X)\big|\,\mathbb{1}\{f(X) \neq T^\star(X)\}\Big] \ge 0.$$

Hence, we see that a prediction error (that is, predicting $f(X)$ with $f(X) \neq T^\star(X)$) is more harmful when $|1 - 2 p^\star(X)|$ is large. This is well interpreted, because when $|1 - 2 p^\star(X)| = 0$, we have $p^\star(X) = 1/2$, thus $\mathbb{P}(Y = 1 \mid X) = 1/2$. In this case, $\mathbb{P}(f(X) \neq Y \mid X) = 1/2$, regardless of the value of $f(X)$.

Proof of Proposition 3. Using the law of total expectation, we have
$$\mathbb{P}\big(f(X) \neq Y\big) = \mathbb{E}\Big[\mathbb{P}\big(f(X) \neq Y \mid X\big)\Big].$$
Conditionally on $X$, we have
$$\mathbb{P}\big(f(X) \neq Y \mid X\big) = f(X)\big(1 - p^\star(X)\big) + \big(1 - f(X)\big)\,p^\star(X),$$
which is minimized over the two possible values of $f(X)$ by $T^\star(X)$, and the difference between the conditional error probabilities of $f$ and $T^\star$ equals $|1 - 2 p^\star(X)|\,\mathbb{1}\{f(X) \neq T^\star(X)\}$. Taking the expectation concludes the proof.
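The identity of Proposition 3 can be checked by simulation. The sketch below assumes a simple law with $p^\star(x) = 0.8\,x$ and an arbitrary competing classifier; these choices, and the code itself, are illustrative only.

    import numpy as np

    # Monte Carlo check of Proposition 3 on a simulated law:
    # P(f(X) != Y) - P(T*(X) != Y) = E[ |1 - 2 p*(X)| 1{f(X) != T*(X)} ].
    rng = np.random.default_rng(3)
    n = 500_000
    X = rng.uniform(size=n)
    p_star = 0.8 * X                        # assumed P(Y = 1 | X = x) = 0.8 x
    Y = (rng.uniform(size=n) < p_star).astype(int)

    T_star = (p_star >= 0.5).astype(int)    # Bayes classifier
    f = (X >= 0.3).astype(int)              # an arbitrary competing classifier

    lhs = np.mean(f != Y) - np.mean(T_star != Y)
    rhs = np.mean(np.abs(1 - 2 * p_star) * (f != T_star))
    print(lhs, rhs)                         # the two values nearly coincide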

We again consider a dataset of independent pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ with common law $\mathcal{L}$, and a set $\mathcal{F}$ of functions from $[0,1]^d$ to $\{0,1\}$. The classifier $\hat{f}_n$ is obtained by empirical risk minimization over $\mathcal{F}$:
$$\hat{f}_n \in \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{f(X_i) \neq Y_i\}.$$
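As in the regression case, empirical risk minimization can be carried out explicitly when $\mathcal{F}$ is small. The sketch below uses an assumed finite class of threshold classifiers (not a construction from the notes) and minimizes the empirical 0-1 risk by exhaustive search.

    import numpy as np

    # ERM with the 0-1 loss over the illustrative class F = {x -> 1{x >= s} : s in a grid}.
    def erm_classification(X, Y, thresholds):
        risks = [np.mean((X >= s).astype(int) != Y) for s in thresholds]
        best = int(np.argmin(risks))
        return thresholds[best], risks[best]

    rng = np.random.default_rng(4)
    X = rng.uniform(size=300)
    Y = (rng.uniform(size=300) < 0.8 * X).astype(int)   # same simulated law as in the previous sketch
    s_hat, risk_hat = erm_classification(X, Y, np.linspace(0.0, 1.0, 101))
    print(s_hat, risk_hat)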

The next proposition enables us to bound the probability of error of $\hat{f}_n$.

Proposition 4. Let $(X, Y) \sim \mathcal{L}$, independently from $(X_1, Y_1), \ldots, (X_n, Y_n)$. Then we have
$$\mathbb{E}\big[\mathbb{1}\{\hat{f}_n(X) \neq Y\}\big] - \mathbb{P}\big(T^\star(X) \neq Y\big) \le 2\,\mathbb{E}\bigg[\sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{f(X_i) \neq Y_i\} - \mathbb{P}\big(f(X) \neq Y\big) \Big|\bigg] + \inf_{f \in \mathcal{F}} \Big(\mathbb{P}\big(f(X) \neq Y\big) - \mathbb{P}\big(T^\star(X) \neq Y\big)\Big).$$
The proof and the interpretation are the same as for regression.

Neural networks

Neural networks define a set of functions from $[0,1]^d$ to $\mathbb{R}$.

Feed-forward neural networks with one hidden layer. This is the simplest example. These networks are represented as in Figure 1.

Figure 1: Representation of a feed-forward neural network with one hidden layer.

In Figure 1, the interpretation is the following.

• An arrow indicates either
  – that there is a multiplication by a scalar,
  – or that a function from $\mathbb{R}$ to $\mathbb{R}$ is applied and (possibly) a scalar is added.
• The function $\sigma : \mathbb{R} \to \mathbb{R}$ is called the activation function.
• A circle (a neuron) sums all the values that are pointed to it by the arrows.
• The column with $w_1, \ldots, w_N$ is called the hidden layer.

The function corresponding to Figure 1 is
$$x \in [0,1]^d \mapsto \sum_{i=1}^{N} v_i\, \sigma\big(\langle w_i, x \rangle + b_i\big),$$
with $\langle \cdot, \cdot \rangle$ the standard inner product on $\mathbb{R}^d$.

The neural network function is parametrized by

• $w_1, \ldots, w_N \in \mathbb{R}^d$, the weights (of the neurons of the hidden layer),
• $b_1, \ldots, b_N \in \mathbb{R}$, the biases (of the neurons of the hidden layer),
• $v_1, \ldots, v_N \in \mathbb{R}$, the output weights.

Examples of activation functions include, for $t \in \mathbb{R}$, the ReLU function $\sigma(t) = \max(t, 0)$.

For instance, when $d = 1$, the network of Figure 2 encodes the absolute value function with $\sigma$ the ReLU function (using the identity $|t| = \sigma(t) + \sigma(-t)$).

Figure 2: Representation of the absolute value function as a neural network.
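Below is a minimal implementation of the one-hidden-layer function above, together with a check of the Figure 2 construction via the identity $|t| = \sigma(t) + \sigma(-t)$. The helper names and the specific weights $w_1 = 1$, $w_2 = -1$, $b_1 = b_2 = 0$, $v_1 = v_2 = 1$ are assumptions consistent with that identity, not taken verbatim from Figure 2.

    import numpy as np

    def relu(t):
        return np.maximum(t, 0.0)

    def one_hidden_layer(x, W, b, v, sigma=relu):
        """f(x) = sum_i v_i * sigma(<w_i, x> + b_i); W has the w_i as rows."""
        return v @ sigma(W @ x + b)

    # Figure 2 (d = 1, N = 2): |x| = relu(x) + relu(-x),
    # i.e. w_1 = 1, w_2 = -1, b_1 = b_2 = 0, v_1 = v_2 = 1.
    W = np.array([[1.0], [-1.0]])
    b = np.zeros(2)
    v = np.ones(2)
    for x in (-0.7, 0.0, 0.4):
        print(x, one_hidden_layer(np.array([x]), W, b, v))   # prints |x|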

Feed-forward neural networks with several hidden layers. These networks compose several hidden layers, each applying the activation function to affine combinations of the previous layer's outputs, which enlarges the set of functions that can be represented. They are represented as in Figure 3.

Figure 3: Representation of a feed-forward neural network with several hidden layers.

The neural network function corresponding to Figure 3 is defined by
$$x \in [0,1]^d \mapsto f_v \circ g_c \circ g_{c-1} \circ \cdots \circ g_1(x), \qquad (1)$$
where $f_v : \mathbb{R}^{N_c} \to \mathbb{R}$ is defined by
$$f_v(u) = \sum_{i=1}^{N_c} u_i v_i,$$
and for $i = 1, \ldots, c$, with $N_0 = d$, $g_i : \mathbb{R}^{N_{i-1}} \to \mathbb{R}^{N_i}$ is defined by, for $u \in \mathbb{R}^{N_{i-1}}$ and $j = 1, \ldots, N_i$,
$$\big(g_i(u)\big)_j = \sigma\big(\langle w_j^{(i)}, u \rangle + b_j^{(i)}\big).$$

The neural network function is parametrized by

• $v_1, \ldots, v_{N_c} \in \mathbb{R}$, the output weights,
• $b^{(c)}_1, \ldots, b^{(c)}_{N_c} \in \mathbb{R}$, the biases of the hidden layer $c$,
• $w^{(c)}_1, \ldots, w^{(c)}_{N_c} \in \mathbb{R}^{N_{c-1}}$, the weights of the hidden layer $c$,
• ...
• $b^{(2)}_1, \ldots, b^{(2)}_{N_2} \in \mathbb{R}$, the biases of the hidden layer 2,
• $w^{(2)}_1, \ldots, w^{(2)}_{N_2} \in \mathbb{R}^{N_1}$, the weights of the hidden layer 2,
• $b^{(1)}_1, \ldots, b^{(1)}_{N_1} \in \mathbb{R}$, the biases of the hidden layer 1,
• $w^{(1)}_1, \ldots, w^{(1)}_{N_1} \in \mathbb{R}^d$, the weights of the hidden layer 1.
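A sketch of the composition (1) follows, with one $(W, b)$ pair per hidden layer and the output weights $v$. The function names, the random parameters and the example architecture are assumptions made only to illustrate the shapes involved.

    import numpy as np

    def relu(t):
        return np.maximum(t, 0.0)

    def feed_forward(x, layers, v, sigma=relu):
        """Evaluate f_v(g_c(...g_1(x)...)) as in (1).

        `layers` is a list of (W, b) pairs; for layer i, W has shape
        (N_i, N_{i-1}) with rows w_j^(i), and b has shape (N_i,)."""
        u = x
        for W, b in layers:
            u = sigma(W @ u + b)        # (g_i(u))_j = sigma(<w_j^(i), u> + b_j^(i))
        return v @ u                    # f_v(u) = sum_i u_i v_i

    # Example architecture: d = 3, c = 2 hidden layers with N_1 = 5, N_2 = 4.
    rng = np.random.default_rng(5)
    layers = [(rng.normal(size=(5, 3)), rng.normal(size=5)),
              (rng.normal(size=(4, 5)), rng.normal(size=4))]
    v = rng.normal(size=4)
    x = rng.uniform(size=3)             # a point of [0,1]^d
    print(feed_forward(x, layers, v))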

To come back to regression, the class of functions $\mathcal{F}$ corresponding to neural networks is given by the choice of

• the activation function $\sigma$,
• the number $c$ of hidden layers,
• $N_1, \ldots, N_c$, the numbers of neurons in the hidden layers.

These parameters are called architecture parameters. Then, for a given architecture, $\mathcal{F}$ is a parametric set of functions, indexed by the weights and biases above. For classification, for $g \in \mathcal{F}$, we take
$$f(x) = \begin{cases} 1 & \text{if } g(x) \ge 0, \\ 0 & \text{if } g(x) < 0. \end{cases}$$

VC-dimension

For $m > m_0$, we have seen that $\Pi_{\mathcal{F}}(m) \le \mathrm{card}(\mathcal{F}) \le 2^{m_0} < 2^m$. Hence $m \notin \{m \in \mathbb{N};\ \Pi_{\mathcal{F}}(m) = 2^m\}$ and thus $\sup\{m \in \mathbb{N};\ \Pi_{\mathcal{F}}(m) = 2^m\} \le m_0$.

Remark 17. If $\mathrm{VCdim}(\mathcal{F}) = V$,

We choose $k$ such that $x_i^\top w_k \ge 0$ exactly when $a_i \ge 0$ ($k$ exists since we reach all the possible sign vectors), so that every term $a_i\, x_i^\top w_k$ is non-negative. Since $a$ is non-zero, we can assume that there is a $j$ such that $a_j < 0$. Since $(a^\top X W)_k \ge |a_j|\,|x_j^\top w_k| > 0$ (because $x_j^\top w_k < 0$ and $a_j < 0$), the assumption $a^\top X W = 0$ leads to a contradiction. Consequently, there cannot exist a non-zero vector $a$ of size $(d+1) \times 1$ such that $a^\top X W = 0$, which implies that the $d+1$ rows of $XW$ are linearly independent. Therefore, the rank of $XW$ is equal to $d+1$; but since $X$ has dimension $(d+1) \times d$, its rank is at most $d$. This is a contradiction, so the initial hypothesis cannot hold.

Let us now consider $\mathcal{F}_{d,a}$. Let $x_1 = e_1, \ldots, x_d = e_d$ and $x_{d+1} = 0$ in $\mathbb{R}^d$, where $e_1, \ldots, e_d$ is the canonical basis. Then, for any $y_1, \ldots, y_{d+1} \in \{0,1\}$, write, for $i = 1, \ldots, d+1$,
$$z_i = \begin{cases} 1 & \text{if } y_i = 1, \\ -1 & \text{if } y_i = 0. \end{cases}$$
Consider the function
$$x \in [0,1]^d \mapsto \mathbb{1}\Big\{\Big\langle x, \sum_{j=1}^{d} (z_j - z_{d+1})\, x_j \Big\rangle \ge -z_{d+1}\Big\}.$$
Then, for $k = 1, \ldots, d$,
$$\mathbb{1}\Big\{\Big\langle x_k, \sum_{j=1}^{d} (z_j - z_{d+1})\, x_j \Big\rangle \ge -z_{d+1}\Big\} = \mathbb{1}\{z_k - z_{d+1} \ge -z_{d+1}\} = \mathbb{1}\{z_k \ge 0\} = y_k,$$
and
$$\mathbb{1}\Big\{\Big\langle x_{d+1}, \sum_{j=1}^{d} (z_j - z_{d+1})\, x_j \Big\rangle \ge -z_{d+1}\Big\} = \mathbb{1}\{0 \ge -z_{d+1}\} = \mathbb{1}\{z_{d+1} \ge 0\} = y_{d+1}.$$
Hence we reach the $2^{d+1}$ possible vectors and thus $\mathrm{VCdim}(\mathcal{F}_{d,a}) \ge d+1$.

Assume now, for the sake of contradiction, that $\mathrm{VCdim}(\mathcal{F}_{d,a}) \ge d+2$. Hence there exist $x_1, \ldots, x_{d+2} \in [0,1]^d$ such that for all $y_1, \ldots, y_{d+2} \in \{0,1\}$, there exist $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ such that, for $k = 1, \ldots, d+2$,
$$\mathbb{1}\{\langle x_k, w \rangle + b \ge 0\} = y_k.$$
We write $\bar{x}_i = \binom{x_i}{1}$ of size $(d+1) \times 1$ for $i = 1, \ldots, d+2$ and $\bar{w} = \binom{w}{b}$ of size $(d+1) \times 1$. Then, for $k = 1, \ldots, d+2$,
$$\mathbb{1}\{\langle \bar{x}_k, \bar{w} \rangle \ge 0\} = y_k.$$
Hence in $\mathbb{R}^{d+1}$ we have shattered the $d+2$ vectors $\bar{x}_1, \ldots, \bar{x}_{d+2}$ (we have obtained all the possible sign vectors) with linear classifiers. This implies $\mathrm{VCdim}(\mathcal{F}_{d+1,l}) \ge d+2$, which is false since we have shown above that $\mathrm{VCdim}(\mathcal{F}_{d+1,l}) = d+1$. Hence we have $\mathrm{VCdim}(\mathcal{F}_{d,a}) = d+1$.
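The explicit construction used for the lower bound can be verified numerically. The sketch below (a hedged check, not from the notes) shatters the $d+1$ points $e_1, \ldots, e_d, 0$ with affine classifiers using $w = \sum_{j=1}^{d} (z_j - z_{d+1}) e_j$ and $b = z_{d+1}$.

    import itertools
    import numpy as np

    # Check that affine classifiers x -> 1{<x, w> + b >= 0} shatter the d + 1 points
    # e_1, ..., e_d, 0, using the explicit (w, b) built in the proof above.
    d = 4
    points = np.vstack([np.eye(d), np.zeros(d)])          # rows: e_1, ..., e_d, 0
    for y in itertools.product([0, 1], repeat=d + 1):
        z = np.where(np.array(y) == 1, 1.0, -1.0)
        w = z[:d] - z[d]                                  # coordinates of w in the canonical basis
        b = z[d]
        labels = (points @ w + b >= 0).astype(int)
        assert tuple(labels) == y                         # every labeling is realized
    print("all", 2 ** (d + 1), "labelings realized: VCdim >= d + 1")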

Bounding the shattering coefficients from the VC-dimension

The next lemma enables us to bound the shattering coefficients using bounds on the VC-dimension.

Lemma 20 (Sauer's lemma). Let $\mathcal{F}$ be a non-empty set of functions from $[0,1]^d$ to $\{0,1\}$. Assume that $\mathrm{VCdim}(\mathcal{F}) =: V < \infty$. Then, for $m > 0$,
$$\Pi_{\mathcal{F}}(m) \le \sum_{k=0}^{V} \binom{m}{k}.$$
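A small helper (illustrative only) that evaluates the Sauer bound and compares it to $2^m$, showing polynomial rather than exponential growth once $m$ exceeds the VC-dimension.

    from math import comb

    # Sauer's lemma bound: Pi_F(m) <= sum_{k=0}^{V} C(m, k) when VCdim(F) = V.
    def sauer_bound(m, V):
        return sum(comb(m, k) for k in range(V + 1))

    V = 5
    for m in (5, 10, 20, 40):
        print(m, sauer_bound(m, V), 2 ** m)   # polynomial in m versus 2^m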

To prove the bound (11) on $\mathrm{VCdim}(\mathcal{F})$, we will combine (13) with the next lemma (which we do not prove).

Lemma 25. Let $r \ge 16$ and $w \ge t > 0$. Then, for any $m > t + w \log_2\big(2 r \log_2(r)\big) =: x_0$, we have $2^t (r m)^w < 2^m$.

Hence, from (13) and by definition of the VC-dimension, Lemma 25 with $t = L$, $w = \sum_{i=1}^{L} W_i$ and $r = 2epR \ge 2eU \ge 16$ yields
$$\mathrm{VCdim}(\mathcal{F}) \le L + \Big(\sum_{i=1}^{L} W_i\Big) \log_2\big(4 e p R \log_2(2 e p R)\big),$$
which proves (11).

