Applications
Machine learning is filled with key concepts such as parameter optimization, stochastic gradient descent, automatic differentiation, and backpropagation, highlighting the significance of matrix calculus in the field. For those interested, exploring these topics further is highly encouraged.
Large physical simulations, particularly in engineering design, often involve a vast array of parameters. Understanding the derivatives of simulation outputs with respect to these parameters is essential for assessing sensitivity to uncertainties and for enabling large-scale optimization.
The shape of an airplane wing can be defined by numerous parameters, and by calculating the derivative of the drag force from extensive fluid-flow simulations, it becomes possible to optimize the wing design to reduce drag while maintaining specific lift or other constraints.
Topology optimization is an advanced parameterization technique that treats the material at every point in space as a variable, allowing for the discovery of not just optimal shapes but also optimal topologies, which describe how materials are interconnected. This method has been used effectively in mechanical engineering to create intricate lattice structures for components such as airplane-wing cross sections and artificial hips, minimizing weight while maintaining the required strength.
In addition to engineering design, complex differentiation challenges emerge when fitting the unknown parameters of a model to experimental data, and when assessing uncertainties in model outputs given imprecise parameters or inputs. This situation is closely linked to regression problems in statistics, except that here the model may consist of a large set of differential equations containing unknown parameters.
Applications: Data science and multivariable statistics
In multivariate statistics, models are typically expressed using matrix inputs and outputs, or even more complex structures such as tensors. A basic example of a linear multivariate matrix model is \(Y(X) = XB + U\), where \(Y\) denotes the output, \(X\) is the input matrix, \(B\) is the coefficient matrix, and \(U\) is the error term.
In regression analysis, we aim to determine the unknown coefficient matrix \(B\) in the presence of the random noise matrix \(U\), which prevents a perfect fit to the data. The process involves minimizing some function of the error \(U(B) = Y - XB\) between the observed data \(Y\) and the model's predictions \(XB\). A common choice is a matrix norm \(\|U\|_F^2 = \operatorname{tr}(U^T U)\), a determinant \(\det(U^T U)\), or a more complicated function. Estimating the best-fit coefficients \(B\), analyzing uncertainties, and many other statistical analyses require differentiating such functions with respect to the matrix \(B\).
A recent review by Liu et al. (2022) discusses matrix differential calculus and its applications in multivariate linear models and diagnostics. This article provides valuable insight into the use of matrix calculus in statistical analysis, highlighting its significance in understanding complex data relationships.
Traditional differential calculus courses focus on symbolic calculus, where students learn to perform by hand tasks that software like Mathematica or Wolfram Alpha can handle. While understanding the mechanics behind symbolic differentiation is essential, such courses also touch on numerical methods, such as using the difference quotient to approximate derivatives. However, modern automatic differentiation diverges from both symbolic and numerical approaches, aligning more closely with compiler technology in computer science. Despite this, the mathematical principles underlying automatic differentiation are intriguing and will be explored in this class.
Even approximate computer differentiation is more complicated than you might expect. For single-variable functions f(x), the derivative is defined as the limit of a difference quotient [f(x+δx)−f(x)]/δx as δx → 0. A crude "finite-difference" approximation simply estimates the derivative f′(x) by evaluating this formula with a small but nonzero δx. However, this method introduces several interesting challenges, including the need to balance truncation and roundoff errors, higher-order approximations, and numerical extrapolation techniques.
First Derivatives
The derivative of a single-variable function is itself a single-variable function, essentially representing the linearization of that function: f(x) − f(x₀) ≈ f′(x₀)(x − x₀). This concept is simple for scalar functions, which take one input and produce one output. Various notations may be employed to represent this linearization, e.g. f′(x), df/dx, or the differential form df = f′(x) dx.
This last one, the differential notation df = f′(x) dx, will be the preferred notation in this class. One can think of dx and dy as "really small numbers."
In mathematics, such infinitesimals are made rigorous through the concept of limits. It is important to note that dividing by dx is acceptable for scalars, but this operation is not always valid when dealing with vectors and matrices.
The numerics of such derivatives are simple enough to play around with. For instance, consider the function f(x) = x² and the point (x₀, f(x₀)) = (3, 9). Then we have the following numerical values near (3, 9):

f(3.0001) = 9.00060001
f(3.00001) = 9.0000600001
f(3.000001) = 9.000006000001
f(3.0000001) = 9.00000060000001

Here, the extra digits beyond 3 on the left are Δx and the extra digits beyond 9 on the right are (approximately) Δy. Notice that Δy ≈ 6Δx. Hence, f(3 + Δx) = 9 + Δy ≈ 9 + 6Δx, i.e. f(3 + Δx) − f(3) ≈ 6Δx ≈ f′(3)Δx.

Therefore, the linearization of x² at x = 3 is the function f(x) − f(3) ≈ 6(x − 3).
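As a quick sanity check, here is a minimal sketch in Julia (the language used in this course's notebooks) that reproduces the table above; the function and point are the ones from the example, and the printed ratio Δy/Δx approaches f′(3) = 6:

```julia
# Numerically exploring the linearization of f(x) = x^2 near x = 3.
f(x) = x^2
x0 = 3.0
for Δx in (1e-4, 1e-5, 1e-6, 1e-7)
    Δy = f(x0 + Δx) - f(x0)
    println("Δx = $Δx   Δy = $Δy   Δy/Δx ≈ $(Δy/Δx)")
end
```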
We transition from scalar calculus to the realm of vector and matrix calculus, where Professor Edelman encourages us to view matrices as holistic entities rather than mere tables of numbers.
The concept of linearizing a function extends to defining the derivative for functions with multiple inputs and outputs. Consequently, the derivative will take a different shape than in the single-variable case. The shape of the first derivative depends on the types of the inputs and outputs, as illustrated in the accompanying table: the left side of the table lists the function's inputs and the top lists the corresponding outputs, which can be scalars, vectors, matrices, or higher-order arrays.
In this class, you will gain a comprehensive understanding of differentials and their application to linearization. To illustrate the concept, we examine a specific example.
Let f(x) = xᵀx, where x is a 2×1 matrix and the output is thus a 1×1 matrix (a scalar). Confirm that 2x₀ᵀdx is indeed the differential of f at x₀ = (3, 4)ᵀ.

Suppose dx = (.001, .002)ᵀ. Then we would have

f(x₀ + dx) = (3.001)² + (4.002)² = 25.022005,

while f(x₀) = 3² + 4² = 25 and 2x₀ᵀdx = 2(3 × .001 + 4 × .002) = .022. Hence, we have that

f(x₀ + dx) − f(x₀) = .022005 ≈ 2x₀ᵀdx = .022.

As we will see right now, the 2x₀ᵀdx didn't come from nowhere!
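Here is a minimal Julia sketch of the same check, with the numbers taken from the example above:

```julia
# Verifying that f(x₀+dx) − f(x₀) ≈ 2x₀ᵀdx for f(x) = xᵀx.
f(x) = x' * x
x0 = [3.0, 4.0]
dx = [0.001, 0.002]
println(f(x0 + dx) - f(x0))   # 0.022005...
println(2x0' * dx)            # 0.022
```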
Intro: Matrix and Vector Product Rule
For matrices, we in fact still have a product rule! We will discuss this in much more detail in later chapters, but let’s begin here with a small taste.
Let A, B be two matrices. Then we have the differential product rule for AB:

d(AB) = (dA)B + A(dB).
By the differential dA of the matrix A, we mean a small (unconstrained) change in the matrix A. Later, constraints may be placed on the allowed perturbations.
Notice, however, that (by our table) the differential of a matrix is itself a matrix! So, generally speaking, the factors in these products will not commute.
If x is a vector, then by the differential product rule we have d(xᵀx) = (dxᵀ)x + xᵀ(dx).
However, since each of these terms is a dot product, and dot products commute (Σᵢ aᵢbᵢ = Σᵢ bᵢaᵢ), we have that d(xᵀx) = 2xᵀdx = (2x)ᵀdx.
Remark 3. The way the product rule works for vectors viewed as matrices is that transposes "go for the ride." See the next example below.
By the product rule we have
d(uᵀv) = (du)ᵀv + uᵀ(dv) = vᵀdu + uᵀdv, since dot products commute.
Remark 5. The way to prove these sorts of statements can be seen in Section 2.
Figure 1: The essence of a derivative is linearization: predicting a small change δf in the output f(x) from a small change δx in the input x, to first order in δx.
In this chapter, we explore the concept of derivatives, extending it to higher-order arrays and other vector spaces. We examine differentiation as a linear operator and delve deeper into the key principles discussed previously.
Revisiting single-variable calculus
In a first-semester single-variable calculus course, such as MIT's 18.01, the derivative f′(x) is defined as the slope of the tangent line at the point (x, f(x)). This concept can also be interpreted as a linear approximation of the function f near x: it predicts the change δf in the output resulting from a small change δx in the input, via δf = f(x + δx) − f(x) = f′(x)δx + (higher-order terms).
The higher-order terms can be expressed using asymptotic "little-o" notation, o(δx), which denotes any function that diminishes faster than |δx| as δx approaches 0. For sufficiently small δx, these terms become negligible compared to the linear term f′(x)δx. Examples of such higher-order terms include (δx)², (δx)³, (δx)^1.001, and δx/log(δx).
Remark 6. Here, δx is not an infinitesimal but rather a small number. Note that our symbol "δ" (a Greek lowercase "delta") is not the same as the symbol "∂" commonly used to denote partial derivatives.
The concept of a derivative is fundamental and can be related to the initial terms of a Taylor series, f(x + δx) = f(x) + f′(x)δx + ⋯. This idea is more foundational than the Taylor series itself and can be effectively extended to higher dimensions and other vector spaces.¹ In differential notation, we can express the same idea as

df = f(x + dx) − f(x) = f′(x) dx.

¹Briefly, a function g(δx) is o(δx) if lim_{δx→0} ‖g(δx)‖/‖δx‖ = 0. We will return to this subject in Section 5.2.
In this notation we implicitly drop the o(δx) term that vanishes in the limit as δx becomes infinitesimally small.
We will adopt a more general definition of the derivative that avoids dividing by dx, in order to accommodate cases where dx is something other than a number (such as a vector) and division is not defined.
Linear operators
Directional derivatives
There is an equivalent way to interpret this linear-operator viewpoint of a derivative, which you may have seen before in multivariable calculus: as a directional derivative.
If we have a function f(x) of arbitrary vectors x, then the directional derivative at x in a direction (vector) v is defined as:
\[\lim_{\delta\alpha \to 0} \frac{f(x + \delta\alpha\, v) - f(x)}{\delta\alpha}.\]

This transforms derivatives in arbitrary vector spaces back into single-variable calculus: it quantifies the rate of change of f in the direction v from the point x. This limit has a simple relationship to the linear operator f′(x). If we take dx = dα v for an infinitesimal scalar dα, then to first order

f(x + dα v) − f(x) = f′(x)[dx] = dα f′(x)[v],

where we have factored out the scalar dα in the last step thanks to f′(x) being a linear operator. Comparing with the limit above, we immediately find that the directional derivative is

lim_{δα→0} [f(x + δα v) − f(x)] / δα = f′(x)[v].

It is therefore equivalent to our previous derivative f′(x) (this is an application of the chain rule). This perspective also highlights that it is entirely valid to input an arbitrary non-infinitesimal vector v into f′(x)[v]; the outcome is then not a differential but rather a directional derivative.
Revisiting multivariable calculus, Part 1: Scalar-valued functions
Let f be a scalar-valued function, which takes in "column" vectors x ∈ Rⁿ and produces a scalar (in R). Then df = f(x + dx) − f(x) = f′(x)[dx] = scalar.
The linear operator f′(x), which maps the column vector dx to the scalar df, must therefore be representable as a row vector (also known as a "covector" or dual vector). This row vector is the transpose of the gradient, (∇f)ᵀ.
The gradient ∇f of a real-valued function f(x) points in the "uphill" (steepest-ascent) direction at a point x and is perpendicular to the contours of f. While this direction may not lead directly to the nearest local maximum (unless the contours are circular), it is a fundamental starting point for many computational optimization algorithms that seek a maximum.
That is, f′(x) = (∇f)ᵀ, so that df is the dot ("inner") product of dx with the gradient: df = ∇f · dx = (∇f)ᵀ dx.
Some authors consider the gradient to be a row vector, equating it with the derivative or Jacobian, but in this course we adopt the more common and useful convention of treating it as a column vector. This representation lets the gradient point in the steepest-ascent direction in the space of x, and it generalizes nicely to scalar functions on other vector spaces. For our purposes, we will define ∇f to have the same shape as x, so that df is the dot product of dx with the gradient. This matches the traditional multivariable-calculus view of the gradient as the vector of components ∂f/∂xₖ; equivalently,

df = f(x + dx) − f(x) = ∇f · dx = (∂f/∂x₁) dx₁ + (∂f/∂x₂) dx₂ + ⋯ + (∂f/∂xₙ) dxₙ.
It is important, however, to learn to view vectors as cohesive entities rather than just collections of individual components. This holistic perspective allows many expressions to be differentiated more elegantly, without resorting to component-by-component derivatives, and it generalizes better to more complicated input and output vector spaces.
Let’s look at an example to see how we compute this differential.
Consider f(x) = xᵀAx, where x ∈ Rⁿ and A is a square n×n matrix, so that f(x) ∈ R. Compute df, f′(x), and ∇f.
We can do this directly from the definition:

df = f(x + dx) − f(x) = (x + dx)ᵀA(x + dx) − xᵀAx
   = xᵀAx + dxᵀAx + xᵀA dx + dxᵀA dx − xᵀAx
   = dxᵀAx + xᵀA dx   (dropping the higher-order term dxᵀA dx)
   = xᵀ(A + Aᵀ) dx.
Here, we eliminated the term with two dx factors because it is asymptotically negligible, and we used the fact that dxᵀAx is a scalar (hence equal to its own transpose, xᵀAᵀdx) to combine dxᵀAx and xᵀA dx. Consequently, f′(x) = xᵀ(A + Aᵀ) = (∇f)ᵀ, i.e. ∇f = (A + Aᵀ)x.
Calculating the gradient component-by-component, as learned in multivariable calculus, would involve expressing the function in terms of the components of x as f(x) = xᵀAx = Σᵢⱼ xᵢAᵢⱼxⱼ and then computing ∂f/∂xₖ for each k, which is more cumbersome and error-prone, especially for more complicated functions. This is why it pays to treat vectors and matrices as cohesive entities rather than as simple collections of numbers; a quick numerical check of the result is sketched below.
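For instance, here is a minimal Julia sketch (sizes and inputs chosen arbitrarily) that compares the finite difference f(x+dx) − f(x) against the predicted df = (∇f)ᵀdx = xᵀ(A + Aᵀ)dx:

```julia
# Finite-difference check of ∇f = (A + Aᵀ)x for f(x) = xᵀAx.
using LinearAlgebra
n = 5
A = randn(n, n); x = randn(n); dx = randn(n) * 1e-8
f(x) = x' * A * x
println(f(x + dx) - f(x))        # numerical df
println(((A + A') * x)' * dx)    # predicted df = ∇f ⋅ dx; agrees to ~8 digits
```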
Revisiting multivariable calculus, Part 2: Vector-valued functions
Part 2 of multivariable calculus (18.02 at MIT) considers vector-valued functions, where the input is a vector x ∈ Rⁿ and the output is a vector f(x) ∈ Rᵐ. In this case, the differential df = f(x + dx) − f(x) = f′(x)[dx] is an m-component column vector, while dx is an n-component column vector, and we want a linear operator f′(x) relating the two. Since df depends linearly on dx, f′(x) must be an m×n matrix, called the Jacobian of f!
The Jacobian matrix J represents the linear operator that takes dx to df: df = J dx.
The matrix J has entries Jᵢⱼ = ∂fᵢ/∂xⱼ (the entry in the i-th row and j-th column of J).
So now, suppose that f : R² → R². Let's understand how we would compute the differential of f:

df = \begin{pmatrix} df_1 \\ df_2 \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} \end{pmatrix} \begin{pmatrix} dx_1 \\ dx_2 \end{pmatrix} = J\,dx.
Consider the function f(x) = Ax, where A is a constant m×n matrix. Then, by the distributive law for matrix–vector products, we have df = f(x + dx) − f(x) = A(x + dx) − Ax = A dx. Notice that the linear operator A is its own Jacobian matrix!
Let’s now consider some derivative rules.
• Sum Rule: Given f(x) = g(x) + h(x), we get that df = dg + dh ⟹ f′(x)dx = g′(x)dx + h′(x)dx.
In other words, the derivative of the sum is the sum of the derivatives: f′ = g′ + h′. Viewed as linear operators, f′(x)[v] = g′(x)[v] + h′(x)[v]; adding linear operators in this way is analogous to adding matrices, and indeed linear operators themselves form a vector space.
• Product Rule: Suppose f(x) = g(x)h(x). Then

df = f(x + dx) − f(x) = (g + dg)(h + dh) − gh = gh + dg h + g dh + dg dh − gh = dg h + g dh,
where the term dg dh is higher-order and is therefore omitted in infinitesimal notation, giving df = dg h + g dh. It is important to note that the factors here may not commute, since they are not necessarily scalars in this context.
Let’s look at some short examples of how we can apply the product rule nicely.
Let f(x) = Ax (mapping Rⁿ → Rᵐ), where A is a constant m×n matrix. Then df = d(Ax) = dA x + A dx = A dx (since dA = 0), ⟹ f′(x) = A.
We have dA = 0 here because A does not change when we change x.
Let f(x) = xᵀAx (mapping Rⁿ → R). Then

df = dxᵀ(Ax) + xᵀd(Ax) = dxᵀAx + xᵀA dx = xᵀ(A + Aᵀ)dx = (∇f)ᵀdx,

which gives the derivative f′(x) = xᵀ(A + Aᵀ); when A is symmetric this simplifies to f′(x) = 2xᵀA. Since f is a scalar-valued function, the gradient is ∇f = (A + Aᵀ)x, which likewise simplifies to 2Ax when A is symmetric.
Now let x .* y denote the element-wise product of vectors (also called the Hadamard product), and for convenience below also define diag(x) as the diagonal matrix with x on its diagonal. Then, given A ∈ R^{m×n}, define f : Rⁿ → Rᵐ via f(x) = A(x .* x).
As an exercise, one can verify the following:
(c) d(x .* y) = (dx) .* y + x .* (dy). So if we take y to be a constant and define g(x) = y .* x, its Jacobian matrix is diag(y).
(d) df = A(2x .* dx) = 2A diag(x) dx = f′(x)[dx], so the Jacobian matrix is J = 2A diag(x).
(e) Notice that the directional derivative (Sec. 2.2.1) of f at x in the direction v is simply given by f′(x)[v] = 2A(x .* v). One could also check numerically, for some arbitrary A, x, v, that f(x + 10⁻⁸v) − f(x) ≈ 10⁻⁸ × 2A(x .* v); a sketch of such a check is shown below.
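Here is a minimal Julia sketch of the numerical check suggested in item (e); the sizes and the random A, x, v are arbitrary choices:

```julia
# Checking the Jacobian J = 2A diag(x) of f(x) = A(x .* x).
using LinearAlgebra
m, n = 3, 4
A = randn(m, n); x = randn(n); v = randn(n)
f(x) = A * (x .* x)
println((f(x + 1e-8 * v) - f(x)) / 1e-8)   # ≈ f′(x)[v]
println(2A * (x .* v))                     # exact f′(x)[v] = 2A(x .* v)
println(2A * Diagonal(x) * v)              # same thing via the Jacobian matrix
```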
The Chain Rule
Cost of Matrix Multiplication
Multiplying an m×q matrix by a q×p matrix involves computing mp dot products of length q, each requiring q multiplications and q−1 additions, for a total of roughly 2mpq scalar operations. In computational terms, this cost is "Θ(mpq)," meaning that the work grows asymptotically proportional to mpq for large m, p, and q.
Matrix multiplication is associative, meaning that (AB)C = A(BC) for all matrices A, B, and C. However, one order of evaluation can be far cheaper than the other; for example, multiplying from left to right is much more efficient when the leftmost matrix has only one (or a few) rows. In the same way, the order in which the chain rule is evaluated greatly affects the computational effort required. The left-to-right order, known as "reverse mode" (or "backpropagation"), is particularly advantageous when there are many more inputs than outputs.
So why does the order of the chain rule matter? Consider the following two examples.
When there are many inputs n ≫ 1 and a single output m = 1 (with many intermediate values, say of size q = p = n), the computational cost of reverse mode (left-to-right) is Θ(n²) scalar operations, while forward mode (right-to-left) incurs a much higher cost of Θ(n³). This stark contrast in costs is illustrated in Fig. 3.
Conversely, when there are many outputs m ≫ 1 and a single input n = 1, with many intermediate values q = p = m, reverse mode requires Θ(m³) operations while forward mode needs only Θ(m²). The key takeaway: with many inputs and few outputs (the common case in machine learning and optimization), compute the chain rule left to right (reverse mode); with many outputs and few inputs, compute it right to left (forward mode). Further details are given in Section 8.4; a small demonstration of how much the ordering matters is sketched below.
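The following Julia sketch illustrates the underlying associativity/cost issue with a single "output" row vector gᵀ chained with two large square matrices (the size is an arbitrary choice); both orderings give the same answer, but the left-to-right one is dramatically cheaper:

```julia
# Same product, very different cost: (gᵀA)B costs Θ(m²), gᵀ(AB) costs Θ(m³).
using LinearAlgebra
m = 2000
g = randn(m); A = randn(m, m); B = randn(m, m)
@time (g' * A) * B   # left-to-right ("reverse mode"): two vector–matrix products
@time g' * (A * B)   # right-to-left ("forward mode"): one full matrix–matrix product
# (Run each line twice so the timings exclude compilation.)
```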
Beyond Multi-Variable Derivatives
Now let’s compute some derivatives that go beyond first-year calculus, where the inputs and outputs are in more general vector spaces For instance, consider the following examples:
Let A be an n×n matrix. You could have the following matrix-valued functions of A, for example:
• or U, where U is the resulting matrix after applying Gaussian elimination to A!
You could also have scalar outputs. For example:
• or f(A) = σ₁(A), the largest singular value of A.
Let’s focus on two simpler examples for this lecture.
Let f(A) = A³, where A is a square matrix. Compute df.
Here, we apply the product rule one step at a time:

df = d(A·A·A) = dA A² + A dA A + A² dA = f′(A)[dA].
It is important to note that this expression does not equal 3A² dA unless dA and A commute, which is typically not the case since dA represents an arbitrary small change in A. The right-hand side is a linear operator f′(A) acting on dA, but it cannot be readily interpreted as a single "Jacobian" matrix multiplying dA.
Let f(A) = A⁻¹, where A is a square invertible matrix. Compute df = d(A⁻¹).
Here we employ a clever trick: since AA⁻¹ = I (the identity matrix), the product rule gives 0 = dI = d(AA⁻¹) = dA A⁻¹ + A d(A⁻¹), and solving for d(A⁻¹) yields d(A⁻¹) = −A⁻¹ dA A⁻¹.
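As a quick sanity check, here is a minimal Julia sketch comparing this formula against a finite difference for an arbitrary random A and small random dA:

```julia
# Finite-difference check of d(A⁻¹) = −A⁻¹ dA A⁻¹.
using LinearAlgebra
A = randn(4, 4); dA = randn(4, 4) * 1e-8
numerical = inv(A + dA) - inv(A)
predicted = -inv(A) * dA * inv(A)
println(norm(numerical - predicted) / norm(predicted))   # tiny relative error
```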
In this chapter, we explore how the derivative of a function with matrix inputs and matrix outputs, viewed as a linear operator, can be represented by an explicit Jacobian matrix. While traditional linear algebra usually involves matrices multiplying vectors, it is important to understand linear operations more broadly. We introduce the technique of matrix "vectorization" and the Kronecker product, which provide this alternative perspective on linear operators. However, it is also crucial to recognize that the explicit Jacobian-matrix approach can sometimes obscure essential structure and may be computationally inefficient.
This section includes a link to the Pluto Notebook, which provides computational demonstrations in Julia, showcasing various perspectives on the derivative of the square of 2×2 matrices.
Derivatives of matrix functions: Linear operators
The derivative f′ is a linear operator that maps small changes in the input to the corresponding small changes in the output. This concept can seem more complicated when applied to functions f(A) that map matrix inputs A to matrix outputs. For instance, we have previously examined specific functions operating on square m×m matrices:
• f(A) = A³, which gives df = f′(A)[dA] = dA A² + A dA A + A² dA.
• The matrix-square function f(A) = A². By the product rule, the differential is df = f′(A)[dA] = dA A + A dA. Alternatively, this follows explicitly from df = f(A + dA) − f(A) = (A + dA)² − A² = dA A + A dA, neglecting the (dA)² term.
In both cases, the derivative f′(A) is given by a simple formula relating an arbitrary change dA in A to the resulting change in f, i.e. f(A + dA) − f(A) to first order. Note that we can plug any matrix X into this formula, not just an "infinitesimal" change dA. For matrix squaring, for instance, f′(A)[X] = XA + AX for an arbitrary X. This relationship is linear in X: scaling or adding inputs scales or adds the outputs, e.g. f′(A)[2X] = 2f′(A)[X] and f′(A)[X + Y] = f′(A)[X] + f′(A)[Y].
There is no rule that a linear operation must be written in the traditional form f′(A)[X] = (some matrix) × (X as a column vector); a formula like XA + AX is a perfectly good way to define a linear operation, and it is easy to write down, understand, and compute with.
Nevertheless, it is often conceptually and computationally useful to think of the derivative as a single "Jacobian" matrix: any linear operator can be represented by a matrix once we choose a basis for the input and output vector spaces. In this section we adopt a conventional "Cartesian" basis of matrix entries via "vectorization," which lets us express linear operators such as AX + XA in matrix form, at the cost of introducing a new type of matrix product (the Kronecker product) that crops up in many applications.
A simple example: The two-by-two matrix-square function
The matrix-squaring four-by-four Jacobian matrix
To understand Jacobians of functions that map matrices to matrices, let's begin with a basic question. Question 24: What is the size of the Jacobian of the matrix-square function?
For a 2×2 matrix A, the matrix-squaring function can be analyzed through its vectorized equivalent, a map R⁴ → R⁴, whose 4×4 Jacobian holds the derivatives of each output component with respect to each input component. For a general m×m matrix A, the Jacobian of f(A) = A² is an m²×m² matrix, because there are m² inputs and m² outputs. Although calculating these partial derivatives by hand is tedious, symbolic computational tools in Julia or Mathematica make it straightforward. In the m = 2 case, one can either derive the Jacobian manually or use Julia's symbolic capabilities.
For example, the first row of f̃′ consists of the partial derivatives of p² + qr (the first output) with respect to the 4 inputs p, q, r, and s. Here, we have labeled the rows by the (row, column) indices (j_out, k_out) of the entries in the output matrix d(A²), and the columns by the indices (j_in, k_in) of the entries in the input matrix A. While we have written the Jacobian f̃′ as a "2d" matrix, it can also be thought of as a "4d" array indexed by (j_out, k_out, j_in, k_in).
The matrix-calculus perspective of the derivative f′(A) as a linear transformation on matrices, f′(A)[X] = XA + AX, is more insightful than a long list of Jacobian components f̃′: it gives a formula that applies to m×m matrices of any size without computing m⁴ partial derivatives individually.
To make the vectorization perspective useful, we need to recover some of the structure that is lost in tedious componentwise differentiation. A key tool for connecting the two viewpoints is the Kronecker product, a matrix operation that may be unfamiliar to many readers.
Kronecker Products
Key Kronecker-product identity
In order to convert linear operations like AX + XA into Kronecker products via vectorization, the key identity is:
Proposition 27. Given (compatibly sized) matrices A, B, C, we have

(A ⊗ B) vec(C) = vec(BCAᵀ).
That is, A ⊗ B can be interpreted as the vectorized form of the linear operation that maps C to BCAᵀ. It may be tempting to denote the non-vectorized version of this operation as (A ⊗ B)[C] = BCAᵀ, but be warned that this notation is not standard.
One possible mnemonic for this identity is that the B is just to the left of the C, while the A "circles around" to the right and gets transposed.
The identity is easiest to understand by considering the simpler cases in which either A or B is an identity matrix I of the appropriate size. First, set A = I, so that BCAᵀ = BC. How is vec(BC) related to vec(C)? Let ⃗c₁, ⃗c₂, … denote the columns of C; then BC simply multiplies B on the left with each of the columns of C:

vec(BC) = \begin{pmatrix} B\vec{c}_1 \\ B\vec{c}_2 \\ \vdots \end{pmatrix}.

Now, how can we get this vec(BC) vector as something multiplying vec(C)? It should be apparent that

vec(BC) = \begin{pmatrix} B & & \\ & B & \\ & & \ddots \end{pmatrix} \begin{pmatrix} \vec{c}_1 \\ \vec{c}_2 \\ \vdots \end{pmatrix},

but this block-diagonal matrix is exactly the Kronecker product I ⊗ B! Hence, we have derived that vec(BC) = (I ⊗ B) vec(C).
What about the Aᵀ term? To isolate it, set B = I, so that BCAᵀ = CAᵀ. To vectorize this expression, we examine the columns of CAᵀ. The first column of CAᵀ is a linear combination of the columns of C whose coefficients come from the first column of Aᵀ (i.e., the first row of A): namely, Σⱼ a₁ⱼ ⃗cⱼ. Likewise, the k-th column of CAᵀ is Σⱼ aₖⱼ ⃗cⱼ. To obtain vec(CAᵀ), we stack these columns on top of one another, which works out to multiplying the stacked vector vec(C) by a matrix of scaled identity blocks aₖⱼ I:

vec(CAᵀ) = \begin{pmatrix} a_{11} I & a_{12} I & \cdots \\ a_{21} I & a_{22} I & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \begin{pmatrix} \vec{c}_1 \\ \vec{c}_2 \\ \vdots \end{pmatrix} = (A ⊗ I)\, vec(C),

and hence we have derived vec(CAᵀ) = (A ⊗ I) vec(C).
The full identity (A ⊗ B) vec(C) = vec(BCAᵀ) can then be obtained by straightforwardly combining these two derivations: replace CAᵀ with BCAᵀ in the second derivation, which replaces ⃗cⱼ with B⃗cⱼ and hence I with B.
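The identity is also easy to check numerically; here is a minimal Julia sketch with arbitrary (compatibly sized) random matrices:

```julia
# Numerical check of the identity (A ⊗ B) vec(C) = vec(BCAᵀ).
using LinearAlgebra
B = randn(3, 4); C = randn(4, 5); A = randn(2, 5)
println(kron(A, B) * vec(C) ≈ vec(B * C * A'))   # true
```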
The Jacobian in Kronecker-product notation
We can use Proposition 27 to express the Jacobian of f(A) = A² in Kronecker-product form. Let dA play the role of C in Proposition 27; then

vec(dA A + A dA) = vec(I dA A) + vec(A dA I) = (Aᵀ ⊗ I + I ⊗ A) vec(dA),

so the Jacobian of the vectorized function is f̃′(vec A) = Aᵀ ⊗ I + I ⊗ A, where I is the identity matrix of the same size as A. We can also write this in our "non-vectorized" linear-operator notation: f′(A)[dA] = dA A + A dA.
In the 2×2 example, these Kronecker products can be computed explicitly, and the resulting 4×4 matrix Aᵀ ⊗ I + I ⊗ A = f̃′ exactly matches our laboriously computed Jacobian f̃′ from earlier!
For the matrix-cube function A³, where A is an m×m square matrix, compute the m²×m² Jacobian of the vectorized function vec(A³).
Rather than computing element-by-element partial derivatives, it is far easier (and more elegant) to use Kronecker products. Recall that our familiar matrix-calculus derivative is the linear operator

(A³)′[dA] = dA A² + A dA A + A² dA,

which now vectorizes by three applications of the Kronecker identity:

vec(dA A² + A dA A + A² dA) = [(A²)ᵀ ⊗ I + Aᵀ ⊗ A + I ⊗ A²] vec(dA),

so the m²×m² Jacobian of vec(A³) is (A²)ᵀ ⊗ I + Aᵀ ⊗ A + I ⊗ A².
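Here is a minimal Julia sketch that checks this Kronecker-product Jacobian against a finite-difference Jacobian built one column (one perturbed entry of A) at a time; the size m = 3 and the random A are arbitrary:

```julia
# Kronecker-product Jacobian of vec(A³) vs. a finite-difference Jacobian.
using LinearAlgebra
m = 3
A = randn(m, m)
Id = Matrix(1.0I, m, m)
J = kron((A^2)', Id) + kron(A', A) + kron(Id, A^2)

δ = 1e-6
Jfd = zeros(m^2, m^2)
for k in 1:m^2
    dA = zeros(m, m); dA[k] = δ            # perturb the k-th entry (column-major)
    Jfd[:, k] = vec(((A + dA)^3 - A^3) / δ)
end
println(norm(J - Jfd) / norm(J))           # small, limited by truncation error
```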
You can similarly work out the Jacobians of A⁴, A⁵, and so on, or of linear combinations thereof. The same approach applies, via the Taylor series, to any analytic matrix function f(A), although it quickly becomes awkward. In later lectures and homework we will explore more elegant ways to differentiate various matrix functions, in terms of linear operators rather than vectorized Jacobians.
The computational cost of Kronecker products
Kronecker products are often better viewed as a conceptual tool than as a computational one: using them naively for computation can increase the cost of a matrix problem far beyond what is actually required.
The cost of multiplying two m×m matrices is Θ(m³), so the linear operation C ↦ BCAᵀ costs Θ(m³) via its two m×m multiplications. Alternatively, one could use the vectorized form vec(BCAᵀ) = (A ⊗ B) vec(C) to compute the same result in two steps:
1. Form the m²×m² matrix A ⊗ B. This requires m⁴ multiplications (all entries of A by all entries of B) and ∼m⁴ memory storage. (Compare to ∼m² memory to store A or B. If m is 1000, this is a million times more storage: terabytes instead of megabytes!)
2. Multiply A ⊗ B by the vector vec C of m² entries. Multiplying an n×n matrix by a vector requires ∼n² operations, and here n = m², so this is again ∼m⁴ arithmetic operations.
Therefore, computing BCAᵀ via (A ⊗ B) vec C requires ∼m⁴ operations and ∼m⁴ storage, far worse than the ∼m³ operations and ∼m² storage needed to compute BCAᵀ directly. The inefficiency arises because A ⊗ B is a very special m²×m² matrix, and this approach does not exploit its structure.
There are many examples of this nature. Another famous one involves solving linear matrix equations.
The Sylvester equation AX + XB = C, for an unknown matrix X given m×m matrices A, B, and C, can be transformed into an ordinary system of m² linear equations using Kronecker products: vec(AX + XB) = (I ⊗ A + Bᵀ ⊗ I) vec X = vec C, which can be solved for the m² unknowns vec X by Gaussian elimination. However, solving an m²×m² system by Gaussian elimination costs ∼m⁶ operations. Fortunately, there are clever algorithms that solve AX + XB = C in only ∼m³ operations, reducing the work by a factor of about m³ (one billion for m = 1000).
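For small m, the Kronecker formulation above is easy to try directly; here is a minimal Julia sketch (random A, B, C of an arbitrary small size) that forms and solves the m²×m² system and verifies the result. (Julia's LinearAlgebra standard library also provides a sylvester solver based on the much more efficient ∼m³ algorithms.)

```julia
# Solving AX + XB = C via the (inefficient) Kronecker-product approach.
using LinearAlgebra
m = 4
A, B, C = randn(m, m), randn(m, m), randn(m, m)
Id = Matrix(1.0I, m, m)
X = reshape((kron(Id, A) + kron(B', Id)) \ vec(C), m, m)
println(norm(A * X + X * B - C))   # ≈ 0: X solves the Sylvester equation
```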
Kronecker products serve as an efficient computational tool for sparse matrices, which are characterized by having predominantly zero entries The advantage lies in the fact that the Kronecker product of two sparse matrices remains sparse, thus mitigating the significant storage demands associated with dense matrices This property makes it particularly useful for constructing large sparse systems of equations, such as those encountered in multidimensional partial differential equations (PDEs).
In this section, we will be referring to this Julia notebook for calculations that are not included here.
Why compute derivatives approximately instead of exactly?
Calculating derivatives by hand is error-prone, especially for complex functions. Even though each step may be simple, it is easy to make a mistake in the derivation or in its implementation on a computer. To catch such mistakes, it is essential to verify your results against an independent calculation. The most straightforward check is a finite-difference approximation, which estimates the derivative by comparing f(x) and f(x + δx) for one or more finite (non-infinitesimal) perturbations δx.
Finite-difference techniques come in varying levels of sophistication, and they inherently incur truncation error because δx is not infinitesimal. (Making δx too small is also problematic, because it leads to large roundoff errors.) They additionally become expensive in higher dimensions, since a separate finite difference is required for each input dimension to build up the full Jacobian, so they are a last resort for computing derivatives accurately. However, they are a valuable first step in checking derivatives: even a crude finite-difference approximation will typically catch bugs in an analytical derivative calculation.
Ideally, software and compilers compute analytical derivatives for you via automatic differentiation (AD). However, AD tools may struggle with code that calls external libraries or with certain mathematical structures, leading to failures or inefficiencies. In such situations, it is often easier to define the derivative of a small portion of the program manually than to differentiate the entire codebase, and it is then advisable to perform a finite-difference check that the manual derivative is correct.
It turns out that finite-difference approximations are a surprisingly complicated subject, with rich connections to many areas of numerical analysis; in this lecture we will just scratch the surface.
Finite-Difference Approximations: Easy Version
The simplest way to check a derivative is to recall that the definition of the differential, df = f(x + dx) − f(x) = f′(x)dx, came from dropping higher-order terms from a small but finite difference: δf = f(x + δx) − f(x) = f′(x)δx + o(‖δx‖).
So, the forward-difference approximation simply compares f(x + δx) − f(x) to the predicted f′(x)δx; conversely, the backward-difference approximation uses f(x) − f(x − δx) ≈ f′(x)δx. For the purpose of checking derivatives, there is little practical difference between forward and backward differences.
In certain Julia automatic differentiation (AD) software, the process is achieved by defining a "ChainRule," while in Python libraries like autograd and JAX, it involves creating a custom "vJp" (row-vector Jacobian product) and/or "Jvp" (Jacobian-vector product).
The distinction becomes more important when discretizing (approximating) differential equations. We'll look at other possibilities below.
Remark 29. Note that this definition of forward and backward difference is not the same as forward- and backward-mode differentiation—these are unrelated concepts.
When δx is a scalar, we can divide and approximate the derivative as f′(x) ≈ [f(x + δx) − f(x)]/δx + (higher-order corrections). This is the formulation most commonly seen, but it only works when x (and hence δx) is a scalar; here, we want to allow x to live in some other vector space.
Finite-difference approximations are often used as a last resort when deriving analytical derivatives is too complex and automatic differentiation (AD) is not applicable They serve as a valuable tool for verifying analytical derivatives and for conducting quick exploratory analyses.
Example: Matrix squaring
Let's try the finite-difference approximation for the square function f(A) = A², where here A is a square matrix. The product rule gives the exact derivative f′(A)[δA] = A δA + δA A. However, this is not equal to 2A δA, because A and δA do not generally commute. To illustrate the difference, we evaluate a finite difference with a random input A and a small random perturbation δA.
Concretely, take a random 4×4 matrix A and a random perturbation dA with entries of order 10⁻⁸ (e.g. a random matrix scaled by 10⁻⁸). Comparing f(A + dA) − f(A) to A dA + dA A, the difference between the finite-difference approximation and the exact product rule has entries of order 10⁻¹⁶; in contrast, the difference from 2A dA has entries of order 10⁻⁸.
To quantify the accuracy of an approximation, we can compute the norm ‖approx − exact‖ and check that it is small. But small compared to what? The natural reference is the correct answer itself, which leads to the notion of relative error (also called fractional error): relative error = ‖approx − exact‖ / ‖exact‖.
Here, the norm ‖·‖ measures the "length" of a vector (or matrix), letting us quantify the magnitude of the error in the finite-difference approximation; see also Section 4.1.
So, as above, you can compute that the relative error between the finite-difference approximation and the exact answer A dA + dA A is about 10⁻⁸, which strongly suggests that our exact answer is correct. In contrast, the relative error between the approximation and 2A dA is of order unity (not small at all). A good match for a random input and a small random displacement is not a proof of correctness, of course, but it is a very useful check: this kind of randomized comparison reliably catches major bugs, such as a mistake in deriving the symbolic derivative, as the 2A dA example illustrates. A minimal version of this check is sketched below.
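Here is a minimal Julia sketch of this randomized relative-error check (the 4×4 size and 10⁻⁸ scale follow the discussion above):

```julia
# Randomized finite-difference check for f(A) = A².
using LinearAlgebra
A  = randn(4, 4)
dA = randn(4, 4) * 1e-8
approx = (A + dA)^2 - A^2
exact  = A * dA + dA * A                        # product-rule answer
wrong  = 2A * dA                                # forgets that A, dA don't commute
println(norm(approx - exact) / norm(exact))     # ≈ 1e-8: consistent
println(norm(approx - wrong) / norm(wrong))     # order unity: clearly a bug
```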
Figure 4 illustrates the accuracy of the forward-difference method for the function \( f(A) = A^2 \), depicting the relative error in \( \delta f = f(A + \delta A) - f(A) \) in relation to the linearization \( f'(A) \delta A \) This relationship is analyzed as a function of the magnitude \( \| \delta A \| \) Here, \( A \) is a \( 4 \times 4 \) matrix with unit-variance Gaussian random entries, while \( \delta A \) represents a unit-variance Gaussian random perturbation scaled by a factor \( s \) that varies from \( 1 \) to \( 10^{-16} \).
The matrix norm computed by norm(A) in Julia is the direct analogue of the Euclidean norm for vectors: the square root of the sum of the squares of the matrix entries, ‖A‖ = √(Σᵢⱼ |Aᵢⱼ|²) = √(tr(AᵀA)). This is called the Frobenius norm.
Accuracy of Finite Differences
Now how accurate is our finite-difference approximation above? How should we choose the size of δx?
Let us examine the function f(A) = A² and plot the relative error as a function of ‖δA‖. The results are shown on a logarithmic scale (a log-log plot), which makes power-law relationships appear as straight lines.
We notice two main features as we decrease ‖δA‖:
1. The relative error at first decreases linearly with ‖δA‖. This is called first-order accuracy. Why?
2. When δA gets too small, the error increases. Why?
Order of accuracy
The truncation error arises because the input perturbation δx is not infinitesimal: we are computing a difference, not a derivative. When the truncation error in the derivative scales proportionally to ‖δx‖ⁿ, the approximation is called n-th order accurate. For forward differences, the order is n = 1.
For any f(x) with a nonzero second derivative (think of the Taylor series), we have f(x + δx) = f(x) + f′(x)δx + (terms proportional to ‖δx‖²) + o(‖δx‖²).
That is, the terms we dropped in our forward-difference approximation are proportional to ‖δx‖². But that means the relative error is linear in ‖δx‖:

relative error = ‖f(x + δx) − f(x) − f′(x)δx‖ / ‖f′(x)δx‖
               = [(terms proportional to ‖δx‖²) + o(‖δx‖²)] / (something proportional to ‖δx‖)
               = (terms proportional to ‖δx‖) + o(‖δx‖).
Hence the forward difference is first-order accurate: its truncation error comes from using a non-infinitesimal δx. It is tempting to simply make δx as small as possible to reduce this error, but it is essential to consider what happens when we do.
Roundoff error
The increase in error for very small values of δA is attributed to roundoff errors inherent in floating-point arithmetic, where computers store a limited number of significant digits (approximately 15 decimal digits) and round off excess digits during calculations When δx is too small, the difference \( f(x + \delta x) - f(x) \) may be rounded to zero due to catastrophic cancellation, where significant digits cancel out Floating-point arithmetic resembles scientific notation, with a finite-precision coefficient scaled by a power of 10 (or 2 in computing) The precision in 64-bit floating-point arithmetic is characterized by machine epsilon, approximately \( 2.22 \times 10^{-16} \) The roundoff error when rounding a real number \( y \) to the nearest floating-point value is bounded by \( |y - \tilde{y}| \leq \epsilon |y| \), indicating that the computer retains about 15-16 decimal digits or 53 binary digits for each number.
In this finite-difference calculation, as long as ‖δA‖ ≳ 10⁻⁸‖A‖ ≈ √ε ‖A‖, the truncation error dominates the error in the approximation of f′(A); shrinking δA further causes the relative error to grow again due to roundoff. A common rule of thumb is therefore to take ‖δx‖ ≈ √ε ‖x‖, i.e. to sacrifice about half of the significant digits, which is usually safe. However, the exact location of the minimum error depends on the function f and the finite-difference method used, so this rule of thumb is not entirely reliable.
Other finite-difference methods
More sophisticated finite-difference methods exist, such as Richardson extrapolation, which uses a sequence of progressively smaller δx values to extrapolate the estimate of f′ toward δx → 0 using higher-degree polynomials. Another improvement is to use higher-order difference formulas, whose truncation error shrinks faster than linearly with δx. The most well known is the centered difference f′(x) ≈ [f(x + δx) − f(x − δx)]/(2δx), which is second-order accurate: its relative truncation error is proportional to ‖δx‖². A small comparison of forward and centered differences is sketched below.
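For example, here is a minimal Julia sketch comparing the two formulas on the (arbitrarily chosen) function sin(x) at x = 1, whose exact derivative is cos(1):

```julia
# Forward vs. centered differences: first- vs. second-order truncation error.
exact = cos(1.0)
for δ in (1e-2, 1e-3, 1e-4)
    fwd = (sin(1.0 + δ) - sin(1.0)) / δ
    ctr = (sin(1.0 + δ) - sin(1.0 - δ)) / (2δ)
    println("δ=$δ  forward error=$(abs(fwd - exact))  centered error=$(abs(ctr - exact))")
end
```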
Higher-dimensional inputs present a significant challenge for finite-difference techniques, as each dimension of the input requires a separate finite difference to compute the gradient For instance, when dealing with a function \( f(x) \) in \( \mathbb{R}^n \), obtaining the full gradient \( \nabla f \) necessitates \( n \) distinct finite differences This approach becomes costly and impractical for high-dimensional optimization problems, such as those encountered in neural networks, where \( n \) can be very large However, when using finite differences primarily for debugging purposes, it is often adequate to evaluate \( f(x + \delta x) - f(x) \) against \( f'(x)[\delta x] \) in a few random directions with small perturbations \( \delta x \).
5 Derivatives in General Vector Spaces
Matrix calculus extends the concepts of derivatives and gradients to functions with inputs and outputs that are not just scalars or column vectors This involves generalizing the traditional vector dot product and Euclidean vector length to accommodate general inner products and norms in vector spaces We will begin by examining well-known matrices through this broader perspective.
A set V is called a "vector space" in linear algebra if its elements can be added or subtracted (x ± y) and multiplied by scalars (αx), subject to the basic arithmetic axioms such as the distributive law. For instance, the set of m×n matrices forms a vector space, as does the set of continuous functions u(x).
For real-valued continuous functions, for example, the key point is that you can add, subtract, and scale them and remain within the same set. This property is highly useful when extending differentiation to other spaces, such as functions that map matrices to matrices or functions that produce numerical outputs. A key additional ingredient is that the input and output vector spaces V possess a norm and, ideally, an inner product.
A Simple Matrix Dot Product and Norm
Recall that for scalar-valued functions f(x) ∈ R with vector inputs x ∈ Rⁿ (i.e., n-component "column vectors"), we have df = f(x + dx) − f(x) = f′(x)[dx] ∈ R.
The derivative f′(x) is a linear operator that maps the vector dx to a scalar; equivalently, it can be represented by the row vector (∇f)ᵀ. From this perspective, the differential df is the dot product (inner product) df = ∇f · dx.
More generally, for a scalar-valued function f and a vector x in any vector space V, the derivative f′(x)[dx] ∈ R is a linear operator mapping vectors to scalars, known as a "linear form." To define the gradient ∇f from it, we need an inner product on V, the vector-space generalization of the familiar dot product.
Given x, y ∈ V, the inner product ⟨x, y⟩ is a map that assigns to the pair (x, y) a real number ⟨x, y⟩ ∈ R. It is also commonly denoted x · y or ⟨x|y⟩. More technically, an inner product is a map that is:
1. Symmetric: ⟨x, y⟩ = ⟨y, x⟩ (or conjugate-symmetric,⁴ ⟨x, y⟩ = \overline{⟨y, x⟩}, if we were using complex numbers),
2. Linear: ⟨x, αy₁ + βy₂⟩ = α⟨x, y₁⟩ + β⟨x, y₂⟩ for scalars α, β, and
3. Non-negative: ⟨x, x⟩ := ‖x‖² ≥ 0, and = 0 if and only if x = 0.
Together, the first two properties imply that the inner product is also linear in the left vector (or conjugate-linear, in the complex case). An important consequence of these properties is the Cauchy–Schwarz inequality, |⟨x, y⟩| ≤ ‖x‖ ‖y‖.
A "row vector" can be formally defined as a "covector," "dual vector," or an element of a "dual space," which should not be mistaken for the dual numbers utilized in automatic differentiation.
⁴Some authors distinguish the "dot product" from the "inner product" for complex vector spaces: the dot product has no complex conjugation, so that x · y = y · x and x · x may be non-real (and is not equal to ‖x‖²), whereas the inner product must be conjugate-symmetric, e.g. ⟨x, y⟩ = x · ȳ. This distinction can cause confusion, and different fields adopt different conventions.
With the convention ⟨x, y⟩ = x · ȳ, the right argument is conjugated rather than the left, so the inner product is linear in the left argument and conjugate-linear in the right. None of this matters for us here, since we focus on real numbers.
A (complete) vector space with an inner product is called a Hilbert space. (The technical requirement of "completeness" essentially means that you can take limits in the space: every Cauchy sequence of points converges to a limit within the space. It matters mainly for rigorous proofs; it is automatic for finite-dimensional vector spaces over the real or complex numbers, but it becomes trickier for vector spaces of functions, where the limit of a sequence of continuous functions may be discontinuous.)
In a Hilbert space, we can define the gradient of a scalar-valued function: for such an f(x) on a Hilbert space V, the derivative f′(x) is a linear form mapping dx ∈ V to f′(x)[dx] ∈ R, and the Riesz representation theorem tells us that any such linear form can be written as an inner product with some vector.
That is, the gradient ∇f is defined as the vector you take the inner product of dx with to get df: df = ⟨∇f, dx⟩. Note that ∇f always has the "same shape" as x.
The first few examples we look at involve the usual Hilbert space V = Rⁿ with different inner products.
Given V = Rⁿ with n-component column vectors, we have the familiar Euclidean dot product ⟨x, y⟩ = xᵀy. This leads to the usual ∇f.
We can also have different inner products on Rⁿ. For instance:
More generally, we can define a weighted dot product ⟨x, y⟩_W = xᵀWy for any symmetric positive-definite matrix W (W = Wᵀ and W positive definite, which suffices for this to be a valid inner product).
Changing the inner product changes the definition of the gradient. For instance, for f(x) = xᵀAx we found df = xᵀ(A + Aᵀ)dx, giving the gradient ∇f = (A + Aᵀ)x under the usual Euclidean inner product. With the weighted inner product xᵀWy, however, the gradient becomes ∇⁽ᵂ⁾f = W⁻¹(A + Aᵀ)x, so that df = ⟨∇⁽ᵂ⁾f, dx⟩_W.
In this course, we will use the Euclidean inner product for x ∈ Rⁿ, and hence the usual ∇f, unless noted otherwise. But it is important to remember that weighted inner products can be useful in a variety of situations, e.g. when the components of x have different scales or units.
The space V = R^{m×n} of m×n matrices is isomorphic as a vector space to R^{mn}, via the map A ↦ vec(A). We can therefore define an analogue of the familiar Euclidean inner product, the Frobenius inner product, which can be conveniently expressed in terms of matrix operations via the trace.
The Frobenius inner product of two m×n matrices A and B is

⟨A, B⟩_F = Σᵢⱼ Aᵢⱼ Bᵢⱼ = vec(A)ᵀ vec(B) = tr(AᵀB).

Given this inner product, we also have the corresponding Frobenius norm ‖A‖_F = √⟨A, A⟩_F = √(Σᵢⱼ |Aᵢⱼ|²) = √(tr(AᵀA)) = ‖vec(A)‖.
With this inner product (our default matrix inner product, for which we will sometimes omit the F subscript), we can now define the gradient of scalar-valued functions with matrix inputs. As a first example, consider f(A) = ‖A‖_F = √(tr(AᵀA)), and let us compute ∇f.

Firstly, by the familiar scalar chain and power rules, we have

df = d√(tr(AᵀA)) = d(tr(AᵀA)) / (2√(tr(AᵀA))).

Then, note that (by linearity of the trace) d(tr B) = tr(B + dB) − tr(B) = tr(B) + tr(dB) − tr(B) = tr(dB), so

d(tr(AᵀA)) = tr(d(AᵀA)) = tr(dAᵀ A + Aᵀ dA) = tr(dAᵀ A) + tr(Aᵀ dA) = 2 tr(Aᵀ dA),

where we used the fact that tr B = tr Bᵀ. Putting these together,

df = tr(Aᵀ dA) / ‖A‖_F = ⟨A / ‖A‖_F, dA⟩_F,

and in the last step we connected df with a Frobenius inner product. In other words,

∇f = ∇‖A‖_F = A / ‖A‖_F.

Note that one obtains exactly the same result for column vectors x, i.e. ∇‖x‖ = x/‖x‖ (and in fact this is equivalent via x = vec A).
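Here is a minimal Julia sketch of this result (recall that norm(A) in Julia is the Frobenius norm, and dot(A, B) is the corresponding Frobenius inner product), using an arbitrary random A and a small random dA:

```julia
# Checking ∇‖A‖_F = A / ‖A‖_F against a finite difference.
using LinearAlgebra
A = randn(5, 5); dA = randn(5, 5) * 1e-8
println(norm(A + dA) - norm(A))     # finite-difference df
println(dot(A / norm(A), dA))       # ⟨∇f, dA⟩_F — should match closely
```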
Let’s consider another simple example:
Fix some constant x ∈ Rᵐ and y ∈ Rⁿ, and consider the function f : R^{m×n} → R given by f(A) = xᵀAy.
We have that df = xᵀ dA y = tr(y xᵀ dA) = ⟨x yᵀ, dA⟩_F, and hence ∇f = x yᵀ.
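A minimal Julia check of this gradient, with arbitrary sizes and random inputs:

```julia
# Checking ∇f = xyᵀ for f(A) = xᵀAy.
using LinearAlgebra
m, n = 3, 4
x = randn(m); y = randn(n); A = randn(m, n); dA = randn(m, n) * 1e-8
println(x' * (A + dA) * y - x' * A * y)   # finite-difference df
println(dot(x * y', dA))                  # ⟨∇f, dA⟩_F ≈ df
```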
More generally, for any scalar-valued function f(A), it follows from the definition of the Frobenius inner product that

df = f(A + dA) − f(A) = ⟨∇f, dA⟩ = Σᵢⱼ (∇f)ᵢⱼ dAᵢⱼ,

and hence the components of the gradient are exactly the elementwise derivatives (∇f)ᵢⱼ = ∂f/∂Aᵢⱼ, reminiscent of the component-wise definition of the gradient vector in multivariable calculus. But for non-trivial matrix-input functions f(A), it can be quite challenging to take the derivative with respect to each entry of A individually. Using the "holistic" matrix inner-product definition, we will soon be able to compute even more complicated matrix gradients, including ∇(det A)!
Derivatives, Norms, and Banach spaces
In this class, we have made frequent use of the term "norm," a fundamental concept in mathematics. The most familiar example is the Euclidean norm ‖x‖ = √(Σᵢ xᵢ²) for x ∈ Rⁿ, but it is important to understand how norms extend to other vector spaces. In particular, norms play a vital role in defining derivatives.
Given a vector space V, a norm ‖·‖ on V is a map ‖·‖ : V → R satisfying the following three properties (the standard norm axioms):

1. Non-negative: ‖x‖ ≥ 0, with ‖x‖ = 0 if and only if x = 0;
2. Homogeneous: ‖αx‖ = |α| ‖x‖ for any scalar α;
3. Triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖.
A vector space that has a norm is called a normed vector space. Often, mathematicians technically want a slightly more precise type of normed vector space with a less obvious name: a Banach space.
A (complete) vector space with a norm is called a Banach space. (As with Hilbert spaces, "completeness" is a technical requirement for some types of rigorous analysis, essentially allowing you to take limits.)
For example, given any inner product ⟨u, v⟩, there is a corresponding norm ‖u‖ = √⟨u, u⟩. (Thus, every Hilbert space is also a Banach space.⁵)
To define derivatives, we technically need both the input and the output to be Banach spaces. To see this, recall our formalism f(x + δx) − f(x) = f′(x)[δx] + o(δx).
To define precisely the sense in which the o(δx) terms are "smaller" or "higher-order," we need norms. In particular, the "little-o" notation o(δx) denotes any function such that lim_{δx→0} ‖o(δx)‖/‖δx‖ = 0, i.e. it vanishes faster than linearly as δx → 0. This generalized notion of a derivative—requiring both the input and the output to have norms, and defined by the condition that the remainder ‖f(x + δx) − f(x) − f′(x)[δx]‖ goes to zero faster than linearly in ‖δx‖—is called the Fréchet derivative, which extends differentiation to arbitrary normed (Banach) spaces.
5 Proving the triangle inequality for an arbitrary inner product is not so obvious; one uses a result called the Cauchy–Schwarz inequality.
6 Nonlinear Root-Finding, Optimization, and Adjoint Differentiation
In this section, we will explore the fundamental reasons for computing derivatives and delve deeper into their significance, followed by a discussion on the methods used for derivative computation.
Newton’s Method
Scalar Functions
Suppose we want to find a root of a scalar function f : R → R, i.e. a point x where f(x) = 0—for example, f(x) = x³ − sin(cos x). For complicated functions there is usually no explicit solution, but Newton's method lets us approximate a root to any desired accuracy, provided we have a reasonable initial guess. The idea is to use linearization to approximate the function by a straight line, find the root of that approximation, and use it as the new guess.
• Linearize \( f(x) \) near some \( x \) using the approximation \( f(x+\delta x) \approx f(x) + f'(x)\,\delta x \),
• and then use this to update the value of \( x \) we linearized near, i.e. letting the new \( x \) be the root of the linear approximation: \( x_{\text{new}} = x - f(x)/f'(x) \).
Once you are close to the root, Newton's method converges amazingly quickly: as discussed below, it asymptotically doubles the number of correct digits on every step!
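As a quick illustration (a minimal sketch, not the course's code; the helper name `newton` and the starting guess are mine), here is this iteration applied to the example \( f(x) = x^3 - \sin(\cos x) \) from above, with a hand-coded derivative; the printed residual shrinks extremely rapidly once the iterate gets close to the root:

```julia
# Minimal sketch of the scalar Newton iteration, with a hand-coded derivative.
f(x)  = x^3 - sin(cos(x))
fp(x) = 3x^2 + sin(x) * cos(cos(x))    # f′(x), by the chain rule

function newton(f, fp, x; steps = 6)
    for k in 1:steps
        x -= f(x) / fp(x)              # Newton update: x ← x − f(x)/f′(x)
        println("step $k:  x = $x,  |f(x)| = $(abs(f(x)))")
    end
    return x
end

newton(f, fp, 1.0)
```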
Newton's method can break down if the derivative \( f'(x) \) is not invertible (e.g. \( f'(x) = 0 \)) at some step, and with a poor starting guess it may fail to converge at all; there are many examples illustrating the ways in which Newton's method can run into trouble.
Multidimensional Functions
Newton's method extends directly to multidimensional functions \( f : \mathbb{R}^n \to \mathbb{R}^n \), which take a vector input and produce a vector output of the same size. In this case, each Newton step is:
• Linearize \( f(x) \) near some \( x \) using the first-derivative approximation \( f(x+\delta x) \approx f(x) + f'(x)\,\delta x \), where \( f'(x) \) is now the \( n \times n \) Jacobian matrix,
• and then use this to update the value of \( x \) we linearized near, i.e. letting the new \( x \) be \( x_{\text{new}} = x_{\text{old}} - f'(x)^{-1} f(x) \).

[Figure 5: A single step of the scalar Newton's method applied to the nonlinear function \( f(x) = 2\cos(x) - x + x^2/10 \). Starting from the initial guess \( x = 2.3 \), both \( f(x) \) and \( f'(x) \) are used to form a linear approximation of the function, and the next estimate \( x_{\text{new}} \) is the root of that straight line. Newton's method converges rapidly to the exact root (black dot) provided the initial guess is sufficiently close.]
Once we can compute the Jacobian, each step requires solving an \( n \times n \) linear system, and near the root the convergence is extremely rapid, asymptotically doubling the number of correct digits at every iteration ("quadratic convergence"). However, the starting guess for \( x \) must be sufficiently close to the root, or the algorithm may fail to converge or behave unpredictably; this is one very practical reason why Jacobians and derivatives matter in numerical methods.
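To make this concrete, here is a minimal sketch (a made-up 2×2 example, not from the notes; the system, the helper `newton`, and the starting guess are my own choices) of the multidimensional Newton iteration, where each step solves a small linear system with the Jacobian:

```julia
using LinearAlgebra

# Toy example: find an intersection of the unit circle with the curve y = x³.
f(x) = [x[1]^2 + x[2]^2 - 1,
        x[2] - x[1]^3]
J(x) = [2*x[1]     2*x[2];           # hand-coded 2×2 Jacobian matrix f′(x)
        -3*x[1]^2  1.0]

function newton(f, J, x; steps = 8)
    for _ in 1:steps
        x = x - J(x) \ f(x)          # one linear solve per step: x ← x − f′(x)⁻¹ f(x)
    end
    return x
end

xr = newton(f, J, [1.0, 1.0])
println("root ≈ ", xr, ",   ‖f(root)‖ = ", norm(f(xr)))
```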
Optimization
Nonlinear Optimization
Large-scale differentiation plays a crucial role in nonlinear optimization, particularly in machine learning. When minimizing or maximizing a scalar-valued function \( f : \mathbb{R}^n \to \mathbb{R} \), such as the loss function of a neural network with millions of parameters, the goal is to reduce \( f \) by moving "downhill" in the direction of steepest descent, \( -\nabla f \). Remarkably, even with a million parameters we can adjust all of them simultaneously in this direction, and the computational cost of calculating all of the derivatives is comparable to that of evaluating the function itself.
[Figure: Steepest-descent minimization of a function of two variables, showing the contours of \( f(x) \), the sequence of steepest-descent steps, and the minimum.]
The steepest-descent algorithm minimizes a function \( f(x) \) by taking successive steps in the direction of \( -\nabla f \), as illustrated above for a quadratic function of two variables. However, the method can converge slowly by "zig-zagging" along narrow valleys, a difficulty that can be addressed with refinements such as momentum terms and second-derivative information. Combined with reverse-mode/backpropagation gradients, such methods make large-scale optimization practical, from training neural networks to optimizing airplane-wing shapes and investment portfolios.
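To make the idea concrete, here is a minimal sketch (an illustrative quadratic, not the example plotted above; the matrix, step size, and helper name are mine) of steepest descent with a fixed step size. The ill-conditioned quadratic creates a long, narrow valley: the steep coordinate overshoots and oscillates while the shallow coordinate creeps slowly toward the minimum.

```julia
using LinearAlgebra

# Steepest descent on f(x) = ½ xᵀ A x, whose gradient is ∇f(x) = A x.
A = [1.0   0.0;
     0.0  10.0]               # ill-conditioned ⇒ a long, narrow valley
f(x)     = 0.5 * dot(x, A * x)
gradf(x) = A * x

function steepest_descent(x; η = 0.15, steps = 100)
    for _ in 1:steps
        x = x - η * gradf(x)  # step "downhill" along −∇f with step size ("learning rate") η
    end
    return x
end

xmin = steepest_descent([10.0, 1.0])   # slowly approaches the minimum at the origin
println(xmin, "   f(xmin) = ", f(xmin))
```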
In practice, nonlinear optimization involves many complications that go far beyond what we can cover in one lecture (or even an entire course). A few illustrative examples:
• In machine learning, the size of the step in the downhill direction is known as the "learning rate," and choosing it well is crucial for convergence: larger steps make faster progress, but a step that is too large can make things worse, since the gradient is only a local approximation of the function. There are many strategies for choosing the step size, for example:
– Line search: using a 1D minimization to determine how far to step (a minimal backtracking sketch is given after this list).
– A "trust region" bounding the step size (the region where we trust the derivative-based approximation of \( f \)). There are many techniques to evolve the size of the trust region as the optimization progresses.
• Often there are constraints, such as inequality constraints \( g_k(x) \le 0 \) or equality constraints \( h_k(x) = 0 \), and only points \( x \) satisfying them are "feasible." To move towards the optimal feasible point, one typically uses the gradients \( \nabla f \) and \( \nabla g_k \) (and \( \nabla h_k \)) to approximate ("linearize") the problem.
• As in the figure above, simply going straight downhill can converge slowly in narrow valleys because of zig-zagging. Techniques such as momentum terms and conjugate gradients can help, and a more sophisticated idea is to estimate second-derivative (Hessian) matrices from a sequence of gradient values, as in the famous BFGS algorithm, and then use the Hessian to take approximate Newton steps towards the point where the gradient is zero. (More on Hessians in a later lecture.)
Which techniques work best varies greatly from problem to problem, and there are a huge number of competing algorithms; entire books on optimization algorithms can only survey a fraction of them.
Often the crucial choice is not the algorithm but the mathematical formulation of the problem: the right objective function, constraints, and parameterization. And if your problem has more than a handful of parameters (more than ten or so), it is crucial to supply an analytical gradient, computed efficiently in reverse mode, rather than relying on finite differences.
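As promised above, here is a minimal sketch (a hypothetical helper, not from the notes) of a backtracking line search: starting from a candidate step, it shrinks the step size until the standard "Armijo" sufficient-decrease condition is satisfied. The function name `backtracking_step` and its default parameters are my own choices.

```julia
using LinearAlgebra

# Shrink the candidate step until f actually decreases "enough"
# (the Armijo sufficient-decrease condition).
function backtracking_step(f, gradf, x; η = 1.0, shrink = 0.5, c = 1e-4)
    g = gradf(x)
    while f(x - η * g) > f(x) - c * η * dot(g, g)
        η *= shrink        # the step overshot: the local gradient model was trusted too far
    end
    return x - η * g, η    # new point and the accepted step size
end

# usage with the quadratic example above:
# xnew, η = backtracking_step(f, gradf, [10.0, 1.0])
```

Such a helper can be dropped into the steepest-descent loop above in place of the fixed step size η.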
Engineering/Physical Optimization
Optimization has numerous applications beyond machine learning, including engineering ("physical") optimization. For example, suppose you want to design an airplane wing to maximize its strength. A typical workflow is:
1. You start with some design parameters \( p \), e.g. describing the geometry, materials, forces, or other degrees of freedom.
2. These \( p \) are then used in some physical model(s), such as solid mechanics, chemical reactions, heat transport, electromagnetism, acoustics, etc. For example, you might have a linear model of the form \( A(p)x = b(p) \) for some matrix \( A \) (typically very large and sparse).
3. Solving the physical model yields a solution \( x(p) \): for example, the mechanical stresses, chemical concentrations, temperatures, electromagnetic fields, etc.
4. The physical solution \( x(p) \) is the input to some design objective \( f(x(p)) \) that you want to improve/optimize: for instance, strength, speed, power, or efficiency.
5. To maximize/minimize \( f(x(p)) \), one uses the gradient \( \nabla_p f \), computed using reverse-mode/"adjoint" methods, to update the parameters \( p \) and improve the design.
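Putting steps 1–5 together for a toy linear model (a sketch with a made-up parameterization \( A(p) = A_0 + p_1 A_1 + p_2 A_2 \) and made-up objective, purely for illustration; none of these names come from the notes): one "forward" solve gives \( x(p) \), and one "adjoint" solve with the transposed matrix gives everything needed to assemble the whole gradient \( \nabla_p f \).

```julia
using LinearAlgebra

# Adjoint gradient of a design objective f(x(p)) where x(p) solves A(p)x = b.
n = 5
A0 = Matrix(2.0I, n, n) + 0.1 * randn(n, n)   # toy "physical" model matrices
A1, A2 = randn(n, n), randn(n, n)
b = randn(n)
A(p) = A0 + p[1] * A1 + p[2] * A2
f(x) = sum(abs2, x)                  # design objective; ∇f(x) = 2x

function objective_and_gradient(p)
    x = A(p) \ b                     # "forward" solve of the physical model
    λ = A(p)' \ (2x)                 # "adjoint" solve with the transposed matrix
    g = f(x)
    grad = [-dot(λ, Ak * x) for Ak in (A1, A2)]   # ∂g/∂p_k = −λᵀ (∂A/∂p_k) x
    return g, grad
end

# finite-difference check of the gradient
p, δp = randn(2), 1e-7 * randn(2)
g, grad = objective_and_gradient(p)
println(objective_and_gradient(p + δp)[1] - g, "  ≈  ", dot(grad, δp))
```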
Researchers have used "topology optimization" to design, for example, a chair by optimizing every voxel of the structure (whether or not material is present at that point in space): the optimization discovers not just the optimal shape but the optimal topology (how the material is connected, how many holes there are) to support a given weight while minimizing the material used. Such techniques have also been applied to practical challenges ranging from airplane wings to optical communications.
Reverse-mode “Adjoint” Differentiation
Nonlinear equations
You can also apply adjoint/reverse differentiation to nonlinear equations. For instance, consider the gradient of the scalar function \( g(p) = f(x(p)) \), where \( x(p) \in \mathbb{R}^n \) solves some system of \( n \) equations \( h(p, x) = 0 \in \mathbb{R}^n \). By the chain rule,
\[ h(p, x) = 0 \;\Longrightarrow\; \frac{\partial h}{\partial p}\, dp + \frac{\partial h}{\partial x}\, dx = 0 \;\Longrightarrow\; dx = -\left(\frac{\partial h}{\partial x}\right)^{-1} \frac{\partial h}{\partial p}\, dp. \]
(This is an instance of the Implicit Function Theorem: as long as \( \partial h/\partial x \) is nonsingular, we can locally define a function \( x(p) \) from an implicit equation \( h = 0 \), here by linearization.) Hence,
\[ dg = f'(x)\, dx = -f'(x) \left(\frac{\partial h}{\partial x}\right)^{-1} \frac{\partial h}{\partial p}\, dp. \]
Associating left-to-right yields a single "adjoint" equation, \( (\partial h/\partial x)^T v = f'(x)^T = \nabla_x f \), which lets us compute both \( g \) and \( \nabla g \) with just two solves: a nonlinear "forward" solve for \( x \) and a linear "adjoint" solve for \( v \). All of the derivatives \( \partial g/\partial p_k = -v^T\, \partial h/\partial p_k \) are then obtained by inexpensive dot products. Notably, the linear "adjoint" solve involves the transposed Jacobian \( \partial h/\partial x \), making it comparable in cost to a single Newton step for solving \( h = 0 \) for \( x \); the adjoint problem is therefore usually much cheaper than the forward problem.
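Here is a minimal sketch (a made-up nonlinear system, not from the notes; the names `h`, `dhdx`, and `g_and_grad` are mine) of this recipe: a Newton "forward" solve for \( x(p) \), a single linear "adjoint" solve with the transposed Jacobian, and then cheap products for all of the \( \partial g/\partial p_k \), checked against a finite difference.

```julia
using LinearAlgebra

# x(p) ∈ ℝⁿ solves h(p, x) = x + p .* x.^3 − b = 0, and g(p) = f(x(p)) with f(x) = sum(x).
n = 4
b = rand(n) .+ 1
h(p, x)    = x .+ p .* x.^3 .- b
dhdx(p, x) = I + Diagonal(3 .* p .* x.^2)   # Jacobian ∂h/∂x (diagonal for this toy model)

function g_and_grad(p)
    x = copy(b)
    for _ in 1:20                     # "forward" solve: Newton iteration for h(p, x) = 0
        x -= dhdx(p, x) \ h(p, x)
    end
    g = sum(x)                        # f(x) = sum(x), so ∇ₓf = ones(n)
    v = dhdx(p, x)' \ ones(n)         # single linear "adjoint" solve (transposed Jacobian)
    grad = -v .* x.^3                 # ∂g/∂p_k = −vᵀ (∂h/∂p_k), since ∂h/∂p = Diagonal(x.^3)
    return g, grad
end

p, δp = rand(n), 1e-7 * rand(n)
g, grad = g_and_grad(p)
println(g_and_grad(p + δp)[1] - g, "  ≈  ", dot(grad, δp))
```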
Adjoint methods and AD
Understanding adjoint methods is essential even when using automatic differentiation (AD) systems, as it helps determine when to apply forward- versus reverse-mode AD. Many physical models rely on extensive software packages, developed over many years in various programming languages, that may not support automatic differentiation; in such cases, providing a "vector–Jacobian product" for that piece allows AD to differentiate the remaining components. Additionally, since many models involve approximate calculations, AD tools may inefficiently differentiate through the errors of those approximations. For instance, when a nonlinear system is solved by an iterative method like Newton's, a manually written derivative rule based on the implicit-function theorem can be far more efficient than naive AD, which would differentiate through all of the iterations.
Adjoint-method example
We conclude this section with an example demonstrating the efficient computation of a derivative using an "adjoint" method. Readers are encouraged to attempt the problem before reviewing the solution.
Problem 38: Suppose that \( A(p) \) takes a vector \( p \in \mathbb{R}^{n-1} \) and returns the \( n \times n \) tridiagonal real-symmetric matrix
\[ A(p) = \begin{pmatrix} a_1 & p_1 & & & \\ p_1 & a_2 & p_2 & & \\ & p_2 & a_3 & \ddots & \\ & & \ddots & \ddots & p_{n-1} \\ & & & p_{n-1} & a_n \end{pmatrix}, \]
where \( a \in \mathbb{R}^n \) is a fixed vector of diagonal entries.
Now define the scalar-valued function \( g(p) = \left(c^T A(p)^{-1} b\right)^2 \) for some constant vectors \( b, c \in \mathbb{R}^n \) (assuming \( p \) and \( a \) are such that \( A \) is invertible). Note that, in practice, \( A(p)^{-1} b \) is not computed by explicitly inverting the matrix \( A \): instead it can be computed in roughly \( \Theta(n) \) arithmetic operations by Gaussian elimination that exploits the "sparsity" of \( A \) (the pattern of zero entries), a "tridiagonal solve."
(a) Give a formula for \( \partial g/\partial p_1 \) in terms of matrix–vector products and matrix inverses. (Hint: once you have \( dg \) in terms of \( dA \), you can obtain \( \partial g/\partial p_1 \) by "dividing" both sides by \( \partial p_1 \), which replaces \( dA \) with \( \partial A/\partial p_1 \).)
(b) Outline a sequence of steps to compute both \( g \) and \( \nabla g \) (with respect to \( p \)) using only two tridiagonal solves: a "forward" solve \( x = A^{-1} b \) and an "adjoint" solve \( v = A^{-1}(\text{something}) \), plus \( \Theta(n) \) (i.e., roughly proportional to \( n \)) additional arithmetic operations.
(c) Write a program implementing your \( \nabla g \) procedure from part (b) in your favorite language (Julia, Python, Matlab, …). You don't need a fancy tridiagonal solver; you can use a basic matrix library and compute \( A^{-1} \) inefficiently if you want. Check your gradient with a finite-difference test: choose \( a, b, c, p \) at random, and verify that \( (\nabla g)^T \delta p \approx g(p + \delta p) - g(p) \) to a few significant digits for a small randomly chosen \( \delta p \).
Problem 38(a) Solution: From the chain rule and the formula for the differential of a matrix inverse, \( d(A^{-1}) = -A^{-1}\, dA\, A^{-1} \), we have
\[ dg = 2\left(c^T A^{-1} b\right) c^T d(A^{-1})\, b = -2\left(c^T A^{-1} b\right) c^T A^{-1}\, dA\, A^{-1} b \]
(note that \( c^T A^{-1} b \) is a scalar, so we can move it around freely). "Dividing" by \( \partial p_1 \) replaces \( dA \) by \( \partial A/\partial p_1 \), which is the matrix that is zero except for 1's in the \( (1,2) \) and \( (2,1) \) entries; hence
\[ \frac{\partial g}{\partial p_1} = -2\left(c^T A^{-1} b\right) c^T A^{-1} \frac{\partial A}{\partial p_1} A^{-1} b = v^T \frac{\partial A}{\partial p_1} x = v_1 x_2 + v_2 x_1, \]
where we have simplified the result in terms of \( x = A^{-1} b \) and \( v^T = -2\left(c^T A^{-1} b\right) c^T A^{-1} \) for the next part.
Problem 38(b) Solution: Using the fact that \( A^T = A \), we can choose \( v = A^{-1}\left[-2(c^T x)\, c\right] \), which is a single additional tridiagonal solve. Given \( x \) and \( v \) from our two \( \Theta(n) \) tridiagonal solves, each component of the gradient is \( \partial g/\partial p_k = v_k x_{k+1} + v_{k+1} x_k \) for \( k = 1, \ldots, n-1 \), which costs \( \Theta(1) \) arithmetic per \( k \) and hence \( \Theta(n) \) arithmetic to obtain the entire gradient \( \nabla g \).
Problem 38(c) Solution: See the Julia solution notebook (Problem 1) from our IAP 2023 course (which calls the function \( f \) rather than \( g \)).
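For a self-contained illustration (a rough sketch assuming the tridiagonal \( A(p) \) above with fixed diagonal \( a \); the course's notebook may be organized differently, and the helper name `g_and_grad` is mine), the entire recipe from part (b) plus the finite-difference check from part (c) fits in a few lines:

```julia
using LinearAlgebra

# A(p): tridiagonal real-symmetric matrix with fixed diagonal a and off-diagonal p.
A(a, p) = SymTridiagonal(a, p)

function g_and_grad(a, p, b, c)
    M = A(a, p)
    x = M \ b                          # "forward" tridiagonal solve
    v = M \ (-2 * dot(c, x) * c)       # "adjoint" tridiagonal solve (uses Aᵀ = A)
    g = dot(c, x)^2                    # g(p) = (cᵀ A(p)⁻¹ b)²
    grad = [v[k] * x[k+1] + v[k+1] * x[k] for k in 1:length(p)]   # Θ(n) extra work
    return g, grad
end

# finite-difference check, as in part (c)
n = 6
a = randn(n) .+ 3                      # diagonal shifted away from 0 so A is (almost surely) invertible
b, c, p = randn(n), randn(n), randn(n - 1)
δp = 1e-7 * randn(n - 1)
g, grad = g_and_grad(a, p, b, c)
println(g_and_grad(a, p + δp, b, c)[1] - g, "  ≈  ", dot(grad, δp))
```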
7 Derivative of Matrix Determinant and Inverse