
All important cheat sheets for data science, ML, RL, AI




DOCUMENT INFORMATION

Basic information

Format
Pages: 33
Size: 11.53 MB


Structure

  • 1.1 Overview
  • 1.2 Types of layer
  • 1.3 Filter hyperparameters
  • 1.4 Tuning hyperparameters
  • 1.5 Commonly used activation functions
  • 1.6 Object detection
    • 1.6.1 Face verification and recognition
    • 1.6.2 Neural style transfer
    • 1.6.3 Architectures using computational tricks
  • 2.1 Overview
  • 2.2 Handling long term dependencies
  • 2.3 Learning word representation
    • 2.3.1 Motivation and notations
    • 2.3.2 Word embeddings
  • 2.4 Comparing words
  • 2.5 Language model
  • 2.6 Machine translation
  • 2.7 Attention
  • 3.1 Data processing
  • 3.2 Training a neural network
    • 3.2.1 Definitions
    • 3.2.2 Finding optimal weights
  • 3.3 Parameter tuning
    • 3.3.1 Weights initialization
    • 3.3.2 Optimizing convergence
  • 3.4 Regularization
  • 3.5 Good practices

Contents

DATA 1010, Brown University (Samuel S. Watson)


All important cheat sheets

25/03/2021 COMPILED BY ABHISHEK PRASAD

Follow me on LinkedIn: www.linkedin.com/in/abhishek-prasad-ap


Sets and Functions: Sets

1. A set is an unordered collection of objects. The objects in a set are called elements.

2. The cardinality of a set is the number of elements it contains. The empty set ∅ is the set with no elements.

3. If every element of A is also an element of B, then we say A is a subset of B and write A ⊂ B. If A ⊂ B and B ⊂ A, then we say that A = B.

5. Two sets A and B are disjoint if A ∩ B = ∅ (in other words, if they have no elements in common).

6. A partition of a set is a collection of nonempty disjoint subsets whose union is the whole set.

7. The Cartesian product of A and B is A × B = {(a, b) : a ∈ A and b ∈ B}.

8. (De Morgan's laws) If A, B ⊂ Ω, then (i) (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ, and (ii) (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ.

9. A list is an ordered collection of finitely many objects. Lists differ from sets in that (i) order matters, (ii) repetition matters, and (iii) the cardinality is restricted to be finite.
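These identities are easy to sanity-check with a language's built-in set type. A minimal Python sketch (the universe Ω and the sets A and B are made-up examples):

```python
# Check De Morgan's laws and the Cartesian product on a small universe.
Omega = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3}
B = {3, 4}

def complement(S):
    return Omega - S

# (A ∩ B)^c = A^c ∪ B^c  and  (A ∪ B)^c = A^c ∩ B^c
assert complement(A & B) == complement(A) | complement(B)
assert complement(A | B) == complement(A) & complement(B)

# A × B = {(a, b) : a ∈ A and b ∈ B}; its cardinality is |A| · |B|
product = {(a, b) for a in A for b in B}
assert len(product) == len(A) * len(B)
```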

Sets and Functions: Functions

1. If A and B are sets, then a function f : A → B is an assignment of some element of B to each element of A.

2. The set A is called the domain of f and B is called the codomain of f.

3. Given a subset A₀ of A, we define the image of A₀ under f, denoted f(A₀), to be the set of elements which are mapped to from some element in A₀.

4. The range of f is the image of the domain of f.

5. The composition of two functions f : A → B and g : B → C is the function g ∘ f which maps a ∈ A to g(f(a)) ∈ C.

6. The identity function on a set A is the function f : A → A which maps each element to itself.

7. A function f is injective if no two elements in the domain map to the same element in the codomain.

8. A function f is surjective if the range of f is equal to the codomain of f.

9. A function f is bijective if it is both injective and surjective. If f is bijective, then the function from B to A that maps b ∈ B to the element a ∈ A that satisfies f(a) = b is called the inverse of f.

10. If f : A → B is bijective, then the function f⁻¹ ∘ f is equal to the identity function on A, and f ∘ f⁻¹ is the identity function on B.
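For finite sets these definitions can be checked directly. A small Python sketch in which a function is represented as a dict (the helper names are illustrative, not from the text):

```python
# A finite function f : A -> B stored as a dict from inputs to outputs.
def is_injective(f):
    # injective: no two inputs share an output
    return len(set(f.values())) == len(f)

def is_surjective(f, codomain):
    # surjective: the range equals the codomain
    return set(f.values()) == set(codomain)

f = {1: "a", 2: "b", 3: "c"}      # a bijection from {1, 2, 3} to {"a", "b", "c"}
assert is_injective(f) and is_surjective(f, {"a", "b", "c"})

# Since f is bijective, it has an inverse, and f⁻¹ ∘ f is the identity on A.
f_inv = {b: a for a, b in f.items()}
assert all(f_inv[f[a]] == a for a in f)
```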

1. A value is a fundamental entity that may be manipulated by a program. Values have types; for example, 5 is an Int and "Hello world!" is a String.

2. A variable is a name used to refer to a value. We can assign the value 5 to a variable x using x = 5.

3. A function performs a particular task. You prompt a function to perform its task by calling it. Values supplied to a function are called arguments. For example, in the function call print(1, 2), the values 1 and 2 are arguments.

4. An operator is a function that can be called in a special way. For example, * is an operator, since we can call the multiplication function with the syntax 3 * 5.

5. A statement is an instruction to be executed (like x = -3).

6. An expression is a combination of values, variables, operators, and function calls that a language interprets and evaluates to a value.

7. A numerical value can be either an integer or a float. The basic operations are +, -, *, and /, and expressions are evaluated according to the order of operations.

8. Numbers can be compared using <, ==, or ≥.

9. Textual data is represented using strings. length(s) returns the number of characters in s. The * operator concatenates strings.

10. A boolean is a value which is either true or false. Booleans can be combined with the operators && (and) and || (or).

13. The scope of a variable is the region in the program where it is accessible. Variables defined in the body of a function are not accessible outside the body of the function.

14. Array is a compound data type for storing lists of objects. Entries of an array may be accessed with square-bracket syntax using an index or using a range object: for example, if A = [-5, 2], then A[2] and A[end] both return 2.

15. An array comprehension can be used to generate new arrays: [k for k = 1:10 if mod(k, 2) == 0].

16. A dictionary encodes a discrete function by storing input-output pairs and looking up input values when indexed. This expression returns [0, 1.0]:

Dict("blue" => [0, 1.0], "red" => [1.0, 0])["blue"]

A while loop alternates between evaluating its conditional expression and its body until the conditional expression returns false. A for loop evaluates its body once for each entry in a given iterator (for example, a range, array, or dictionary). Each value in the iterator is assigned to a loop variable which can be referenced in the body of the loop.

while x > 0
    x -= 1
end

for i = 1:10
    print(i)
end

1. A vector in Rⁿ is a column of n real numbers, also written as [v₁, …, vₙ]. A vector may be depicted as an arrow from the origin in n-dimensional space. The norm of a vector v is the length √(v₁² + ⋯ + vₙ²) of its arrow.

2. The fundamental vector space operations are vector addition and scalar multiplication.

3. A linear combination of a list of vectors v₁, …, vₖ is an expression of the form c₁v₁ + c₂v₂ + ⋯ + cₖvₖ, where c₁, …, cₖ are real numbers. The c's are called the weights of the linear combination.

4. The span of a list L of vectors is the set of all vectors which can be written as a linear combination of the vectors in L.

5. A list of vectors is linearly independent if and only if the only linear combination which yields the zero vector is the one with all weights zero.

6. A vector space is a nonempty set of vectors which is closed under the vector space operations.

7. A list of vectors in a vector space is a spanning list of that vector space if every vector in the vector space can be written as a linear combination of the vectors in that list.

8. A linearly independent spanning list of a vector space is called a basis of that vector space. The number of vectors in a basis of a vector space is called the dimension of the space.

9. A linear transformation L is a function from a vector space V to a vector space W which satisfies L(cv + w) = cL(v) + L(w) for all c ∈ R and v, w ∈ V. These are "flat maps": equally spaced lines are mapped to equally spaced lines or points. Examples: scaling, rotation, projection, reflection.

10. Given two vector spaces V and W, a basis {v₁, …, vₙ} of V, and a list {w₁, …, wₙ} of vectors in W, there exists one and only one linear transformation which maps v₁ to w₁, v₂ to w₂, and so on.

11. The rank of a linear transformation from one vector space to another is the dimension of its range.

12. The null space of a linear transformation is the set of vectors which are mapped to the zero vector by the linear transformation.

13. The rank of a transformation plus the dimension of its null space is equal to the dimension of its domain (the rank-nullity theorem).

Linear Algebra: Matrix Algebra

1. The matrix-vector product Ax is the linear combination of the columns of A with weights given by the entries of x.

2. Linear transformations from Rⁿ to Rᵐ are in one-to-one correspondence with m × n matrices.

3. The identity transformation corresponds to the identity matrix, which has entries of 1 along the diagonal and zero entries elsewhere.

4. Matrix multiplication corresponds to composition of the corresponding linear transformations: AB is the matrix for which (AB)(x) = A(Bx) for all x.

5. An m × n matrix is full rank if its rank is equal to min(m, n).

6. Ax = b has a solution x if and only if b is in the span of the columns of A. If Ax = b does have a solution, then the solution is unique if and only if the columns of A are linearly independent. If Ax = b does not have a solution, then there is a unique vector x which minimizes |Ax − b|².

7. If the columns of a square matrix A are linearly independent, then it has a unique inverse matrix A⁻¹ with the property that Ax = b implies x = A⁻¹b for all x and b.

8. Matrix inversion satisfies (AB)⁻¹ = B⁻¹A⁻¹ if A and B are both invertible.

9. The transpose A′ of a matrix A is defined so that the rows of A′ are the columns of A (and vice versa).

10. The transpose is a linear operator: (cA + B)′ = cA′ + B′ if c is a constant and A and B are matrices.

11. The transpose distributes across matrix multiplication but with an order reversal: (AB)′ = B′A′ if A and B are matrices for which AB is defined.
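The order reversal can be verified numerically. A small pure-Python sketch with matrices as nested lists:

```python
# Verify the order reversal (AB)' = B'A' on a small example.
def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2], [3, 4], [5, 6]]   # 3 × 2
B = [[7, 8, 9], [0, 1, 2]]     # 2 × 3
assert transpose(matmul(A, B)) == matmul(transpose(B), transpose(A))
```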

15. det(AB) = (det A)(det B) and det(A⁻¹) = (det A)⁻¹.

16. A square matrix is invertible if and only if its determinant is nonzero.

Linear Algebra: Orthogonality

1. The dot product of two vectors in Rⁿ is defined by x · y = x₁y₁ + x₂y₂ + ⋯ + xₙyₙ.

2. x · y = ‖x‖‖y‖ cos θ, where x, y ∈ Rⁿ and θ is the angle between the vectors.

3. x · y = 0 if and only if x and y are orthogonal.

4. The dot product is linear: x · (cy + z) = cx · y + x · z.

5. The orthogonal complement of a subspace V ⊂ Rⁿ is the set of vectors which are orthogonal to every vector in V.

6. The orthogonal complement of the span of the columns of a matrix A is equal to the null space of A′.

7. rank A = rank A′A for any matrix A.

8. A list of vectors satisfying vᵢ · vⱼ = 0 for i ≠ j is orthogonal. An orthogonal list of unit vectors is orthonormal.

9. Every orthogonal list is linearly independent.

10. A matrix U has orthonormal columns if and only if U′U = I. A square matrix with orthonormal columns is called orthogonal. An orthogonal matrix and its transpose are inverses.

11. Orthogonal matrices represent rigid transformations (ones which preserve lengths and angles).

12. If U has orthonormal columns, then UU′ is the matrix which represents projection onto the span of the columns of U.

Linear Algebra: Spectral Analysis

1. An eigenvector v of an n × n matrix A is a nonzero vector with the property that Av = λv for some λ ∈ R. We call λ an eigenvalue. If v is an eigenvector of A, then A maps the line span({v}) to itself.


3. Not every n × n matrix A has n linearly independent eigenvectors. If A does have n linearly independent eigenvectors, we can make a matrix V with these eigenvectors as columns and get A = VΛV⁻¹, where Λ is the diagonal matrix of eigenvalues. If A is symmetric, V can be taken to be an orthogonal matrix (the spectral theorem).

6. A symmetric matrix is positive semidefinite if its eigenvalues are all nonnegative. We define the square root of a positive semidefinite matrix A = VΛV′ to be V√Λ V′, where √Λ is obtained by applying the square root function elementwise.

Linear Algebra: SVD

1. The Gram matrix A′A of any m × n matrix A is positive semidefinite. Furthermore, |√(A′A) x| = |Ax| for all x ∈ Rⁿ.

2. The singular value decomposition is the factorization of any rectangular m × n matrix A as UΣV′, where U and V are orthogonal and Σ is an m × n diagonal matrix (with diagonal entries in decreasing order).

3. The diagonal entries of Σ are the singular values of A, and the columns of U and V are called left singular vectors and right singular vectors, respectively. A maps each right singular vector vᵢ to the corresponding left singular vector uᵢ scaled by σᵢ.

4. The vectors in Rⁿ stretched the most by A are the ones which run in the direction of the column or columns of V corresponding to the greatest singular value. Same for least.

5. For k ≥ 1, the k-dimensional vector space with minimal sum of squared distances to the columns of A (interpreted as points in Rᵐ) is the span of the first k columns of U.

6. The absolute value of the determinant of a square matrix is equal to the product of its singular values.

1. A sequence of real numbers (xₙ) = x₁, x₂, … converges to a number x ∈ R if the distance from xₙ to x on the number line can be made as small as desired by choosing n sufficiently large. We say limₙ→∞ xₙ = x or xₙ → x.

2. (Squeeze theorem) If aₙ ≤ bₙ ≤ cₙ for all n ≥ 1 and if limₙ→∞ aₙ = limₙ→∞ cₙ = b, then bₙ → b as n → ∞.

3. (Comparison test) If ∑ₙ₌₁^∞ bₙ converges and if |aₙ| ≤ bₙ for all n, then ∑ₙ₌₁^∞ aₙ converges.

4. The geometric series ∑ₙ₌₀^∞ aⁿ converges if and only if −1 < a < 1.

5. The Taylor series, centered at c, of an infinitely differentiable function f is defined to be

f(c) + f′(c)(x − c) + (f″(c)/2!)(x − c)² + (f‴(c)/3!)(x − c)³ + ⋯

8. Given f : Rⁿ → Rᵐ, we define ∂f/∂x to be the matrix whose (i, j)th entry is ∂fᵢ/∂xⱼ. Then (i) ∂(Ax)/∂x = A, (ii) ∂(x′A)/∂x = A′, and (iii) ∂(u′v)/∂x = u′ ∂v/∂x + v′ ∂u/∂x.

9. A function of two variables is differentiable at a point if its graph looks like a plane when you zoom in sufficiently around the point. More generally, a function f : Rⁿ → Rᵐ is differentiable at x if it is well-approximated by its derivative near x.

10. The Hessian H of f : Rⁿ → R is the matrix of its second-order derivatives: Hᵢ,ⱼ(x) = ∂²f/∂xᵢ∂xⱼ.

11. If f is continuous on a closed and bounded region D, then:

(i) f realizes an absolute maximum and absolute minimum on D (the extreme value theorem).

(ii) Any point where f realizes an extremum is either a critical point, meaning that ∇f = 0 or f is non-differentiable at that point, or a point on the boundary.

(iii) (Lagrange multipliers) If f realizes an extremum at a point on a portion of the boundary which is the level set of a differentiable function g with non-vanishing gradient ∇g, then either f is non-differentiable at that point or the equation ∇f = λ∇g is satisfied at that point, for some λ ∈ R.

12. If r : R¹ → R² and f : R² → R¹, then the chain rule gives (d/dt) f(r(t)) = ∇f(r(t)) · r′(t).

13. The double integral ∬_D f(x, y) dx dy can be interpreted as the mass of an object occupying the region D and having mass density f(x, y) at each point (x, y).

14. Double integration over D: the bounds for the outer integral are the smallest and largest values of y for any point in D, and the bounds for the inner integral are the smallest and largest values of x for any point in a given "y = constant" slice of the region.

15. Polar integration over D: the outer integral bounds are the least and greatest values of θ for a point in D, and the inner integral bounds are the least and greatest values of r for any point in D along each given "θ = constant" ray. The area element is dA = r dr dθ.

Numerical Computation: machine arithmetic

1. Computers store numerical values as sequences of bits. The type of a numeric value specifies how to interpret the underlying sequence of bits as a number.

2. The Int64 type uses 64 bits to represent the integers from −2^63 to 2^63 − 1. For 0 ≤ n ≤ 2^63 − 1, we represent n using its binary representation, and for 1 ≤ n ≤ 2^63, we represent −n using the binary representation of 2^64 − n. Int64 arithmetic is performed modulo 2^64.

3. The Float64 type uses 64 bits to represent real numbers. We call the first bit σ, the next 11 bits (interpreted as a binary integer) e ∈ [0, 2047], and the final 52 bits f ∈ [0, 2^52 − 1]. If e ∉ {0, 2047}, then the number represented by (σ, e, f) is

x = (−1)^σ · 2^(e−1023) · (1 + f/2^52).

The representable numbers between consecutive powers of 2 are the ones obtained by 52 recursive iterations of binary subdivision. The value of e indicates the powers of 2 that x is between, and the value of f indicates the position of x between those powers of 2. The Float64 exponent value e = 2047 is reserved for Inf and NaN, while e = 0 is reserved for the subnormal numbers.

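The (σ, e, f) decomposition can be inspected directly in any language that exposes the bit pattern. A Python sketch using the standard struct module:

```python
import struct

# Split a Float64 into its sign, exponent, and fraction fields,
# then rebuild x = (-1)^σ · 2^(e-1023) · (1 + f/2^52).
def float64_fields(x):
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sigma = bits >> 63
    e = (bits >> 52) & 0x7FF          # 11 exponent bits
    f = bits & ((1 << 52) - 1)        # 52 fraction bits
    return sigma, e, f

sigma, e, f = float64_fields(6.5)
assert (sigma, e) == (0, 1025)        # 6.5 lies between 2^2 and 2^3
assert (-1) ** sigma * 2 ** (e - 1023) * (1 + f / 2 ** 52) == 6.5
```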

4. The BigInt and BigFloat types use an arbitrary number of bits and can handle very large numbers or very high precision. Computations are much slower than for 64-bit types.

Numerical Computation: Error

1. If Â is an approximation for A, then the relative error is (Â − A)/A.

2. Roundoff error comes from rounding numbers to fit them into a floating point representation.

3. Truncation error comes from using approximate mathematical formulas or algorithms.

4. Statistical error arises from using randomness in an approximation.

5. The condition number of a function measures how it stretches or compresses relative error. The condition number of a problem is the condition number of the map from the problem's initial data a to its solution S(a). For example, the condition number of the map a ↦ a − b blows up as a approaches b, so subtracting b is ill-conditioned near b; this is called catastrophic cancellation.
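Catastrophic cancellation is easy to demonstrate. A quick Python example:

```python
# Subtracting nearby numbers loses relative accuracy.
# 1 + 2^-52 is the Float64 immediately above 1, so this difference is exact:
assert (1.0 + 2.0 ** -52) - 1.0 == 2.0 ** -52

# But 1 + 2^-53 rounds to 1.0, so the entire difference is lost:
assert (1.0 + 2.0 ** -53) - 1.0 == 0.0
```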

9. An algorithm which solves a problem with error much greater than κ·ε_mach (where κ is the problem's condition number and ε_mach is the machine epsilon) is unstable. An algorithm is unstable if at least one of the steps it performs is ill-conditioned. If every step of an algorithm is well-conditioned, then the algorithm is stable.

10. The condition number of a matrix A is defined to be the maximum condition number of the function x ↦ Ax over its domain. The condition number is equal to the ratio of the largest to the smallest singular value of A.

Numerical Computation: PRNGs

1. A pseudorandom number generator (PRNG) is an algorithm for generating a deterministic sequence of numbers which is intended to share properties with a sequence of random numbers. The PRNG's initial value is called its seed.

2. The linear congruential generator: fix positive integers M, a, and c, and consider a seed X₀ ∈ {0, 1, …, M − 1}. We return the sequence X₀, X₁, X₂, …, where Xₙ = mod(aXₙ₋₁ + c, M) for n ≥ 1.

3. The period of a PRNG is the minimum length of a repeating block. A long period is a desirable property of a PRNG, and a very short period is typically unacceptable.

4. Frequency tests check whether blocks of terms appear with the appropriate frequency (for example, we can check whether a₂ₙ > a₂ₙ₋₁ for roughly half of the values of n).
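A linear congruential generator is a few lines of code. The Python sketch below uses the well-known Numerical Recipes parameters M = 2^32, a = 1664525, c = 1013904223, and applies the frequency test described above:

```python
# Linear congruential generator: X_n = (a·X_{n-1} + c) mod M.
def lcg(seed, M=2 ** 32, a=1664525, c=1013904223):
    x = seed
    while True:
        x = (a * x + c) % M
        yield x

gen = lcg(seed=0)
xs = [next(gen) for _ in range(10_000)]

# Frequency test: x_{2n} > x_{2n-1} for roughly half of the values of n.
pairs = len(xs) // 2
greater = sum(xs[2 * n] > xs[2 * n - 1] for n in range(1, pairs))
assert 0.4 < greater / pairs < 0.6
```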

Numerical Computation: Automatic Differentiation

1. A dual number is an object that can be substituted into a function f to yield both the value of the function and its derivative at a point x.

2. If f is a function which can act on matrices, then the matrix [x 1; 0 x] represents a dual number at x, since f([x 1; 0 x]) = [f(x) f′(x); 0 f(x)] (this equation can be checked to hold for f + g and fg whenever it holds for f and g, and it holds for the identity function).

3. To find the derivative of f with automatic differentiation, every step in the computation of f must be dual-number-aware. See the packages ForwardDiff (for Julia) and autograd (for Python).
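A minimal dual-number class makes the idea concrete. This Python sketch tracks (value, derivative) pairs through + and *, which suffices for polynomials:

```python
# Dual numbers: each object carries f(x) and f'(x) through the computation.
class Dual:
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def _wrap(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._wrap(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = self._wrap(other)           # product rule below
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)
    __rmul__ = __mul__

def derivative(f, x):
    return f(Dual(x, 1.0)).deriv            # seed with dx/dx = 1

f = lambda x: 3 * x * x + 2 * x + 1         # f'(x) = 6x + 2
assert derivative(f, 4.0) == 26.0
```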

Numerical Computation: Optimization

1. Gradient descent seeks to minimize f : Rⁿ → R by repeatedly stepping in f's direction of maximum decrease. We begin with a value x₀ ∈ Rⁿ and repeatedly update using the rule xₙ₊₁ = xₙ − ε∇f(xₙ), where ε is the learning rate. We fix a small number τ > 0 and stop when |∇f(xₙ)| < τ.

2. A function is strictly convex if its Hessian is positive definite everywhere. A strictly convex function has at most one local minimum, and any local minimum is also a global minimum. Gradient descent will find the global minimum for a convex function, but for non-convex functions it can get stuck in a local minimum.

3. Algorithms similar to gradient descent but with usually faster convergence: conjugate gradient, BFGS, L-BFGS.
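The update rule is a few lines in any language. A Python sketch on a made-up strictly convex quadratic:

```python
# Gradient descent with learning rate eps and stopping threshold tau.
def gradient_descent(grad, x0, eps=0.05, tau=1e-8, max_steps=100_000):
    x = list(x0)
    for _ in range(max_steps):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < tau:    # stop when |∇f(x)| < τ
            break
        x = [xi - eps * gi for xi, gi in zip(x, g)]  # x ← x − ε∇f(x)
    return x

# f(x, y) = (x − 1)² + 4(y + 2)², minimized at (1, −2).
grad_f = lambda v: [2 * (v[0] - 1), 8 * (v[1] + 2)]
xmin = gradient_descent(grad_f, [0.0, 0.0])
assert abs(xmin[0] - 1) < 1e-6 and abs(xmin[1] + 2) < 1e-6
```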


1. Fundamental principle of counting: If one experiment has m possible outcomes, and if a second experiment has n possible outcomes for each of the outcomes in the first experiment, then there are mn possible outcomes for the pair of experiments.

3. Permutations: There are n!/(n − r)! ordered r-tuples of distinct elements of an n-element set S.

4. Combinations: The number of r-element subsets of an n-element set is (n choose r) = n!/(r!(n − r)!).
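These counts match Python's standard library functions, which makes for a quick check:

```python
import math

# C(n, r) = n!/(r!(n-r)!): the number of r-element subsets.
n, r = 10, 4
assert math.comb(n, r) == math.factorial(n) // (math.factorial(r) * math.factorial(n - r))
assert math.comb(10, 4) == 210

# n!/(n-r)!: the number of ordered r-tuples of distinct elements.
assert math.perm(10, 4) == 10 * 9 * 8 * 7 == 5040
```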

Probability: Probability Spaces

1. Given a random experiment, the set of possible outcomes is called the sample space Ω, like {(H, H), (H, T), (T, H), (T, T)}.

2. We associate with each outcome ω ∈ Ω a probability mass, denoted m(ω). For example, m((H, T)) = 1/4.

3. In a random experiment, an event is a predicate that can be determined based on the outcome of the experiment (like "first flip turned up heads"). Mathematically, an event is a subset of Ω (like {(H, H), (H, T)}).

4. Basic set operations ∪, ∩, and ᶜ correspond to disjunction, conjunction, and negation of events:

(i) The event that E happens or F happens is E ∪ F.
(ii) The event that E happens and F happens is E ∩ F.
(iii) The event that E does not happen is Eᶜ.

5. If E and F cannot both occur (that is, E ∩ F = ∅), we say that E and F are mutually exclusive or disjoint.

6. If E's occurrence implies F's occurrence, then E ⊂ F.

7. The probability P(E) of an event E is the sum of the probability masses of the outcomes in that event. The domain of P is 2^Ω, the set of all subsets of Ω.

8. The pair (Ω, P) is called a probability space. The fundamental probability space properties are

(i) P(Ω) = 1 — "something has to happen"
(ii) P(E) ≥ 0 — "probabilities are non-negative"
(iii) P(E ∪ F) = P(E) + P(F) if E and F are mutually exclusive — "probability is additive".

9. Other properties which follow from the fundamental ones:

(i) P(∅) = 0
(ii) P(Eᶜ) = 1 − P(E)
(iii) E ⊂ F ⟹ P(E) ≤ P(F) (monotonicity)
(iv) P(E ∪ F) = P(E) + P(F) − P(E ∩ F) (principle of inclusion-exclusion).

Probability: Random Variables

1. A random variable is a number which depends on the result of a random experiment (one's lottery winnings, for example). Mathematically, a random variable is a function X from the sample space Ω to R.

2. The distribution of a random variable X is the probability measure on R which maps each set A ⊂ R to P(X ∈ A). The probability mass function of the distribution of X may be obtained by pushing forward the probability mass from each ω ∈ Ω: m_X(x) is the sum of m(ω) over all outcomes ω with X(ω) = x.

3. The cumulative distribution function (CDF) of a random variable X is the function F_X(x) = P(X ≤ x).

4. The joint distribution of two random variables X and Y is the probability measure on R² which maps A ⊂ R² to P((X, Y) ∈ A). The probability mass function of the joint distribution is m_(X,Y)(x, y) = P(X = x and Y = y).

Probability: Conditional Probability

1. Given a probability space Ω and an event E ⊂ Ω, the conditional probability measure given E is an updated probability measure on Ω which accounts for the information that the result ω of the random experiment falls in E:

P(F | E) = P(F ∩ E) / P(E).

2. The conditional probability mass function of Y given {X = x} is m_(Y | X=x)(y) = m_(X,Y)(x, y)/m_X(x).

3. Bayes' theorem tells us how to update beliefs in light of new evidence. It relates the conditional probabilities P(A | E) and P(E | A):

P(A | E) = P(E | A)P(A) / P(E) = P(E | A)P(A) / [P(E | A)P(A) + P(E | Aᶜ)P(Aᶜ)].
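A worked number makes the formula concrete. Suppose (hypothetically) P(A) = 0.01, P(E | A) = 0.95, and P(E | Aᶜ) = 0.05:

```python
# Bayes' theorem with the total-probability expansion in the denominator.
p_a = 0.01                                   # prior P(A)
p_e_a, p_e_not_a = 0.95, 0.05                # P(E | A), P(E | A^c)

p_e = p_e_a * p_a + p_e_not_a * (1 - p_a)    # P(E) = 0.059
p_a_e = p_e_a * p_a / p_e                    # posterior P(A | E)

assert abs(p_e - 0.059) < 1e-9
assert abs(p_a_e - 0.161017) < 1e-3          # E raises P(A) from 1% to about 16%
```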

4. Two events E and F are independent if P(E ∩ F) = P(E)P(F).

5. Two random variables X and Y are independent if every pair of events of the form {X ∈ A} and {Y ∈ B} are independent, where A ⊂ R and B ⊂ R.

6. The PMF of the joint distribution of a pair of independent random variables factors as m_(X,Y)(x, y) = m_X(x) m_Y(y).

1. The expectation E[X] (or mean µ_X) of a random variable X is the probability-weighted average of X:

E[X] = ∑_(ω∈Ω) X(ω) m(ω).

2. The expectation E[X] may be thought of as the value of a random game with payout X, or as the long-run average of X over many independent runs of the underlying experiment. The Monte Carlo approximation of E[X] is obtained by simulating the experiment many times and averaging the value of X.
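A quick Monte Carlo sketch in Python, using the sum of two dice (whose exact expectation is 7):

```python
import random

# Approximate E[X] for X = sum of two fair dice by simulating many rolls.
random.seed(0)                       # fixed seed for reproducibility
n = 200_000
total = sum(random.randint(1, 6) + random.randint(1, 6) for _ in range(n))
estimate = total / n

assert abs(estimate - 7.0) < 0.05    # exact expectation is 7
```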

3. The expectation is the center of mass of the distribution of X.

4. The expectation of a function of a discrete random variable (or two random variables) may be expressed in terms of the PMF m_X of the distribution of X (or the PMF m_(X,Y) of the joint distribution of X and Y):

E[g(X)] = ∑_(x∈R) g(x) m_X(x) and E[g(X, Y)] = ∑_((x,y)∈R²) g(x, y) m_(X,Y)(x, y).

5. Expectation is linear: if c ∈ R and X and Y are random variables defined on the same probability space, then E[cX + Y] = cE[X] + E[Y].

6. The variance of a random variable is its average squared deviation from its mean. The variance measures how spread out the distribution of X is. The standard deviation σ(X) is the square root of the variance.

7. Variance satisfies the following properties, if X and Y are independent random variables and a ∈ R: Var(aX) = a² Var(X) and Var(X + Y) = Var(X) + Var(Y).

8. The covariance of two random variables X and Y is the expected product of their deviations from their respective means µ_X = E[X] and µ_Y = E[Y]: Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = E[XY] − E[X]E[Y].

9. The covariance of two independent random variables is zero, but zero covariance does not imply independence.

10. The correlation of two random variables is their normalized covariance: Corr(X, Y) = Cov(X, Y)/(σ(X)σ(Y)) ∈ [−1, 1].

11. The covariance matrix of a vector X = [X₁, …, Xₙ] of random variables defined on the same probability space is defined to be the matrix Σ whose (i, j)th entry is equal to Cov(Xᵢ, Xⱼ). If E[X] = 0, then Σ = E[XX′].

Probability: Continuous Distributions

1. If Ω ⊂ Rⁿ and P(A) = ∫_A f, where f ≥ 0 and ∫_(Rⁿ) f = 1, then we call (Ω, P) a continuous probability space. (Figure: P([a, b]) is the area under the graph of the density f between a and b.)

2. The function f is called a density, because it measures the amount of probability mass per unit volume at each point (2D volume = area, 1D volume = length).

3. If (X, Y) is a pair of random variables whose joint distribution has density f_(X,Y) : R² → R, then the conditional distribution of Y given the event {X = x} has density f_(Y | X=x) defined by

f_(Y | X=x)(y) = f_(X,Y)(x, y) / f_X(x), where f_X(x) = ∫_(−∞)^(∞) f_(X,Y)(x, y) dy is the PDF of X.

4. If a random variable X has density f_X on R, then E[g(X)] = ∫_R g(x) f_X(x) dx.

5. CDF sampling: F⁻¹(U) has CDF F if U is uniform on [0, 1] (that is, f_U = 1_[0,1]).
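CDF sampling in action: for the Exp(λ) distribution, F(x) = 1 − e^(−λx), so F⁻¹(u) = −ln(1 − u)/λ. A Python sketch:

```python
import math
import random

# Draw Exp(λ) samples by pushing Uniform[0,1] draws through F⁻¹.
random.seed(1)
lam = 2.0
samples = [-math.log(1 - random.random()) / lam for _ in range(100_000)]

mean = sum(samples) / len(samples)
assert abs(mean - 1 / lam) < 0.01    # E[Exp(λ)] = 1/λ = 0.5
```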

Probability: Conditional Expectation

1. The conditional expectation of a random variable given an event is the expectation of the random variable calculated with respect to the conditional probability measure given that event: if (X, Y) has PMF m_(X,Y), then

E[Y | X = x] = ∑_(y∈R) y m_(Y | X=x)(y), where m_(Y | X=x)(y) = m_(X,Y)(x, y)/m_X(x).

If (X, Y) has pdf f_(X,Y), then E[Y | X = x] = ∫_R y f_(Y | X=x)(y) dy.

2. The conditional expectation of a random variable Y given another random variable X is obtained by substituting X for x in the expression for the conditional expectation of Y given X = x. Thus E[Y | X] is a random variable.

3. If X and Y are independent, then E[Y | X] = E[Y]. If Z is a function of X, then E[ZY | X] = Z E[Y | X].

4. The law of iterated expectation: E[E[Y | X]] = E[Y].

Probability: Common Distributions

1. Bernoulli (Ber(p)): A weighted coin flip; m(1) = p and m(0) = 1 − p.

2. Binomial (Bin(n, p)): The number of successes in n independent Ber(p) trials; m(k) = (n choose k) p^k (1 − p)^(n−k).


3. Geometric (Geom(p)): Time to first success (1) in a sequence of independent Ber(p)'s.

5. Exponential distribution (Exp(λ)): Limit as n → ∞ of the distribution of 1/n times a Geom(λ/n) random variable.

6. Normal distribution: the approximate distribution of a standardized sum of independent, identically distributed random variables (i.i.d.) with E[X₁] = µ and Var(X₁) = σ² < ∞ (see Central Limit Theorem).

7. Multivariate normal distribution (N(0, Σ)): if Z = [Z₁, …, Zₙ] is a vector of independent N(0, 1) random variables and A is a matrix satisfying AA′ = Σ, then AZ has the N(0, Σ) distribution.

2. A sequence ν₁, ν₂, … of probability measures on Rⁿ converges to a probability measure ν if νₙ(A) → ν(A) for every set A ⊂ Rⁿ with the property that ν(∂A) = 0 (intuitively, two measures are close if they put approximately the same amount of mass in approximately the same places). We say Xₙ converges in distribution to ν if the distribution of Xₙ converges to ν.

3. Chebyshev's inequality: if X is a random variable with variance σ² < ∞, then X differs from its mean by more than k standard deviations with probability at most k⁻²: P(|X − E[X]| > kσ) ≤ 1/k².

6. We define the standardized running sum of X₁, X₂, … to have zero mean and unit variance for all n ≥ 1:

S* = (X₁ + X₂ + ⋯ + Xₙ − nµ) / (σ√n).

7. Central limit theorem: the sequence of standardized sums of an i.i.d. sequence of finite-variance random variables converges in distribution to N(0, 1): for any interval [a, b], we have

P(S* ∈ [a, b]) → ∫_a^b (1/√(2π)) e^(−x²/2) dx.

9. The central limit theorem explains the ubiquity of the normal distribution in statistics: many random quantities may be realized as a sum of a multitude of independent contributions.
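The theorem is easy to see in simulation. Here standardized sums of Uniform[0, 1] draws (µ = 1/2, σ² = 1/12) land in [−1, 1] about as often as a standard normal does:

```python
import random

# Standardized sums S* of n i.i.d. Uniform[0,1] variables.
random.seed(2)
n, trials = 100, 20_000
mu, sigma = 0.5, (1 / 12) ** 0.5

def s_star():
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (sigma * n ** 0.5)

draws = [s_star() for _ in range(trials)]
inside = sum(-1 <= d <= 1 for d in draws) / trials
assert abs(inside - 0.683) < 0.02    # P(N(0,1) ∈ [-1,1]) ≈ 0.683
```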

1. Statistical learning: Given some samples from a probability space with an unknown probability measure, we seek to draw conclusions about the measure.

2. Supervised learning: (X, Y) is drawn from an unknown probability measure P on a product space X × Y, and we aim to predict Y given X, based on an i.i.d. collection of samples from P (the training data).

Example: X = [X₁, X₂], where X₁ is the color of a banana, X₂ is the weight of the banana, and Y is a measure of deliciousness. Values of X₁, X₂, and Y are recorded for many bananas, and they are used to predict Y for other bananas whose X values are known.

3. We call the components of X features, predictors, or input variables, and we call Y the response variable or output variable.

4. A supervised learning problem is a regression problem if Y is quantitative (Y ⊂ R) and a classification problem if Y is a set of labels.

5. To choose a prediction function h : X → Y, we specify (i) a space H of candidate functions, and (ii) a loss (or risk) functional L from H to R. The target function is argmin_(h∈H) L(h).

6. If the loss functional for a regression problem is L(h) = E[(h(X) − Y)²] and H contains r(x) = E[Y | X = x], then r is the target function. If the loss functional for a classification problem is L(h) = E[1_(h(X) ≠ Y)], and H contains G(x) = argmax_c P(Y = c | X = x), then G is the target function.

7. Since P is unknown, we must approximate the target function with a function ĥ whose values can be computed from the training data. A learner is a function which takes a set of training data and returns a prediction function ĥ.

8. The empirical probability measure on X × Y is the measure which assigns a probability mass of 1/n to the location of each training sample (X₁, Y₁), (X₂, Y₂), …, (Xₙ, Yₙ). The empirical risk of a candidate function h is the risk functional evaluated with respect to the empirical measure of the training data. The empirical risk minimizer (ERM) is the function which minimizes empirical risk.

9. Generalization error (or test error) is the difference between empirical risk and the actual value of the risk functional.

10. The ERM can overfit, meaning that test error and L(ĥ) are large despite small empirical risk.

Example: if H is the space of polynomials and no two training samples have the same x values, then there are functions in H which have zero empirical risk. (Figure: the risk minimizer compared with the empirical risk minimizer.)

11. Mitigate overfitting with inductive bias:

(i) Use a restrictive class H of candidate functions.
(ii) Regularize: add a term to the loss functional which penalizes complexity.

12. Inductive bias can lead to underfitting: relevant relations are missed, so both training and test error are larger than necessary. The tension between the costs of high inductive bias and the costs of low inductive bias is called the bias-complexity (or bias-variance) tradeoff.

13. No-free-lunch theorem: all learners are equal on average (over all possible problems), so inductive bias appropriate to a given type of problem is essential to have an effective learner for that type of problem.

Statistical Learning: Kernel density estimation

1. Given n samples X_1, ..., X_n from a distribution with density f on R, we can estimate the PDF of the distribution by placing 1/n units of probability mass in a small pile around each sample.

2. We choose a kernel function D for the shape of each pile, with total mass equal to 1.

3. The width of each pile is specified by a bandwidth λ: D_λ(u) = (1/λ) D(u/λ).

4. The kernel density estimator with bandwidth λ is the sum of the piles at each sample:
f̂_λ(x) = (1/n) Σ_{i=1}^n D_λ(x − X_i).

5. To choose a suitable bandwidth, we seek to minimize the integrated squared error (ISE): L(f̂) = ∫_R (f − f̂)².

6. We approximate the minimizer of L with the minimizer of the cross-validation loss estimator
J(f̂) = ∫_R f̂_λ² − (2/n) Σ_{i=1}^n f̂_λ^(−i)(X_i),
where f̂_λ^(−i) is the KDE with the ith sample omitted.

7. If f is a density on R², then we use the product-kernel KDE f̂_λ(x, y) = (1/n) Σ_{i=1}^n D_λ(x − X_i) D_λ(y − Y_i).

8. Stone's theorem says that the ratio of the CV ISE to the optimal-λ ISE converges to 1 in probability as n → ∞. Also, the optimal λ goes to 0 like 1/n^(1/5), and the minimal ISE goes to 0 like 1/n^(4/5).

9. The Nadaraya-Watson nonparametric regression estimator r̂(x) computes E[Y | X = x] with respect to the estimated density f̂_λ. Equivalently, we average the Y_i's, weighted according to horizontal distance from x:
r̂(x) = Σ_{i=1}^n Y_i D_λ(x − X_i) / Σ_{i=1}^n D_λ(x − X_i).
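The estimator in item 4 can be sketched in Python with a Gaussian kernel (the sample values and bandwidth below are purely illustrative):

```python
import math

def gaussian_kernel(u):
    # kernel D: a pile of total mass 1
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def kde(x, samples, lam):
    # f_hat_lambda(x) = (1/n) * sum_i (1/lam) * D((x - X_i) / lam)
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / lam) / lam for xi in samples) / n

samples = [0.0, 0.5, 1.0]
density_mid = kde(0.5, samples, lam=0.5)   # near the samples: high density
density_far = kde(10.0, samples, lam=0.5)  # far from the samples: near zero
```

The bandwidth lam would be chosen by the cross-validation criterion in item 6 rather than fixed by hand.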


1. Parametric regression uses a family H of candidate functions which is indexed by finitely many parameters.

2. Linear regression uses the set of affine functions: H = {x ↦ β_0 + [β_1, ..., β_p] · x : β_0, ..., β_p ∈ R}.

3. We choose the parameters to minimize a risk function, customarily the residual sum of squares:
RSS(β) = Σ_{i=1}^n (y_i − β_0 − [β_1, ..., β_p] · x_i)² = |y − Xβ|²,
where y = [y_1, ..., y_n], β = [β_0, ..., β_p], and X is an n × (p + 1) matrix whose ith row is a 1 followed by the components of x_i.

4. The RSS minimizer is β̂ = (X′X)⁻¹X′y.

5. We can use the linear regression framework to do polynomial regression, since a polynomial is linear in its coefficients: we supplement the list of regressors with products of the original regressors.
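The closed-form minimizer β̂ = (X′X)⁻¹X′y can be checked on a toy dataset (values below are illustrative; we solve the normal equations rather than inverting explicitly):

```python
import numpy as np

# toy data generated from y = 2 + 3x (no noise, so recovery is exact)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 + 3 * x
X = np.column_stack([np.ones_like(x), x])  # ith row: a 1 followed by x_i
beta = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations X'X b = X'y
```

With noisy data the same two lines return the least-squares fit instead of the exact coefficients.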

Statistical Learning: Optimal classification

1. Consider a classification problem with feature set X and class set Y. For each y ∈ Y, we define p_y = P(Y = y) and let f_y be the conditional PMF or PDF of X given {Y = y} (y's class conditional distribution).

2. Given a prediction function (or classifier) h and an enumeration of the elements of Y as {y_1, y_2, ..., y_|Y|}, we define the (normalized) confusion matrix of h to be the |Y| × |Y| matrix whose (i, j)th entry is P(h(X) = y_i | Y = y_j).

3. If Y = {−, +}, the conditional probability of correct classification given a positive sample is the detection rate (DR), while the conditional probability of incorrect classification given a negative sample is the false alarm rate (FAR).

4. The precision of a classifier is the conditional probability that a sample is positive given that the classifier predicts positive, and recall is a synonym of detection rate.

5. The Bayes classifier G(x) = argmax_y p_y f_y(x) minimizes the misclassification probability but gives equal weight to both types of misclassification.

6. The likelihood ratio test generalizes the Bayes classifier by allowing a variable tradeoff between false-alarm rate and detection rate: given t > 0, we say h_t(x) = −1 if f_+(x)/f_−(x) < t and h_t(x) = 1 otherwise.

7. The Neyman-Pearson lemma says that no classifier does better on both false alarm rate and detection rate than h_t.

8. The receiver operating characteristic of h_t is the curve {(FAR(h_t), DR(h_t)) : t ∈ [0, ∞]}. The AUROC (area under the ROC) is close to 1 for an excellent classifier and close to 1/2 for a worthless one. The Neyman-Pearson lemma says that no classifier lies above the ROC. We choose a point on the ROC curve based on context-specific considerations.

1. Quadratic discriminant analysis (QDA) is a classification algorithm which uses the training data to estimate the mean μ_y and covariance matrix Σ_y of each class conditional distribution:
μ̂_y = mean({x_i : y_i = y})
Σ̂_y = mean({(x_i − μ̂_y)(x_i − μ̂_y)′ : y_i = y}).
Each distribution is assumed to be multivariate normal (N(μ̂_y, Σ̂_y)) and the classifier h(x) = argmax_y p̂_y f̂_y(x) is proposed (where {p̂_y : y ∈ Y} are the class proportions from the training data).

2. Linear discriminant analysis (LDA) is the same as QDA except the class covariance matrices are assumed to be equal and are estimated using all of the data, not just class-specific samples.

3. QDA and LDA are so named because they yield class prediction boundaries which are quadric surfaces and hyperplanes, respectively.

4. A Naive Bayes classifier assumes that the features are conditionally independent given Y:
f_y(x_1, ..., x_p) = f_{y,1}(x_1) ··· f_{y,p}(x_p), for some f_{y,1}, ..., f_{y,p}.

5. Example assumption-satisfying data sets: [figures for QDA, LDA, and Naive Bayes]

Statistical Learning: Logistic regression

1. Logistic regression for binary classification estimates r(x) = P(Y = 1 | X = x) as a logistic function of a linear function of x:
r̂(x) = σ(α + β · x), where σ(u) = 1/(1 + e^{−u}).

2. We choose α and β to minimize the cross-entropy loss
L(α, β) = −Σ_i [y_i log r̂(x_i) + (1 − y_i) log(1 − r̂(x_i))],
which applies a large penalty if y_i = 1 and r̂(x_i) is close to zero or if y_i = 0 and r̂(x_i) is close to 1.
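A minimal sketch of fitting r̂ by gradient descent on this loss, with an illustrative one-dimensional toy dataset:

```python
import math

# toy 1-D data: class 1 tends to have larger x
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

def sigmoid(u):
    return 1 / (1 + math.exp(-u))

alpha, beta = 0.0, 0.0
lr = 0.5  # step size
for _ in range(2000):
    # gradient of the cross-entropy loss: residual (r_hat - y), times x for beta
    ga = sum(sigmoid(alpha + beta * x) - y for x, y in zip(xs, ys))
    gb = sum((sigmoid(alpha + beta * x) - y) * x for x, y in zip(xs, ys))
    alpha -= lr * ga / len(xs)
    beta -= lr * gb / len(xs)
```

After training, sigmoid(alpha + beta * x) is close to 1 for large x and close to 0 for small x.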

3. L is convex, so it can be reliably minimized using numerical optimization algorithms.

Statistical Learning: Support vector machines

1. A support vector machine (SVM) chooses a hyperplane H ⊂ R^p and predicts classification (Y = {−1, +1}) based on which side of H the feature vector x lies on.

2. x ↦ sgn(β · x − α) is the prediction function.

3. β and α are chosen to minimize a penalized hinge loss of the form
L(β, α) = λ|β|² + (1/n) Σ_i [1 − y_i(β · x_i − α)]_+,
where [u]_+ denotes max(0, u), the positive part of u.

4. The parameters β and α encode both H and a distance, called the margin, from H to a parallel hyperplane where we begin penalizing for lack of decisively correct classification. The margin is 1/|β| (and can be adjusted without changing H by scaling β and α).

5. If λ is small, then the optimization prioritizes the correctness term and uses a small margin if necessary. If λ is large, the optimization must minimize a large-margin incorrectness penalty. A value for λ may be chosen by cross-validation.

6. Kernelization: mapping the feature vectors to a higher dimensional space allows us to find nonlinear separating surfaces in the original feature space.

Statistical Learning: Neural networks

1. A neural network function N : R^p → R^q is a composition of affine transformations and componentwise applications of a function K : R → R.
(i) We call K the activation function. Common choices:
(a) u ↦ max(0, u) (rectifier, or ReLU)
(b) u ↦ 1/(1 + e^{−u}) (logistic)
(ii) Componentwise application of K on R^t refers to the function K.(x_1, ..., x_t) = (K(x_1), ..., K(x_t)).
(iii) An affine transformation from R^t to R^s is a map of the form A(x) = Wx + b, where W is an s × t matrix and b ∈ R^s. Entries of W are called weights and entries of b are called biases.

[Figure: a small example network carrying an input x_i ∈ R^p through alternating affine maps and activations to a cost value]

2. The architecture of a neural network is the sequence of dimensions of the domains and codomains of its affine maps. For example, a neural net with W_1 ∈ R^{5×3}, W_2 ∈ R^{4×5}, and W_3 ∈ R^{1×4} has architecture [3, 5, 4, 1].

3. Given training samples {(x_i, y_i)}_{i=1}^n, we obtain a neural net regression function by minimizing L(N) = Σ_{i=1}^n C(N(x_i), y_i), where C(y, y_i) = |y − y_i|².

4. For classification, we
(i) let y_i = [0, ..., 0, 1, 0, ..., 0] ∈ R^{|Y|}, with the location of the nonzero entry indicating class (this is called one-hot encoding),
(ii) replace the identity map in the diagram with the softmax function u ↦ [e^{u_j} / Σ_k e^{u_k}]_{j=1}^{|Y|}, and
(iii) replace the cost function with C(y, y_i) = −log(y · y_i).
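A forward pass for the [3, 5, 4, 1] architecture above can be sketched as follows (random weights for illustration; biases initialized to zero):

```python
import numpy as np

def relu(u):
    return np.maximum(0, u)

# architecture [3, 5, 4, 1]: W1 is 5x3, W2 is 4x5, W3 is 1x4
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)
W2, b2 = rng.normal(size=(4, 5)), np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)

def forward(x):
    a1 = relu(W1 @ x + b1)   # first activation
    a2 = relu(W2 @ a1 + b2)  # second activation
    return W3 @ a2 + b3      # identity output layer (regression)

out = forward(np.array([1.0, -2.0, 3.0]))
```

Backpropagation, sketched in item 5, would reuse the stored activations a1 and a2 to compute gradients.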

5. When the weight matrices are large, they have many parameters to tune. We use a custom optimization scheme:
(i) Start with random weights and a training input x_i.
(ii) Forward propagation: apply each successive map and store the vector produced at each node. The vectors stored after each application of K. are called activations.
(iii) Backpropagation: starting with the last node and working left, compute the change in cost per small change in the vector at each node. By the chain rule, each such gradient is equal to the gradient computed at the right-adjacent node times the derivative of the map between the two nodes. The derivative of A_j is W_j, and the derivative of K. is v ↦ diag(dK/du) v.
(iv) Compute the change in cost per small change in the weights and biases of each affine map. Each such gradient is equal to the gradient stored at the next node times the derivative of the intervening affine map. We have ∂(Wx + b)/∂b = I, and contracting ∂(Wx + b)/∂W against a downstream gradient v gives vx′.
(v) Stochastic gradient descent: repeat (ii)-(iv) for each sample in a randomly chosen subset of the training set and determine the average desired change in weights and biases to reduce the cost function. Update the weights and biases accordingly and iterate to convergence.

Statistical Learning: Dimension reduction

1. The goal of dimension reduction is to map a set of n points in R^p to a lower-dimensional space R^k while retaining as much of the data's structure as possible.

2. Dimension reduction can be used as a visualization aid or as a feature pre-processing step in a machine learning model.

3. Structure may be taken to mean variation about the center, in which case we use principal component analysis:
(i) Store the points' components in an n × p matrix,
(ii) de-mean each column,
(iii) compute the SVD UΣV′ of the resulting matrix, and
(iv) let W be the first k columns of V.
Then WW′ : R^p → R^p is the rank-k projection matrix which minimizes the sum of squared projection distances of the points, and W′ : R^p → R^k maps each point to its coordinates in that k-dimensional subspace (with respect to the columns of W).
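Steps (i)-(iv) can be sketched with NumPy (the random data below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))   # (i) 100 points in R^5, one per row
A = A - A.mean(axis=0)          # (ii) de-mean each column
U, S, Vt = np.linalg.svd(A, full_matrices=False)  # (iii) SVD: A = U S V'
k = 2
W = Vt[:k].T                    # (iv) first k columns of V
coords = A @ W                  # W' maps each point to its coordinates in R^k
```

The columns of W are orthonormal, so A @ W is an orthogonal projection onto the top-variance subspace.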

[Figure: first two principal components of MNIST handwritten digit images]

4. Structure may be taken to mean pairwise proximity of points, which stochastic neighbor embedding attempts to preserve. Given the data points x_1, ..., x_n and a parameter ρ called the perplexity of the model, we define
P_{i,j}(σ) = e^{−|x_i − x_j|²/(2σ²)} / Σ_{k≠j} e^{−|x_k − x_j|²/(2σ²)},
and set P_{i,j} = (1/2n)(P_{i,j}(σ_j) + P_{j,i}(σ_i)), which describes the similarity of x_i and x_j. Given y_1, ..., y_n in R^k, we define an analogous similarity matrix for the embedded points and choose the y_i's to make it match P as closely as possible.

[Figure: stochastic neighbor embedding of MNIST handwritten digit images]

Statistics: Point estimation

1. The central problem of statistics is to make inferences about a population or data-generating process based on the information in a finite sample drawn from the population.

2. Parametric estimation involves an assumption that the distribution of the data-generating process comes from a family of distributions parameterized by finitely many real numbers, while nonparametric estimation does not. Examples: QDA is a parametric density estimator, while kernel density estimation is nonparametric.

3. Point estimation is the inference of a single real-valued feature of the distribution of the data-generating process (such as its mean, variance, or median).

4. A statistical functional is any function T from the set of distributions to [−∞, ∞]. An estimator θ̂ is a random variable defined in terms of n i.i.d. random variables, the purpose of which is to approximate some statistical functional of the random variables' common distribution. Example: suppose that T(ν) = the mean of ν, and that θ̂ = (X_1 + ··· + X_n)/n.

5. The empirical measure ν̂ of X_1, ..., X_n is the probability measure which assigns mass 1/n to each sample's location. The plug-in estimator of θ = T(ν) is obtained by applying T to the empirical measure: θ̂ = T(ν̂).

6. Given a distribution ν and a statistical functional T, let θ = T(ν). The bias of an estimator of θ is the difference between the estimator's expected value and θ. Example: the expectation of the sample mean θ̂ = (X_1 + ··· + X_n)/n is E[(X_1 + ··· + X_n)/n] = E[ν], so the bias of the sample mean is zero.

10. MSE is equal to variance plus squared bias. Therefore, MSE converges to zero as the number of samples goes to ∞ if and only if variance and bias both converge to zero.

Statistics: Confidence intervals

1. Consider an unknown probability distribution ν from which we get n independent samples X_1, ..., X_n, and suppose that θ is the value of some statistical functional of ν. A confidence interval for θ is an interval-valued function of the sample data X_1, ..., X_n. A confidence interval has confidence level 1 − α if it contains θ with probability at least 1 − α.

2. If θ̂ is unbiased, then (θ̂ − k se(θ̂), θ̂ + k se(θ̂)) is a 1 − 1/k² confidence interval, by Chebyshev's inequality.

3. If θ̂ is unbiased and approximately normally distributed, then (θ̂ − 1.96 se(θ̂), θ̂ + 1.96 se(θ̂)) is an approximate 95% confidence interval, since 95% of the mass of the standard normal distribution is in the interval [−1.96, 1.96].
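A minimal sketch of the normal-approximation 95% interval, using the sample mean as θ̂ (the sample values below are illustrative):

```python
import math

samples = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
n = len(samples)
theta_hat = sum(samples) / n                              # unbiased estimator of the mean
var = sum((x - theta_hat) ** 2 for x in samples) / (n - 1)
se = math.sqrt(var / n)                                   # standard error of the mean
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)       # approximate 95% CI
```

With only 8 samples the normal approximation is rough; a t-based multiplier would widen the interval slightly.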

4. Let I ⊂ R, and suppose that T is a function from the set of distributions to the set of real-valued functions on I. A 1 − α confidence band for T(ν) is a pair of random functions y_min and y_max from I to R, defined in terms of n independent samples from ν and having y_min ≤ T(ν) ≤ y_max everywhere on I with probability at least 1 − α.

Statistics: Empirical CDF convergence

1. Statistics is predicated on the idea that a distribution is well-approximated by independent samples therefrom. The empirical CDF F̂_n computed from n samples converges to the true CDF F in probability.

2. The Dvoretzky-Kiefer-Wolfowitz inequality (DKW) says that the graph of F̂_n lies in the ε-band around the graph of F with probability at least 1 − 2e^{−2nε²}.

Statistics: Bootstrapping

1. Bootstrapping is the use of simulation to approximate the value of the plug-in estimator of a statistical functional which is expressed in terms of independent samples from ν. Example: if θ = T(ν) is the variance of the median of 3 independent samples from ν, then the bootstrap estimate of θ is obtained as a Monte Carlo approximation of T(ν̂): we sample 3 times (with replacement) from {X_1, ..., X_n}, record the median, repeat B times for B large, and take the sample variance of the resulting list of B numbers.

2. The bootstrap approximation of T(ν̂) may be made as close to T(ν̂) as desired by choosing B large enough. The difference between T(ν) and T(ν̂) is likely to be small if n is large (that is, if many samples from ν are available).

3. The bootstrap is useful for computing standard errors, since the standard error of an estimator is often infeasible to compute analytically but conducive to Monte Carlo approximation.
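A sketch of a bootstrap standard error for the sample median (the data values and B below are illustrative):

```python
import random

random.seed(0)
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8]

def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

B = 2000
medians = []
for _ in range(B):
    resample = [random.choice(data) for _ in data]  # sample n times with replacement
    medians.append(median(resample))
m = sum(medians) / B
se_hat = (sum((v - m) ** 2 for v in medians) / B) ** 0.5  # bootstrap SE of the median
```

Increasing B reduces Monte Carlo error but cannot fix the gap between T(ν̂) and T(ν), which shrinks only with more data.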

Statistics: Maximum likelihood estimation

1. Maximum likelihood estimation is a general approach for proposing an estimator. Consider a parametric family {f(x; θ) : θ ∈ R^d} of PDFs or PMFs. Given x ∈ R^n, the likelihood L_x : R^d → R is defined by
L_x(θ) = f(x_1; θ) f(x_2; θ) ··· f(x_n; θ).
If X is a vector of n independent samples drawn from f(x; θ), then L_X(θ) is small or zero when θ is not in accordance with the observed data.
Example: Suppose x ↦ f(x; θ) is the density of a uniform random variable on [0, θ]. We observe four samples drawn from this distribution: 1.41, 2.45, 6.12, and 4.9. Then the likelihood of θ = 5 is zero, and the likelihood of θ = 10⁶ is very small.
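The uniform example can be checked numerically (the helper uniform_likelihood is ours, not part of the text):

```python
def uniform_likelihood(theta, xs):
    # density of Uniform[0, theta]: 1/theta on [0, theta], else 0
    if theta <= 0 or any(x < 0 or x > theta for x in xs):
        return 0.0
    return (1.0 / theta) ** len(xs)

xs = [1.41, 2.45, 6.12, 4.9]
L_5 = uniform_likelihood(5, xs)          # zero: the sample 6.12 exceeds 5
L_million = uniform_likelihood(1e6, xs)  # positive but tiny: (1e-6)^4
```

The likelihood is maximized at theta = max(xs) = 6.12, the smallest value consistent with the data.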

2. The maximum likelihood estimator is θ̂ = argmax_θ L_X(θ). Example: for n independent samples from a normal distribution, the MLEs of the mean and variance are X̄ and (1/n)((X_1 − X̄)² + ··· + (X_n − X̄)²). So the maximum likelihood estimators agree with the plug-in estimators.

3. MLE enjoys several nice properties: under certain regularity conditions, we have (stated for θ ∈ R):
(i) Consistency: E[(θ̂ − θ)²] → 0 as the number of samples goes to ∞.
(ii) Asymptotic normality: (θ̂ − θ)/√(Var θ̂) converges to N(0, 1) as the number of samples goes to ∞.
(iii) Asymptotic optimality: the MSE of the MLE converges to 0 approximately as fast as the MSE of any other consistent estimator.

4. Potential difficulties with the MLE:
(i) Computational challenges. It might be hard to work out where the maximum of the likelihood occurs, either analytically or numerically.
(ii) Misspecification. The MLE may be inaccurate if the distribution of the samples is not in the specified parametric family.
(iii) Unbounded likelihood. If the likelihood function is not bounded, then θ̂ is not well-defined.

Statistics: Hypothesis testing

1. Hypothesis testing is a disciplined framework for adjudicating whether observed data do not support a given hypothesis.

2. Consider an unknown distribution from which we will observe n samples X_1, ..., X_n.
(i) We state a hypothesis H_0, called the null hypothesis, about the distribution.
(ii) We come up with a test statistic T, which is a function of the data X_1, ..., X_n, for which we can evaluate the distribution of T assuming the null hypothesis.
(iii) We give an alternative hypothesis H_a under which T is expected to be significantly different from its value under H_0.
(iv) We give a significance level α (like 5% or 1%), and based on H_a we determine a set of values for T, called the critical region, which T would be in with probability at most α under the null hypothesis.
(v) After setting H_0, H_a, α, T, and the critical region, we run the experiment, evaluate T on the samples we get, and record the result as t_obs.
(vi) If t_obs falls in the critical region, we reject the null hypothesis. The corresponding p-value is defined to be the minimum α-value which would have resulted in rejecting the null hypothesis, with the critical region chosen in the same way.

Example: Muriel Bristol claims that she can tell by taste whether the tea or the milk was poured into the cup first. She is given eight cups of tea, four poured milk-first and four poured tea-first. We posit a null hypothesis that she isn't able to discern the pouring method, under which the number of cups identified correctly is 4 with probability 1/(8 choose 4) = 1/70 ≈ 1.4% and at least 3 with probability 17/70 ≈ 24%. Therefore, at the 5% significance level, only a correct identification of all the cups would give us grounds to reject the null hypothesis. The p-value in that case would be 1.4%.

3. Failure to reject the null hypothesis is not necessarily evidence for the null hypothesis. The power of a hypothesis test is the conditional probability of rejecting the null hypothesis given that the alternative hypothesis is true. A p-value may be high either because the null hypothesis is true or because the test has low power.

4. The Wald test is based on the normal approximation. Consider a null hypothesis θ = 0 and the alternative hypothesis θ ≠ 0, and suppose that θ̂ is approximately normally distributed. The Wald test rejects the null hypothesis at the 5% significance level if |θ̂| > 1.96 se(θ̂).

5. The random permutation test is applicable when the null hypothesis is that the mean of a given random variable is equal for two populations.
(i) We compute the difference between the sample means for the two groups.
(ii) We randomly re-assign the group labels and compute the resulting sample mean differences. Repeat many times.
(iii) We check where the original difference falls in the sorted list of re-sampled differences.

Example: Suppose the heights of the Romero sons are 72, 69, 68, and 66 inches, and the heights of the Larsen sons are 70, 65, and 64 inches. Consider the null hypothesis that the expected heights are the same for the two families, and the alternative hypothesis that the Romero sons are taller on average (with α = 5%). We find that the sample mean difference of about 2.4 inches is larger than 88.5% of the mean differences obtained by resampling many times. Since 88.5% < 95%, we retain the null hypothesis.
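The example can be reproduced with a random permutation test; counting strictly larger resampled differences gives a one-sided p-value near 1 − 0.885 ≈ 0.115:

```python
import random

random.seed(1)
romero = [72, 69, 68, 66]
larsen = [70, 65, 64]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(romero) - mean(larsen)  # about 2.4 inches
pooled = romero + larsen
count = 0
trials = 10000
for _ in range(trials):
    random.shuffle(pooled)              # randomly re-assign family labels
    diff = mean(pooled[:4]) - mean(pooled[4:])
    if diff > observed:
        count += 1
p_value = count / trials                # one-sided p-value
```

Since p_value is well above 0.05, we retain the null hypothesis, matching the text.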

6. If we conduct many hypothesis tests, then the probability of obtaining some false rejections is high (xkcd.com/882). This is called the multiple testing problem. The Bonferroni method is to reject the null hypothesis only for those tests whose p-values are less than α divided by the number of hypothesis tests being run. This ensures that the probability of having even one false rejection is less than α, so it is very conservative.

Statistics: dplyr and ggplot2

1. dplyr is an R package for manipulating data frames. Its core verbs filter rows, sort rows, select columns, add columns, group, and aggregate the columns of a grouped data frame (filter, arrange, select, mutate, group_by, and summarise). For example:
summarise(avgspeed = mean(speed, na.rm = TRUE))

2. ggplot2 is an R package for data visualization. Graphics are built as a sum of layers, which consist of a data frame, a geom, a stat, and a mapping from the data to the geom's aesthetics (like x, y, color, or size). The appearance of the plot can be customized with scales, coords, and themes.


Data Science Cheatsheet 2.0

Last Updated February 13, 2021

Statistics

Discrete Distributions

Binomial Bin(n, p) - number of successes in n events, each

with p probability If n = 1, this is the Bernoulli distribution

Geometric Geom(p) - number of failures before success

Negative Binomial NBin(r, p) - number of failures before r

successes

Hypergeometric HGeom(N, k, n) - number of k successes in

a size N population with n draws, without replacement

Poisson Pois(λ) - number of successes in a fixed time interval,

where successes occur independently at an average rate λ

Continuous Distributions

Normal/Gaussian N (µ, σ), Standard Normal Z ∼ N (0, 1)

Central Limit Theorem - sample mean of i.i.d data

approaches normal distribution

Exponential Exp(λ) - time between independent events
occurring at an average rate λ
Gamma Gamma(n, λ) - time until n independent events
occur at an average rate λ

Hypothesis Testing

Significance Level α - probability of Type 1 error

p-value - probability of getting results at least as extreme as

the current test If p-value < α, or if test statistic > critical

value, then reject the null

Type I Error (False Positive) - null true, but reject

Type II Error (False Negative) - null false, but fail to reject

Power - probability of avoiding a Type II Error, and rejecting

the null when it is indeed false

Z-Test - tests whether population means/proportions are

different Assumes test statistic is normally distributed and is

used when n is large and variances are known If not, then use

a t-test Paired tests compare the mean at different points in

time, and two-sample tests compare means for two groups

ANOVA - analysis of variance, used to compare 3+ samples

with a single test

Chi-Square Test - checks relationship between categorical

variables (age vs income) Or, can check goodness-of-fit

between observed data and expected population distribution

Concepts

Learning

– Supervised - labeled data

– Unsupervised - unlabeled data

– Reinforcement - actions, states, and rewards

Cross Validation - estimate test error with a portion of

training data to validate accuracy and model parameters

– k-fold - divide data into k groups, and use one to validate

– leave-p-out - use p samples to validate and the rest to train

Parametric - assume data follows a function form with a

fixed number of parameters

Non-parametric - no assumptions on the data and an

unbounded number of parameters

Model Evaluation

Prediction Error = Bias² + Variance + Irreducible Noise
Bias - wrong assumptions when training → can't capture underlying patterns → underfit
Variance - sensitive to fluctuations when training → can't generalize on unseen data → overfit

Regression

Mean Squared Error (MSE) = (1/n) Σ(y_i − ŷ)²
Mean Absolute Error (MAE) = (1/n) Σ|y_i − ŷ|
Residual Sum of Squares (RSS) = Σ(y_i − ŷ)²
Total Sum of Squares (TSS) = Σ(y_i − ȳ)²
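These metrics are direct to implement (the toy values below are for illustration):

```python
def mse(y, yhat):
    # mean squared error: (1/n) * sum of squared residuals
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def mae(y, yhat):
    # mean absolute error: (1/n) * sum of absolute residuals
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

y_true = [3, 5, 7]
y_pred = [2, 5, 9]
mse_val = mse(y_true, y_pred)  # (1 + 0 + 4) / 3
mae_val = mae(y_true, y_pred)  # (1 + 0 + 2) / 3
```

MSE penalizes large residuals more heavily than MAE, which is why the two can rank models differently.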

Classification
– Precision = TP / (TP + FP), percent correct when predicting positive
– Recall, Sensitivity = TP / (TP + FN), percent of actual positives identified correctly (True Positive Rate)
Precision-Recall Curve - focuses on the correct prediction of class 1, useful when data or FP/FN costs are imbalanced

Linear Regression - assumptions:
– Linear relationship and independent observations
– Homoscedasticity - error terms have constant variance
– Errors are uncorrelated and normally distributed
– Low multicollinearity

Regularization
Add a penalty for large coefficients to reduce overfitting.
Subset (L0): λ||β̂||_0 = λ · (number of non-zero variables)
– Computationally slow, need to fit 2^k models
– Alternatives: forward and backward stepwise selection
LASSO (L1): λ||β̂||_1 = λ Σ|β̂|
– Coefficients shrunk to zero
Ridge (L2): λ||β̂||_2 = λ Σ(β̂)²
– Reduces effects of multicollinearity
Combining LASSO and Ridge gives Elastic Net. In all cases, as λ grows, bias increases and variance decreases.
Regularization can also be applied to many other algorithms.
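Ridge regression has the closed form β̂ = (X′X + λI)⁻¹X′y, which makes the shrinkage effect of λ easy to see (synthetic data below; penalizing the intercept is omitted for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    p = X.shape[1]
    # closed form: (X'X + lambda I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_small = ridge(X, y, 0.1)    # mild shrinkage
beta_large = ridge(X, y, 100.0)  # strong shrinkage toward zero
```

As λ grows, the coefficient norm shrinks: more bias, less variance.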

Logistic Regression

Predicts probability that Y belongs to a binary class (1 or 0). Fits a logistic (sigmoid) function to the data that maximizes the likelihood that the observations follow the curve. Regularization can be added in the exponent.
P(Y = 1) = 1 / (1 + e^{−(β_0 + βx)})
Odds - output probability can be transformed using Odds(Y = 1) = P(Y = 1) / (1 − P(Y = 1)), where P = 1/3 gives 1:2 odds
Assumptions
– Linear relationship between X and log-odds of Y
– Independent observations
– Low multicollinearity

Decision Trees

Classification and Regression Tree

CART for regression minimizes SSE by splitting data into sub-regions and predicting the average value at leaf nodes. Trees are prone to high variance, so tune through CV.
Hyperparameters
– Complexity parameter, to only keep splits that improve SSE by at least cp (most influential, small cp → deep tree)
– Minimum number of samples at a leaf node
– Minimum number of samples to consider a split

CART for classification minimizes the sum of region impurity, where p̂_i is the probability of a sample being in category i. Possible measures, each with a max impurity of 0.5, include the Gini impurity 1 − Σ(p̂_i)².

Random Forest - trains an ensemble of trees on bootstrapped data; each bootstrap sample contains ∼63% of the data, so the out-of-bag 37% can estimate prediction error without resorting to CV.
Additional Hyperparameters (no cp):
– Number of trees to build
– Number of variables considered at each split
Deep trees increase accuracy, but at a high computational cost. Model bias is always equal to that of one of its individual trees.
Variable Importance - RF ranks variables by their ability to minimize error when split upon, averaged across all trees.

Aaron Wang


Naive Bayes

Classifies data using the label with the highest conditional probability, given data a and classes c. Naive because it assumes variables are independent.
Bayes' Theorem: P(c_i | a) = P(a | c_i) P(c_i) / P(a)
Gaussian Naive Bayes - calculates conditional probability for continuous data by assuming a normal distribution

Support Vector Machines

Separates data between two classes by maximizing the margin

between the hyperplane and the nearest data points of any

class Relies on the following:

Support Vector Classifiers - account for outliers by

allowing misclassifications on the support vectors (points in or

on the margin)

Kernel Functions - solve nonlinear problems by computing

the similarity between points a, b and mapping the data to a

higher dimension Common functions:

– Polynomial (ab + r)d

– Radial e−γ(a−b)2

Hinge Loss - max(0, 1 − y_i(wᵀx_i − b)), where w is the weight vector (its norm sets the margin width), b is the offset bias, and classes are labeled ±1. Note, even a correct prediction inside the margin gives loss > 0.
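A sketch of the hinge loss, illustrating that predictions inside the margin are penalized:

```python
def hinge_loss(y, score):
    # y in {-1, +1}; score = w·x - b
    return max(0, 1 - y * score)

inside_margin = hinge_loss(+1, 0.5)  # correct side but inside margin: loss 0.5
confident = hinge_loss(+1, 2.0)      # decisively correct: loss 0
wrong = hinge_loss(+1, -1.0)         # misclassified: loss 2
```

Only points at or beyond the margin (y * score >= 1) incur zero loss, which is what drives the margin-maximizing behavior.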

k-Nearest Neighbors

Non-parametric method that calculates ˆy using the average

value or most common class of its k-nearest points For

high-dimensional data, information is lost through equidistant

vectors, so dimension reduction is often applied prior to k-NN

Minkowski Distance = (Σ|a_i − b_i|^p)^(1/p)
– p = 1 gives Manhattan distance: Σ|a_i − b_i|
– p = 2 gives Euclidean distance: √(Σ(a_i − b_i)²)

Hamming Distance - count of the differences between two

vectors, often used to compare categorical variables
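The Minkowski family can be sketched as:

```python
def minkowski(a, b, p):
    # (sum of |a_i - b_i|^p)^(1/p)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

manhattan = minkowski([1, 2], [4, 6], 1)  # |3| + |4| = 7
euclidean = minkowski([1, 2], [4, 6], 2)  # sqrt(9 + 16) = 5
```

Larger p weights the largest coordinate difference more heavily; as p → ∞ the distance approaches the Chebyshev (max) distance.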

Clustering - k-means iteratively assigns each point to its nearest of k centroids and recomputes the centroids until assignments stabilize; it is sensitive to noise and outliers.

k-means++ - improves selection of initial clusters

1 Pick the first center randomly

2 Compute distance between points and the nearest center

3. Choose new center using a weighted probability distribution proportional to distance
4. Repeat until k centers are chosen

Evaluating the number of clusters and performance:
Silhouette Value - measures how similar a data point is to its own cluster compared to other clusters, and ranges from 1 (best) to -1 (worst)
Davies-Bouldin Index - ratio of within-cluster scatter to between-cluster separation, where lower values are better

Hierarchical Clustering

Clusters data into groups using a predominant hierarchy.
Agglomerative Approach
1. Each observation starts in its own cluster
2. Iteratively combine the most similar cluster pairs
3. Continue until all points are in the same cluster
Divisive Approach - all points start in one cluster and splits are performed recursively down the hierarchy
Linkage Metrics - measure dissimilarity between clusters and combine them using the minimum linkage value over all pairwise points in different clusters by comparing:
– Single - the distance between the closest pair of points
– Complete - the distance between the farthest pair of points
– Ward's - the increase in within-cluster SSE if two clusters were to be combined
Dendrogram - plots the full hierarchy of clusters, where the height of a node indicates the dissimilarity between its children

Dimension Reduction

Principal Component Analysis

Projects data onto orthogonal vectors that maximize variance. Remember, given an n × n matrix A, a nonzero vector x, and a scalar λ, if Ax = λx then x and λ are an eigenvector and eigenvalue of A. In PCA, the eigenvectors are uncorrelated and represent principal components.
1. Start with the covariance matrix of standardized data
2. Calculate eigenvalues and eigenvectors using SVD or eigendecomposition
3. Rank the principal components by their proportion of variance explained = λ_i / Σλ
For p-dimensional data, there will be p principal components.
Sparse PCA - constrains the number of non-zero values in each component, reducing susceptibility to noise and improving interpretability

Linear Discriminant Analysis

Maximizes separation between classes and minimizes variance within classes for a labeled dataset.
1. Compute the mean and variance of each independent variable for every class
2. Calculate the within-class (σ²_w) and between-class (σ²_b) variance
3. Find the matrix W = (σ²_w)⁻¹(σ²_b) that maximizes Fisher's signal-to-noise ratio
4. Rank the discriminant components by their signal-to-noise ratio λ
Assumptions
– Independent variables are normally distributed
– Homoscedasticity - constant variance of error
– Low multicollinearity

Factor Analysis

Describes data using a linear combination of k latent factors. Given a normalized matrix X, it follows the form X = Lf + ε, with factor loadings L and hidden factors f.
Assumptions
– E(X) = E(f) = E(ε) = 0
– Cov(f) = I → uncorrelated factors
– Cov(f, ε) = 0
Since Cov(X) = Cov(Lf) + Cov(ε), then Cov(Lf) = LL′.
Scree Plot - graphs the eigenvalues of factors (or principal components) and is used to determine the number of factors to retain. The 'elbow' where values level off is often used as the cutoff.


Natural Language Processing

Transforms human language into machine-usable code

Processing Techniques

– Tokenization - splitting text into individual words (tokens)

– Lemmatization - reduces words to its base form based on

dictionary definition (am, are, is → be)

– Stemming - reduces words to its base form without context

(ended → end )

– Stop words - remove common and irrelevant words (the, is)

Markov Chain - stochastic and memoryless process that

predicts future events based only on the current state

n-gram - predicts the next term in a sequence of n terms

based on Markov chains

Bag-of-words - represents text using word frequencies,

without context or order

tf-idf - measures word importance for a document in a

collection (corpus), by multiplying the term frequency

(occurrences of a term in a document) with the inverse

document frequency (penalizes common terms across a corpus)
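A minimal tf-idf sketch (the tiny corpus below is illustrative, and this uses the natural log; real implementations vary in smoothing):

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)           # term frequency in this document
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    idf = math.log(len(corpus) / df)          # inverse document frequency
    return tf * idf

score_common = tf_idf("the", docs[0], docs)  # in every document: idf = log(1) = 0
score_rare = tf_idf("sat", docs[0], docs)    # higher: appears in only one document
```

A term appearing in every document scores zero, which is exactly the penalty on common terms described above.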

Cosine Similarity - measures similarity between vectors, calculated as cos(θ) = (A · B) / (||A|| ||B||), which ranges from 0 to 1 for non-negative (e.g., term-frequency) vectors
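A direct implementation:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

parallel = cosine_similarity([1, 2, 3], [2, 4, 6])  # same direction: similarity 1
orthogonal = cosine_similarity([1, 0], [0, 1])      # no shared terms: similarity 0
```

Because it depends only on direction, cosine similarity ignores document length, which is why it is popular for comparing text vectors.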

Word Embedding

Maps words and phrases to numerical vectors

word2vec - trains iteratively over local word context windows, places similar words close together, and embeds sub-relationships directly into vectors, such that king − man + woman ≈ queen

Relies on one of the following:
– Continuous bag-of-words (CBOW) - predicts the word given its context
– skip-gram - predicts the context given a word

GloVe - combines both global and local word co-occurrence data to learn word similarity

BERT - accounts for word order and trains on subwords, and unlike word2vec and GloVe, BERT outputs different vectors for different uses of words (cell phone vs blood cell)

Sentiment Analysis

Extracts the attitudes and emotions from text

Polarity - measures positive, negative, or neutral opinions
– Valence shifters - capture amplifiers or negators such as 'really fun' or 'hardly fun'

Sentiment - measures emotional states such as happy or sad

Subject-Object Identification - classifies sentences as either subjective or objective

Topic Modelling

Captures the underlying themes that appear in documents

Latent Dirichlet Allocation (LDA) - generates k topics by first assigning each word to a random topic, then iteratively updating assignments based on parameters α, the mix of topics per document, and β, the distribution of words per topic

Latent Semantic Analysis (LSA) - identifies patterns using tf-idf scores and reduces data to k dimensions through SVD

Neural Network

Activation functions such as the sigmoid 1/(1 + e^−z) and tanh (e^z − e^−z)/(e^z + e^−z) introduce nonlinearity. Since a system of linear activation functions can be simplified to a single perceptron, nonlinear functions are commonly used for more accurate tuning and meaningful gradients

Loss Function - measures prediction error using functions such as MSE for regression and binary cross-entropy for probability-based classification

Gradient Descent - minimizes the average loss by moving iteratively in the direction of steepest descent, controlled by the learning rate γ (step size). Note, γ can be updated adaptively for better performance. For neural networks, finding the best set of weights involves:

1 Initialize weights W randomly with near-zero values

2 Loop until convergence:

– Calculate the average network loss J(W)
– Backpropagation - iterate backwards from the last layer, computing the gradient ∂J(W)/∂W and updating the weight W ← W − γ ∂J(W)/∂W

3 Return the minimum loss weight matrix W

To prevent overfitting, regularization can be applied by:

– Stopping training when validation performance drops
– Dropout - randomly drop some nodes during training to prevent over-reliance on a single node
– Embedding weight penalties into the objective function

Stochastic Gradient Descent - only uses a single point to compute gradients, leading to smoother convergence and faster compute speeds. Alternatively, mini-batch gradient descent trains on small subsets of the data, striking a balance between the approaches
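The loop above can be sketched for a one-parameter model; the data, learning rate, and iteration count below are illustrative assumptions:

```python
import random

# Gradient descent for simple linear regression y = w*x; the data,
# learning rate, and iteration count are illustrative assumptions
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated with true w = 2

def loss(w):
    # Mean squared error J(w)
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w):
    # dJ/dw = (2/n) * sum((w*x - y) * x)
    return 2 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = random.uniform(-0.1, 0.1)   # near-zero random initialization
gamma = 0.05                    # learning rate (step size)
for _ in range(200):
    w -= gamma * grad(w)        # step in the direction of steepest descent

print(round(w, 3))  # converges to 2.0
```

Stochastic or mini-batch variants would compute `grad` on one point or a small subset per step instead of the full data.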

Convolutional Neural Network

Analyzes structural or visual data by extracting local features

Convolutional Layers - iterate over windows of the image, applying weights, bias, and an activation function to create feature maps. Different weights lead to different feature maps

Pooling - downsamples convolution layers to reduce dimensionality and maintain spatial invariance, allowing detection of features even if they have shifted slightly. Common techniques return the max or average value in the pooling window

The general CNN architecture is as follows:

1 Perform a series of convolution, ReLU, and pooling operations, extracting important features from the data

2 Feed output into a fully-connected layer for classification, object detection, or other structural analyses

Recurrent Neural Network

Predicts sequential data using a temporally connected system that captures both new inputs and previous outputs using hidden states

RNNs can model various input-output scenarios, such as many-to-one, one-to-many, and many-to-many. Relies on parameter (weight) sharing for efficiency. To avoid redundant calculations during backpropagation, downstream gradients are found by chaining previous gradients. However, repeatedly multiplying values greater than or less than 1 leads to:
– Exploding gradients - model instability and overflows
– Vanishing gradients - loss of learning ability

This can be solved using:

– Gradient clipping - cap the maximum value of gradients
– ReLU - its derivative prevents gradient shrinkage for x > 0
– Gated cells - regulate the flow of information

Long Short-Term Memory - learns long-term dependencies using gated cells and maintains a separate cell state from what is outputted. Gates in LSTM perform the following:

1 Forget and filter out irrelevant info from previous layers

2 Store relevant info from current input

3 Update the current cell state

4 Output the hidden state, a filtered version of the cell state

LSTMs can be stacked to improve performance



Boosting

Ensemble method that learns by sequentially fitting many simple models. As opposed to bagging, boosting trains on all the data and combines weak models using the learning rate α. Boosting can be applied to many machine learning problems

AdaBoost - uses sample weighting and decision 'stumps' (one-level decision trees) to classify samples

1 Build decision stumps for every feature, choosing the one with the best classification accuracy

2 Assign more weight to misclassified samples and reward trees that differentiate them, where α = ½ ln((1 − TotalError)/TotalError)

3 Continue training and weighting decision stumps until convergence

Gradient Boost - trains sequential models by minimizing a given loss function using gradient descent at each step

1 Start by predicting the average value of the response

2 Build a tree on the errors, constrained by depth or the number of leaf nodes

3 Scale decision trees by a constant learning rate α

4 Continue training and weighting decision trees until convergence

XGBoost - fast gradient boosting method that utilizes regularization and parallelization

Recommender Systems

Suggests relevant items to users by predicting ratings and preferences, and is divided into two main types:
– Content Filtering - recommends similar items
– Collaborative Filtering - recommends what similar users like

The latter is more common, and includes methods such as:

Memory-based Approaches - finds neighborhoods by using rating data to compute user and item similarity, measured using correlation or cosine similarity
– User-User - similar users also liked
  – Leads to more diverse recommendations, as opposed to just recommending popular items
  – Suffers from sparsity, as the number of users who rate items is often low
– Item-Item - similar users who liked this item also liked
  – Efficient when there are more users than items, since the item neighborhoods update less frequently than users
  – Similarity between items is often more reliable than similarity between users

Model-based Approaches - predict ratings of unrated items, through methods such as Bayesian networks, SVD, and clustering. Handles sparse data better than memory-based approaches
– Matrix Factorization - decomposes the user-item rating matrix into two lower-dimensional matrices representing the users and items, each with k latent factors

Recommender systems can also be combined through ensemble methods to improve performance
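A memory-based user-user approach can be sketched with cosine similarity over co-rated items; the rating matrix, user names, and weighting scheme are illustrative assumptions:

```python
import math

# User-user collaborative filtering with cosine similarity; the rating
# matrix and user names are illustrative assumptions
ratings = {
    "ann":  {"m1": 5, "m2": 4, "m3": 1},
    "bob":  {"m1": 4, "m2": 5, "m3": 2},
    "carl": {"m1": 1, "m2": 2, "m3": 5, "m4": 4},
}

def cosine(u, v):
    common = set(u) & set(v)           # items both users rated
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(user, item):
    # Similarity-weighted average of neighbors' ratings for the item
    num = den = 0.0
    for other, r in ratings.items():
        if other != user and item in r:
            s = cosine(ratings[user], r)
            num += s * r[item]
            den += s
    return num / den if den else None

print(round(predict("ann", "m4"), 2))  # 4.0 - carl is the only rater of m4
```

Note the sparsity problem in miniature: only one neighbor rated m4, so the prediction rests entirely on that user.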

Reinforcement Learning

Maximizes future rewards by learning through state-action pairs. That is, an agent performs actions in an environment, which updates the state and provides a reward

Multi-armed Bandit Problem - a gambler plays slot machines with unknown probability distributions and must decide the best strategy to maximize reward. This exemplifies the exploration-exploitation tradeoff, as the best long-term strategy may involve short-term sacrifices

RL is divided into two types, with the former being more common:
– Model-free - learn through trial and error in the environment
– Model-based - access to the underlying (approximate) state-reward distribution

Q-Value Q(s, a) - captures the expected discounted total future reward given a state and action

Policy - chooses the best actions for an agent at various states: π(s) = argmaxₐ Q(s, a)

Deep RL algorithms can further be divided into two main types, depending on their learning objective

Value Learning - aims to approximate Q(s, a) for all actions the agent can take, but is restricted to discrete action spaces

Can use the ε-greedy method, where ε measures the probability of exploration. If chosen, the next action is selected uniformly at random
– Q-Learning - simple value iteration model that maximizes the Q-value using a table on states and actions
– Deep Q Network - finds the best action to take by minimizing the Q-loss, the squared error between the target Q-value and the prediction

Policy Gradient Learning - directly optimizes the policy π(s) through a probability distribution of actions, without the need for a value function, allowing for continuous action spaces

Actor-Critic Model - hybrid algorithm that relies on two neural networks, an actor π(s, a, θ) which controls agent behavior and a critic Q(s, a, w) that measures how good an action is. Both run in parallel to find the optimal weights θ, w to maximize expected reward. At each step:

1 Pass the current state into the actor and critic

2 The critic evaluates the action's Q-value, and the actor updates its weight θ

3 The actor takes the next action leading to a new state, and the critic updates its weight w

Anomaly Detection

Identifies unusual patterns that differ from the majority of the data, and can be applied in supervised, unsupervised, and semi-supervised scenarios. Assumes that anomalies are:
– Rare - the minority class that occurs rarely in the data
– Different - have feature values that are very different from normal observations

Anomaly detection techniques span a wide range, including methods based on:

Statistics - relies on various statistical methods to identify outliers, such as Z-tests, boxplots, interquartile ranges, and variance comparisons
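The interquartile-range method can be sketched with the standard boxplot convention (points beyond 1.5 × IQR from the quartiles are flagged); the data are an illustrative assumption:

```python
import statistics

# Interquartile-range (IQR) outlier flagging using the usual 1.5 * IQR
# boxplot fence; the data are an illustrative assumption
data = [9.7, 9.8, 9.9, 10.0, 10.1, 10.1, 10.2, 10.3, 25.0]

q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]
print(outliers)  # [25.0]
```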

Density - useful when data is grouped around dense neighborhoods, measured by distance. Methods include k-nearest neighbors, local outlier factor, and isolation forest
– Isolation Forest - tree-based model that labels outliers based on an anomaly score

1 Select a random feature and split value, dividing the dataset in two

2 Continue splitting randomly until every point is isolated

3 Calculate the anomaly score for each observation, based on how many iterations it took to isolate that point

4 If the anomaly score is greater than a threshold, mark it as an outlier

Intuitively, outliers are easier to isolate and should have shorter path lengths in the tree

Clusters - data points outside of clusters could potentially be marked as anomalies

Autoencoders - unsupervised neural networks that compress data and reconstruct it. The network has two parts: an encoder that embeds data to a lower dimension, and a decoder that produces a reconstruction. Autoencoders do not reconstruct the data perfectly, but rather focus on capturing important features in the data

Upon decoding, the model will accurately reconstruct normal patterns but struggle with anomalous data. The reconstruction error is used as an anomaly score to detect outliers

Autoencoders are applied to many problems, including image processing, dimension reduction, and information retrieval



MACHINE LEARNING CHEATSHEET

Summary of Machine Learning Algorithms: descriptions, advantages, and use cases

Maximum Likelihood Estimation

MLE is used to find the estimators that maximize the likelihood function (equivalently, minimize the negative log-likelihood)
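For a Bernoulli parameter the MLE has a closed form (the sample mean); the sketch below, with an illustrative sample, confirms it by comparing log-likelihoods over a grid:

```python
import math

# MLE for a Bernoulli parameter p: the closed-form MLE is the sample mean;
# we confirm it beats nearby candidates on log-likelihood (sample is an
# illustrative assumption)
sample = [1, 0, 1, 1, 0, 1, 1, 1]      # 6 successes out of 8

def log_likelihood(p):
    return sum(math.log(p) if x else math.log(1 - p) for x in sample)

p_hat = sum(sample) / len(sample)      # closed-form MLE = 0.75
best = max((i / 100 for i in range(1, 100)), key=log_likelihood)
print(p_hat, best)  # 0.75 0.75
```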

Linear Algorithms

The dimension of the hyperplane of the regression is its complexity


Learning:

Lasso penalizes the residual sum of squares with an L1 term:

Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |βⱼ| = RSS + λ Σⱼ |βⱼ|

Ridge penalizes it with an L2 term instead:

Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ βⱼ² = RSS + λ Σⱼ βⱼ²

where 𝜆 ≥ 0 is a tuning parameter to be determined
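The two penalties shrink coefficients differently: under an orthonormal design (and up to the penalty's scaling convention), lasso soft-thresholds the least-squares coefficients to exactly zero, while ridge shrinks them proportionally. A minimal sketch; the coefficients and λ are illustrative assumptions:

```python
# Effect of L1 vs L2 penalties on least-squares coefficients under an
# orthonormal design; beta values and lam are illustrative assumptions
beta_ols = [3.0, -0.4, 0.05]
lam = 0.5

# Lasso: soft-thresholding - small coefficients are set exactly to zero
lasso = [(abs(b) - lam) * (1 if b > 0 else -1) if abs(b) > lam else 0.0
         for b in beta_ols]

# Ridge: proportional shrinkage - coefficients shrink but stay nonzero
ridge = [b / (1 + lam) for b in beta_ols]

print(lasso)  # [2.5, 0.0, 0.0]
print(ridge)  # all three shrink, none become zero
```

This is why lasso performs variable selection and ridge does not.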

Usecase examples:

- Customer scoring with probability of purchase

- Classification of loan defaults according to profile

Linear Discriminant Analysis

For multiclass classification, LDA is the preferred linear


The most common Stopping Criterion for splitting is a minimum number of training observations per node

The simplest form of pruning is Reduced Error Pruning:

Starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected, then the change is kept

Advantages:

+ Easy to interpret and no overfitting with pruning
+ Works for both regression and classification problems
+ Can take any type of variables without modifications

Usecase examples:

- Fraudulent transaction classification

- Predict human resource allocation in companies

Naive Bayes Classifier

Naive Bayes is a classification algorithm interested in


Bagging and Random Forest

Random Forest is part of a bigger type of ensemble methods called Bootstrap Aggregation or Bagging. Bagging can reduce the variance of high-variance models

It uses the Bootstrap statistical procedure: estimate a quantity from a sample by creating many random subsamples with replacement, and computing the mean of each subsample
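The bootstrap procedure can be sketched directly; the data, seed, and number of resamples are illustrative assumptions:

```python
import random
import statistics

# Bootstrap estimate of a sample mean and its standard error; the data,
# seed, and number of resamples are illustrative assumptions
random.seed(42)
sample = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4]

boot_means = []
for _ in range(1000):
    # Resample with replacement, same size as the original sample
    resample = random.choices(sample, k=len(sample))
    boot_means.append(statistics.mean(resample))

estimate = statistics.mean(boot_means)        # close to the sample mean
std_error = statistics.stdev(boot_means)      # bootstrap standard error
print(round(estimate, 2), round(std_error, 2))
```

Random Forest applies the same resampling idea, fitting a decorrelated tree to each bootstrap subsample instead of computing a mean.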


Boosting and AdaBoost

AdaBoost was the first successful boosting algorithm

Its model error is computed as the weighted sum of the misclassification rates

Posted: 09/09/2022, 09:40