All important cheat sheets
DATA 1010, Brown University, Samuel S. Watson
25/03/2021 COMPILED BY ABHISHEK PRASAD
Follow me on LinkedIn: www.linkedin.com/in/abhishek-prasad-ap
Sets and Functions: Sets
1 A set is an unordered collection of objects. The objects in
a set are called elements.
2 The cardinality of a set is the number of elements it
contains. The empty set ∅ is the set with no elements.
3 If every element of A is also an element of B, then we say
A is a subset of B and write A ⊂ B. If A ⊂ B and B ⊂ A,
then we say that A = B.
5 Two sets A and B are disjoint if A ∩ B = ∅ (in other
words, if they have no elements in common).
6 A partition of a set is a collection of nonempty disjoint
subsets whose union is the whole set.
7 The Cartesian product of A and B is
A × B = {(a, b) : a ∈ A and b ∈ B}.
8 (De Morgan’s laws) If A, B ⊂ Ω, then
(i) (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ, and
(ii) (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ.
9 A list is an ordered collection of finitely many objects.
Lists differ from sets in that (i) order matters, (ii) repetition
matters, and (iii) the cardinality is restricted.
Sets and Functions: Functions
1 If A and B are sets, then a function f : A→ B is an
assignment of some element of B to each element of A.
2 The set A is called the domain of f and B is called the
codomain of f
3 Given a subset A₀ of A, we define the image of A₀ under
f, denoted f(A₀), to be the set of elements which are
mapped to from some element in A₀.
4 The range of f is the image of the domain of f
5 The composition of two functions f : A → B and
g : B → C is the function g ◦ f which maps a ∈ A to
g(f (a)) ∈ C.
6 The identity function on a set A is the function f : A→
A which maps each element to itself.
7 A function f is injective if no two elements in the domain
map to the same element in the codomain.
8 A function f is surjective if the range of f is equal to the
codomain of f
9 A function f is bijective if it is both injective and
surjective. If f is bijective, then the function from B to A that maps
b ∈ B to the element a ∈ A that satisfies f(a) = b is called
the inverse of f.
10 If f : A → B is bijective, then the function f⁻¹ ∘ f is
equal to the identity function on A, and f ∘ f⁻¹ is the
identity function on B.
Programming in Julia
1 A value is a fundamental entity that may be manipulated
by a program. Values have types; for example, 5 is an Int and
"Hello world!" is a String.
2 A variable is a name used to refer to a value. We can
assign a value 5 to a variable x using x = 5.
3 A function performs a particular task. You prompt a
function to perform its task by calling it. Values supplied to
a function are called arguments. For example, in the function
call print(1, 2), the values 1 and 2 are arguments.
4 An operator is a function that can be called in a special
way. For example, * is an operator since we can call the
multiplication function with the syntax 3 * 5.
5 A statement is an instruction to be executed (like x = -3).
6 An expression is a combination of values, variables,
operators, and function calls that a language interprets and
evaluates to a value.
7 A numerical value can be either an integer or a float. The
basic operations are +, -, *, /, and ^, and expressions are
evaluated according to the order of operations.
8 Numbers can be compared using <, ==, or ≥.
9 Textual data is represented using strings. length(s)
returns the number of characters in s. The * operator
concatenates strings.
10 A boolean is a value which is either true or false.
Booleans can be combined with the operators && (and), ||
(or), and ! (not).
13 The scope of a variable is the region in the program
where it is accessible. Variables defined in the body of a
function are not accessible outside the body of the function.
14 Array is a compound data type for storing lists of
objects. Entries of an array may be accessed with square-bracket
syntax using an index or using a range object:
A = [-5, 2]    A[2]    A[end]
15 An array comprehension can be used to generate new
arrays: [k for k = 1:10 if mod(k, 2) == 0]
16 A dictionary encodes a discrete function by storing
input-output pairs and looking up input values when
indexed. This expression returns [0, 1.0]:
Dict("blue" => [0, 1.0], "red" => [1.0, 0])["blue"]
17 A while loop evaluates its conditional expression and its
body alternatingly until the conditional expression returns
false. A for loop evaluates its body once
for each entry in a given iterator (for example, a range, array,
or dictionary). Each value in the iterator is assigned to a loop
variable which can be referenced in the body of the loop.
while x > 0
    x -= 1
end
for i = 1:10
    print(i)
end
Linear Algebra: Vector Spaces
1 A vector in Rⁿ is a column of n real numbers, also written
as [v₁, ..., vₙ]. A vector may be depicted as an arrow from
the origin in n-dimensional space. The norm of a vector v is
the length √(v₁² + · · · + vₙ²) of its arrow.
2 The fundamental vector space operations are vector
addition and scalar multiplication.
3 A linear combination of a list of vectors v₁, ..., vₖ is an
expression of the form
c₁v₁ + c₂v₂ + · · · + cₖvₖ,
where c₁, ..., cₖ are real numbers. The c's are called the
weights of the linear combination.
4 The span of a list L of vectors is the set of all vectors which
can be written as a linear combination of the vectors in L.
5 A list of vectors is linearly independent if and only if the
only linear combination which yields the zero vector is the one with all weights zero.
6 A vector space is a nonempty set of vectors which is
closed under the vector space operations.
7 A list of vectors in a vector space is a spanning list of that
vector space if every vector in the vector space can be written
as a linear combination of the vectors in that list.
8 A linearly independent spanning list of a vector space is
called a basis of that vector space The number of vectors in
a basis of a vector space is called the dimension of the space.
9 A linear transformation L is a function from a vector
space V to a vector space W which satisfies L(cv + w) =
cL(v) + L(w) for all c ∈ R and v, w ∈ V. These are “flat maps”:
equally spaced lines are mapped to equally spaced lines or
points. Examples: scaling, rotation, projection, reflection.
10 Given two vector spaces V and W, a basis {v₁, ..., vₙ}
of V, and a list {w₁, ..., wₙ} of vectors in W, there exists one
and only one linear transformation which maps v₁ to w₁, v₂
to w₂, and so on.
11 The rank of a linear transformation from one vector
space to another is the dimension of its range.
12 The null space of a linear transformation is the set of
vectors which are mapped to the zero vector by the linear transformation.
13 The rank of a transformation plus the dimension of its
null space is equal to the dimension of its domain (the
rank-nullity theorem).
Linear Algebra: Matrix Algebra
1 The matrix-vector product Ax is the linear combination
of the columns of A with weights given by the entries of x.
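The column-combination view of Ax can be checked directly; here is a small sketch (in Python, for testability; the sheet's own snippets are Julia) comparing the row-by-row definition with the weighted sum of columns, on a made-up 2 × 3 matrix.

```python
A = [[1, 2, 3],
     [4, 5, 6]]
x = [10, 20, 30]

# Row-by-row definition of the matrix-vector product.
def matvec(A, x):
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

# The same result as a linear combination of A's columns with weights x.
def column_combination(A, x):
    n_rows, n_cols = len(A), len(A[0])
    result = [0] * n_rows
    for j in range(n_cols):
        for i in range(n_rows):
            result[i] += x[j] * A[i][j]
    return result

print(matvec(A, x))               # [140, 320]
print(column_combination(A, x))   # [140, 320]
```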
2 Linear transformations from Rⁿ to Rᵐ are in one-to-one
correspondence with m × n matrices.
3 The identity transformation corresponds to the identity
matrix, which has entries of 1 along the diagonal and zero
entries elsewhere.
4 Matrix multiplication corresponds to composition of
the corresponding linear transformations: AB is the matrix
for which (AB)(x) = A(Bx) for all x.
5 An m × n matrix is full rank if its rank is equal to
min(m, n).
6 Ax = b has a solution x if and only if b is in the span of
the columns of A. If Ax = b does have a solution, then the
solution is unique if and only if the columns of A are linearly
independent. If Ax = b does not have a solution, then there
is a unique vector x which minimizes |Ax − b|².
7 If the columns of a square matrix A are linearly
independent, then it has a unique inverse matrix A⁻¹ with the
property that Ax = b implies x = A⁻¹b for all x and b.
8 Matrix inversion satisfies (AB)⁻¹ = B⁻¹A⁻¹ if A and B are
both invertible.
9 The transpose A′ of a matrix A is defined so that the rows
of A′ are the columns of A (and vice versa).
10 The transpose is a linear operator: (cA + B)′ = cA′ + B′
if c is a constant and A and B are matrices.
11 The transpose distributes across matrix multiplication
but with an order reversal: (AB)′ = B′A′ if A and B are
matrices for which AB is defined.
15 det(AB) = (det A)(det B) and det(A⁻¹) = (det A)⁻¹.
16 A square matrix is invertible if and only if its
determinant is nonzero.
Linear Algebra: Orthogonality
1 The dot product of two vectors in Rⁿ is defined by
x·y = x₁y₁ + x₂y₂ + · · · + xₙyₙ.
2 x·y = ‖x‖‖y‖ cos θ, where x, y ∈ Rⁿ and θ is the angle
between the vectors.
3 x·y = 0 if and only if x and y are orthogonal.
4 The dot product is linear: x·(cy + z) = c(x·y) + x·z.
5 The orthogonal complement of a subspace V⊂ R n is the set of vectors which are orthogonal to every vector in V.
6 The orthogonal complement of the span of the columns
of a matrix A is equal to the null space of A′.
7 rank A = rank(A′A) for any matrix A.
8 A list of vectors satisfying vᵢ·vⱼ = 0 for i ≠ j is
orthogonal. An orthogonal list of unit vectors is orthonormal.
9 Every orthogonal list of nonzero vectors is linearly
independent.
10 A matrix U has orthonormal columns if and only if
U′U = I. A square matrix with orthonormal columns is
called orthogonal. An orthogonal matrix and its transpose
are inverses.
11 Orthogonal matrices represent rigid transformations
(ones which preserve lengths and angles).
12 If U has orthonormal columns, then UU′ is the matrix
which represents projection onto the span of the columns of U.
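A quick sketch of this projection (in Python, for testability; the sheet's own snippets are Julia). The matrix U below is hand-picked: its orthonormal columns are e₁ and e₂ in R³, so UU′ projects onto the xy-plane.

```python
# Minimal matrix helpers, for illustration only.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

U = [[1, 0],
     [0, 1],
     [0, 0]]                   # orthonormal columns e1, e2
P = matmul(U, transpose(U))    # projection matrix UU'

x = [[3], [4], [5]]            # a column vector in R^3
print(matmul(P, x))            # [[3], [4], [0]]: the e3 component is removed
```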
Linear Algebra: Spectral Analysis
1 An eigenvector v of an n × n matrix A is a nonzero
vector with the property that Av = λv for some λ ∈ R. We call
λ an eigenvalue.
If v is an eigenvector of A, then A maps the line span({ v})
to itself:
3 Not every n × n matrix A has n linearly independent
eigenvectors. If A does have n linearly independent
eigenvectors, we can make a matrix V with these eigenvectors as
columns and get A = VΛV⁻¹, where Λ is the diagonal matrix
of eigenvalues.
5 If A is symmetric, then A may be written as A = VΛV′,
where V is an orthogonal matrix (the spectral theorem).
6 A symmetric matrix is positive semidefinite if its
eigenvalues are all nonnegative. We define the square root of a
positive semidefinite matrix A = VΛV′ to be V√Λ V′, where
√Λ is obtained by applying the square root function
elementwise.
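A Python sketch of the square root V√Λ V′ for the hand-picked 2 × 2 example A = [[2, 1], [1, 2]], whose eigenvalues are 3 and 1 with orthonormal eigenvectors (1, 1)/√2 and (1, −1)/√2 (chosen so no eigensolver is needed); squaring the result should recover A.

```python
import math

s = 1 / math.sqrt(2)
V = [[s, s], [s, -s]]                       # eigenvectors as columns
sqrt_Lambda = [[math.sqrt(3), 0], [0, 1]]   # elementwise sqrt of Λ = diag(3, 1)

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

Vt = [list(col) for col in zip(*V)]
sqrtA = matmul(matmul(V, sqrt_Lambda), Vt)   # V √Λ V'
A_back = matmul(sqrtA, sqrtA)                # should recover A

print([[round(v, 10) for v in row] for row in A_back])  # [[2.0, 1.0], [1.0, 2.0]]
```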
Linear Algebra: SVD
1 The Gram matrix A′A of any m × n matrix A is positive
semidefinite. Furthermore, |√(A′A) x| = |Ax| for all x ∈ Rⁿ.
2 The singular value decomposition is the factorization of
any rectangular m × n matrix A as UΣV′, where U and V are
orthogonal and Σ is an m × n diagonal matrix (with diagonal
entries in decreasing order).
3 The diagonal entries of Σ are the singular values of A,
and the columns of U and V are called left singular vectors
and right singular vectors, respectively. A maps each right
singular vector vᵢ to the corresponding left singular vector
uᵢ scaled by σᵢ.
4 The vectors in Rnstretched the most by A are the ones
which run in the direction of the column or columns of V
corresponding to the greatest singular value Same for least.
5 For k ≥ 1, the k-dimensional vector space with minimal
sum of squared distances to the columns of A (interpreted
as points inRm ) is the span of the first k columns of U.
6 The absolute value of the determinant of a square matrix
is equal to the product of its singular values.
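The determinant fact can be checked on a small example in Python, using the fact (item 1 above) that the singular values are the square roots of the eigenvalues of A′A; the 2 × 2 matrix below is hand-picked so everything works out in closed form.

```python
import math

# For A = [[3, 0], [4, 5]], |det A| = 15; the singular values should multiply to 15.
A = [[3, 0], [4, 5]]
AtA = [[25, 20], [20, 25]]   # A'A, computed by hand for this example

# Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]].
a, b, c = AtA[0][0], AtA[0][1], AtA[1][1]
mean, half_gap = (a + c) / 2, math.sqrt(((a - c) / 2) ** 2 + b ** 2)
eig1, eig2 = mean + half_gap, mean - half_gap   # 45 and 5

sigma1, sigma2 = math.sqrt(eig1), math.sqrt(eig2)
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
print(round(sigma1 * sigma2, 9), abs(det_A))    # 15.0 15
```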
Multivariable Calculus
1 A sequence of real numbers (xₙ)ₙ₌₁^∞ = x₁, x₂, ...
converges to a number x ∈ R if the distance from xₙ to x on the
number line can be made as small as desired by choosing n
sufficiently large. We say limₙ→∞ xₙ = x or xₙ → x.
2 (Squeeze theorem) If aₙ ≤ bₙ ≤ cₙ for all n ≥ 1 and if
limₙ→∞ aₙ = limₙ→∞ cₙ = b, then bₙ → b as n → ∞.
3 (Comparison test) If ∑ₙ₌₁^∞ bₙ converges and if |aₙ| ≤ bₙ
for all n, then ∑ₙ₌₁^∞ aₙ converges.
4 The geometric series ∑ₙ₌₀^∞ aⁿ converges if and only if
−1 < a < 1.
5 The Taylor series, centered at c, of an infinitely
differentiable function f is defined to be
f(c) + f′(c)(x − c) + (f″(c)/2!)(x − c)² + (f‴(c)/3!)(x − c)³ + · · ·
8 Given f : Rⁿ → Rᵐ, we define ∂f/∂x to be the matrix
whose (i, j)th entry is ∂fᵢ/∂xⱼ. Then
(i) ∂/∂x (Ax) = A,
(ii) ∂/∂x (x′A) = A′,
(iii) ∂/∂x (u′v) = u′ ∂v/∂x + v′ ∂u/∂x.
9 A function of two variables is differentiable at a point if
its graph looks like a plane when you zoom in sufficiently
around the point More generally, a function f :Rn→ R m is
differentiable at x if it is well-approximated by its derivative near x:
10 The Hessian H of f : Rⁿ → R is the matrix of its
second-order derivatives: Hᵢ,ⱼ(x) = ∂²f/(∂xᵢ∂xⱼ).
11 If f is a continuous function on a closed and bounded
region D ⊂ Rⁿ, then:
(i) f realizes an absolute maximum and absolute minimum
on D (the extreme value theorem).
(ii) Any point where f realizes an extremum is either a critical point—meaning that ∇f = 0 or f is non- differentiable at that point—or at a point on the bound- ary.
(iii) (Lagrange multipliers) If f realizes an extremum at a
point on a portion of the boundary which is the level set
of a differentiable function g with non-vanishing gradient
∇g, then either f is non-differentiable at that point
or the equation
∇f = λ∇g
is satisfied at that point, for some λ ∈ R.
12 If r : R¹ → R² and f : R² → R¹, then
d/dt f(r(t)) = ∇f(r(t)) · r′(t) (the chain rule).
13 The integral ∬_D f(x, y) dx dy can be interpreted as the
mass of an object occupying the region D and having mass
density f(x, y) at each point (x, y).
14 Double integration over D: the bounds for the outer
integral are the smallest and largest values of y for any point
in D, and the bounds for the inner integral are the smallest
and largest values of x for any point in a given “y = constant”
slice of the region.
15 Polar integration over D: the outer integral bounds are
the least and greatest values of θ for a point in D, and the
inner integral bounds are the least and greatest values of r
for any point in D along each given “θ = constant” ray. The
area element is dA = r dr dθ.
Numerical Computation: machine arithmetic
1 Computers store numerical values as sequences of bits.
The type of a numeric value specifies how to interpret the
underlying sequence of bits as a number.
2 The Int64 type uses 64 bits to represent the integers from
−2⁶³ to 2⁶³ − 1. For 0 ≤ n ≤ 2⁶³ − 1, we represent n using its
binary representation, and for 1 ≤ n ≤ 2⁶³, we represent −n
using the binary representation of 2⁶⁴ − n. Int64 arithmetic
is performed modulo 2⁶⁴.
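A Python sketch of this two's-complement convention (Python integers are arbitrary-precision and never wrap, so the mod-2⁶⁴ reduction is written out explicitly):

```python
# Reduce an integer to the Int64 value it would wrap to.
def to_int64(n):
    n %= 2 ** 64                        # arithmetic is performed mod 2^64
    return n - 2 ** 64 if n >= 2 ** 63 else n

print(to_int64(2 ** 63 - 1))    # 9223372036854775807 (largest Int64)
print(to_int64(2 ** 63))        # -9223372036854775808 (overflow wraps around)
print(to_int64(-5))             # -5 (stored as the binary representation of 2^64 - 5)
```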
3 The Float64 type uses 64 bits to represent real numbers.
We call the first bit σ, the next 11 bits (interpreted as a binary
integer) e ∈ [0, 2047], and the final 52 bits f ∈ [0, 2⁵² − 1]. If
e ∉ {0, 2047}, then the number represented by (σ, e, f) is
x = (−1)^σ · 2^(e−1023) · (1 + f/2⁵²).
The representable numbers between consecutive powers of
2 are the ones obtained by 52 recursive iterations of binary subdivision The value of e indicates the powers of 2 that
x is between, and the value of f indicates the position of x between those powers of 2.
The Float64 exponent value e = 2047 is reserved for Inf
and NaN, while e = 0 is reserved for the subnormal numbers.
(The largest finite representable value is roughly 1.8 × 10³⁰⁸.)
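The (σ, e, f) decomposition can be checked in Python with the standard struct module (the value 6.5 is an arbitrary example; 6.5 = 2² · 1.625):

```python
import struct

# Extract the sign, exponent, and fraction fields of a Float64.
def decode(x):
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sigma = bits >> 63                 # 1 sign bit
    e = (bits >> 52) & 0x7FF           # 11 exponent bits
    f = bits & ((1 << 52) - 1)         # 52 fraction bits
    return sigma, e, f

sigma, e, f = decode(6.5)
rebuilt = (-1) ** sigma * 2 ** (e - 1023) * (1 + f / 2 ** 52)
print(sigma, e - 1023, f / 2 ** 52)    # 0 2 0.625
print(rebuilt)                         # 6.5
```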
4 The BigInt and BigFloat types use an arbitrary number
of bits and can handle very large numbers or very high
precision. Computations are much slower than for 64-bit types.
Numerical Computation: Error
1 If Â is an approximation for A, then the relative error is
|Â − A| / |A|.
2 Roundoff error comes from rounding numbers to fit
them into a floating point representation.
3 Truncation error comes from using approximate
mathematical formulas or algorithms.
4 Statistical error arises from using randomness in an
approximation.
5 The condition number of a function measures how much it
stretches or compresses relative error. The condition
number of a problem is the condition number of the map from
the problem's initial data a to its solution S(a); for
differentiable S, it is κ(a) = |a S′(a)/S(a)|.
For example, the condition number of the map a ↦ a − b is
|a| / |a − b|, which is large when a ≈ b (so subtracting b is
ill-conditioned near b; this is called catastrophic
cancellation).
9 An algorithm which solves a problem with error much
greater than κ εmach is unstable. An algorithm is unstable if
at least one of the steps it performs is ill-conditioned. If every
step of an algorithm is well-conditioned, then the algorithm
is stable.
10 The condition number of a matrix A is defined to be
the maximum condition number of the function x ↦ Ax
over its domain. The condition number is equal to the ratio
of the largest to the smallest singular value of A.
Numerical Computation: PRNGs
1 A pseudorandom number generator (PRNG) is an
algorithm for generating a deterministic sequence of numbers
which is intended to share properties with a sequence of
random numbers. The PRNG's initial value is called its seed.
2 The linear congruential generator: fix positive integers
M, a, and c, and consider a seed X₀ ∈ {0, 1, ..., M − 1}. We
return the sequence X₀, X₁, X₂, ..., where
Xₙ = mod(aXₙ₋₁ + c, M) for n ≥ 1.
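A minimal Python sketch of a linear congruential generator; the parameters a, c, M below are one commonly quoted textbook choice, used here only for illustration.

```python
# Linear congruential generator: X_n = mod(a*X_{n-1} + c, M).
def lcg(seed, a=1664525, c=1013904223, M=2 ** 32):
    x = seed
    while True:
        x = (a * x + c) % M
        yield x

gen = lcg(seed=0)
print([next(gen) for _ in range(3)])   # [1013904223, 1196435762, 3519870697]
```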
3 The period of a PRNG is the minimum length of a
repeating block. A long period is a desirable property of a PRNG,
and a very short period is typically unacceptable.
4 Frequency tests check whether blocks of terms appear
with the appropriate frequency (for example, we can check
whether a₂ₙ > a₂ₙ₋₁ for roughly half of the values of n).
Numerical Computation: Automatic Differentiation
1 A dual number is an object that can be substituted into
a function f to yield both the value of the function and its derivative at a point x.
2 If f is a function which can act on matrices, then the
matrix [x 1; 0 x] represents a dual number at x, since
f([x 1; 0 x]) = [f(x) f′(x); 0 f(x)] (this identity can
be checked to hold for f + g and f·g whenever it holds for f
and g, and it holds for the identity function).
3 To find the derivative of f with automatic differentiation, every step in the computation of f must be dual-number- aware See the packages ForwardDiff (for Julia) and autograd (for Python).
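A minimal Python sketch of the idea, using (value, derivative) pairs with overloaded sum and product rules rather than the matrix representation above (the function f below is a made-up example):

```python
# A dual number carries a value and a derivative through a computation.
class Dual:
    def __init__(self, val, der):
        self.val, self.der = val, der
    def __add__(self, other):
        # sum rule: (f + g)' = f' + g'
        return Dual(self.val + other.val, self.der + other.der)
    def __mul__(self, other):
        # product rule: (fg)' = f'g + fg'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

def f(x):
    return x * x * x + x            # f(x) = x^3 + x, so f'(x) = 3x^2 + 1

result = f(Dual(2.0, 1.0))          # seed derivative 1 at x = 2
print(result.val, result.der)       # 10.0 13.0
```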
Numerical Computation: Optimization
1 Gradient descent seeks to minimize f : Rⁿ → R by
repeatedly stepping in f's direction of maximum decrease. We
begin with a value x₀ ∈ Rⁿ and repeatedly update using the
rule xₙ₊₁ = xₙ − ε∇f(xₙ), where ε is the learning rate.
We fix a small number τ > 0 and stop when |∇f(xₙ)| < τ.
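A minimal Python sketch of this update rule, on the made-up objective f(x, y) = x² + y² with gradient (2x, 2y) and minimum at the origin:

```python
def grad_f(x):
    return [2 * x[0], 2 * x[1]]        # gradient of f(x, y) = x^2 + y^2

def gradient_descent(x0, lr=0.1, tol=1e-8, max_iters=10_000):
    x = list(x0)
    for _ in range(max_iters):
        g = grad_f(x)
        if sum(gi ** 2 for gi in g) ** 0.5 < tol:   # stop when |grad f| < tau
            break
        x = [xi - lr * gi for xi, gi in zip(x, g)]  # x <- x - eps * grad f(x)
    return x

x_min = gradient_descent([3.0, -4.0])
print([round(v, 6) for v in x_min])    # [0.0, -0.0] (approximately the origin)
```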
2 A function is strictly convex if its Hessian is positive
definite everywhere. A strictly convex function has at
most one local minimum, and any local minimum is also a
global minimum. Gradient descent will find the global
minimum for a convex function, but for non-convex functions
it can get stuck in a local minimum.
3 Algorithms similar to gradient descent but with usually
faster convergence: conjugate gradient, BFGS, L-BFGS.
Probability: Counting
1 Fundamental principle of counting: If one experiment
has m possible outcomes, and if a second experiment has n
possible outcomes for each of the outcomes in the first
experiment, then there are mn possible outcomes for the pair
of experiments.
3 Permutations: there are n!/(n−r)! ordered r-tuples of
distinct elements of an n-element set S.
4 Combinations: The number of r-element subsets of an
n-element set is (n choose r) = n!/(r!(n−r)!).
Probability: Probability Spaces
1 Given a random experiment, the set of possible outcomes
is called the sample space Ω, like{( H , H ), ( H , T ), ( T , H ), ( T , T )}.
2 We associate with each outcome ω ∈ Ω a probability
mass, denoted m(ω). For example, m((H, T)) = 1/4 for two
flips of a fair coin.
3 In a random experiment, an event is a predicate that can
be determined based on the outcome of the experiment (like
“first flip turned up heads”) Mathematically, an event is a
subset of Ω (like {( H , H ), ( H , T )}).
4 Basic set operations ∪, ∩, and ᶜ correspond to
disjunction, conjunction, and negation of events:
(i) The event that E happens or F happens is E ∪ F.
(ii) The event that E happens and F happens is E ∩ F.
(iii) The event that E does not happen is E c
5 If E and F cannot both occur (that is, E ∩ F = ∅), we say
that E and F are mutually exclusive or disjoint.
6 If E’s occurrence implies F’s occurrence, then E ⊂ F.
7 The probability P(E) of an event E is the sum of the
probability masses of the outcomes in that event. The domain of
P is 2^Ω, the set of all subsets of Ω.
8 The pair (Ω, P) is called a probability space. The
fundamental probability space properties are
(i) P(Ω ) = 1 — “something has to happen”
(ii) P(E) ≥ 0 — “probabilities are non-negative”
(iii) P(E ∪ F) = P(E) + P(F) if E and F are mutually
exclusive — “probability is additive”.
9 Other properties which follow from the fundamental
ones:
(i) P(∅ ) = 0
(ii) P(E c ) = 1 − P(E)
(iii) E ⊂ F =⇒ P(E) ≤ P(F) (monotonicity)
(iv) P(E ∪ F) = P(E) + P(F) − P(E ∩ F) (principle of
inclusion-exclusion).
Probability: Random Variables
1 A random variable is a number which depends on the
result of a random experiment (one’s lottery winnings, for
example) Mathematically, a random variable is a function
X from the sample space Ω to R.
2 The distribution of a random variable X is the
probability measure on R which maps each set A ⊂ R to P(X ∈ A).
The probability mass function of the distribution of X may
be obtained by pushing forward the probability mass from
each ω ∈ Ω: m_X(x) = ∑_{ω : X(ω) = x} m(ω).
3 The cumulative distribution function (CDF) of a
random variable X is the function F_X(x) = P(X ≤ x).
4 The joint distribution of two random variables X and
Y is the probability measure on R² which maps A ⊂ R² to
P((X, Y) ∈ A). The probability mass function of the joint
distribution is m_{X,Y}(x, y) = P(X = x and Y = y).
Probability: Conditional Probability
1 Given a probability space Ω and an event E ⊂ Ω, the
conditional probability measure given E is an updated
probability measure on Ω which accounts for the information that
the result ω of the random experiment falls in E:
P(F | E) = P(F ∩ E) / P(E).
2 The conditional probability mass function of Y given
{X = x} is m_{Y | X=x}(y) = m_{X,Y}(x, y) / m_X(x).
3 Bayes’ theorem tells us how to update beliefs in light
of new evidence. It relates the conditional probabilities
P(A | E) and P(E | A):
P(A | E) = P(E | A)P(A) / P(E)
         = P(E | A)P(A) / (P(E | A)P(A) + P(E | Aᶜ)P(Aᶜ)).
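A quick numeric sketch of Bayes' theorem in Python (the prevalence, detection, and false-alarm numbers are made up for illustration):

```python
# A test with 99% detection rate and 5% false-alarm rate,
# for a condition A with 1% prevalence; E is the event "test positive".
p_A = 0.01                 # P(A): prior
p_E_given_A = 0.99         # P(E | A)
p_E_given_notA = 0.05      # P(E | complement of A)

p_E = p_E_given_A * p_A + p_E_given_notA * (1 - p_A)   # total probability
p_A_given_E = p_E_given_A * p_A / p_E                  # Bayes' theorem

print(round(p_A_given_E, 4))   # 0.1667: a positive result is far from conclusive
```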
4 Two events E and F are independent if P(E ∩ F) = P(E)P(F).
5 Two random variables X and Y are independent if every
pair of events of the form {X ∈ A} and {Y ∈ B} is
independent, where A ⊂ R and B ⊂ R.
6 The PMF of the joint distribution of a pair of independent
random variables factors as m_{X,Y}(x, y) = m_X(x) m_Y(y).
Probability: Expectation and Variance
1 The expectation E[X] (or mean µ_X) of a random variable
X is the probability-weighted average of X:
E[X] = ∑_{ω ∈ Ω} X(ω) m(ω).
2 The expectation E[X] may be thought of as the value of
a random game with payout X, or as the long-run average
of X over many independent runs of the underlying
experiment. The Monte Carlo approximation of E[X] is obtained
by simulating the experiment many times and averaging the
value of X.
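A Monte Carlo sketch in Python, for X = the number of heads in two fair coin flips (so E[X] = 1), with a seeded PRNG for reproducibility:

```python
import random

random.seed(0)

def run_experiment():
    # One run: flip two fair coins and count heads.
    return sum(random.random() < 0.5 for _ in range(2))

n = 100_000
estimate = sum(run_experiment() for _ in range(n)) / n
print(abs(estimate - 1.0) < 0.02)   # True: the estimate lands close to E[X] = 1
```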
3 The expectation is the center of mass of the distribution
of X.
4 The expectation of a function of a discrete random
variable (or two random variables) may be expressed in terms of
the PMF m_X of the distribution of X (or the PMF m_{X,Y} of
the joint distribution of X and Y):
E[g(X)] = ∑_{x∈R} g(x) m_X(x),
E[g(X, Y)] = ∑_{(x,y)∈R²} g(x, y) m_{X,Y}(x, y).
5 Expectation is linear: if c∈ R and X and Y are random variables defined on the same probability space, then
E[cX + Y] = cE[X] + E[Y]
6 The variance Var(X) = E[(X − µ_X)²] of a random variable
is its average squared deviation from its mean. The variance
measures how spread out the distribution of X is. The
standard deviation σ(X) is the square root of the variance.
7 Variance satisfies the following properties, if X and Y are
independent random variables and a ∈ R:
Var(aX) = a² Var(X),   Var(X + Y) = Var(X) + Var(Y).
8 The covariance of two random variables X and Y is the
expected product of their deviations from their respective
means µ_X = E[X] and µ_Y = E[Y]:
Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = E[XY] − E[X]E[Y].
9 The covariance of two independent random variables is zero, but zero covariance does not imply independence.
10 The correlation of two random variables is their
normalized covariance:
Corr(X, Y) = Cov(X, Y) / (σ(X)σ(Y)) ∈ [−1, 1].
11 The covariance matrix of a vector X = [X₁, ..., Xₙ] of
random variables defined on the same probability space is
defined to be the matrix Σ whose (i, j)th entry is equal to
Cov(Xᵢ, Xⱼ). If E[X] = 0, then Σ = E[XX′].
Probability: Continuous Distributions
1 If Ω ⊂ Rⁿ and P(A) = ∫_A f, where f ≥ 0 and ∫_{Rⁿ} f = 1,
then we call (Ω, P) a continuous probability space. (For a
density f on R, P([a, b]) is the area under the graph of f
between a and b.)
2 The function f is called a density, because it measures
the amount of probability mass per unit volume at each point (2D volume = area, 1D volume = length).
3 If (X, Y) is a pair of random variables whose joint
distribution has density f_{X,Y} : R² → R, then the conditional
distribution of Y given the event {X = x} has density
f_{Y | X=x}(y) = f_{X,Y}(x, y) / f_X(x),
where f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy is the PDF of X.
4 If a random variable X has density f_X on R, then
E[g(X)] = ∫_R g(x) f_X(x) dx.
5 CDF sampling: F⁻¹(U) has CDF F if f_U = 1_{[0,1]} (that is,
if U is uniformly distributed on [0, 1]).
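A Python sketch of CDF sampling for the Exp(λ) distribution, whose CDF F(x) = 1 − e^(−λx) inverts to F⁻¹(u) = −ln(1 − u)/λ:

```python
import math
import random

random.seed(0)
lam = 2.0   # an arbitrary rate parameter for illustration

def sample_exponential(lam):
    u = random.random()              # Uniform(0, 1)
    return -math.log(1 - u) / lam    # apply the inverse CDF

samples = [sample_exponential(lam) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(abs(mean - 1 / lam) < 0.01)    # True: sample mean is close to 1/lambda = 0.5
```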
Probability: Conditional Expectation
1 The conditional expectation of a random variable given
an event is the expectation of the random variable calculated
with respect to the conditional probability measure given
that event: if (X, Y) has PMF m_{X,Y}, then
E[Y | X = x] = ∑_{y∈R} y m_{Y | X=x}(y),
where m_{Y | X=x}(y) = m_{X,Y}(x, y) / m_X(x). If (X, Y) has
PDF f_{X,Y}, then
E[Y | X = x] = ∫_R y f_{Y | X=x}(y) dy.
2 The conditional expectation of a random variable Y given another random variable X is obtained by substituting X for
x in the expression for the conditional expectation of Y given
X = x Thus E[Y | X] is a random variable.
3 If X and Y are independent, then E[Y | X] = E[Y]. If Z is
a function of X, then E[ZY | X] = Z E[Y | X].
4 The law of iterated expectation: E[E[Y | X]] = E[Y].
Probability: Common Distributions
1 Bernoulli (Ber(p)): A weighted coin flip: m(1) = p,
m(0) = 1 − p.
2 Binomial (Bin(n, p)): The number of successes in n
independent Ber(p) trials: m(k) = (n choose k) pᵏ(1 − p)ⁿ⁻ᵏ.
3 Geometric (Geom(p)): Time to first success (1) in a
sequence of independent Ber(p)'s.
5 Exponential distribution (Exp(λ)): Limit as n → ∞ of the
distribution of 1/n times a Geom(λ/n) random variable.
6 Normal distribution (N(µ, σ²)): the limiting distribution
of standardized sums of i.i.d. random variables with
E[X₁] = µ and Var(X₁) = σ² < ∞ (see the Central Limit Theorem).
7 Multivariate normal distribution (N(0, Σ)): the
distribution of √Σ Z, where Z = [Z₁, ..., Zₙ] is a vector of
independent N(0, 1) random variables.
Probability: Limit Theorems
2 A sequence ν₁, ν₂, ... of probability measures on Rⁿ
converges to a probability measure ν if νₙ(A) → ν(A) for every
set A ⊂ Rⁿ with the property that ν(∂A) = 0 (intuitively, two
measures are close if they put approximately the same
amount of mass in approximately the same places). We say
Xₙ converges in distribution to ν if the distribution of Xₙ
converges to ν.
3 Chebyshev’s inequality: if X is a random variable with
variance σ² < ∞, then X differs from its mean by more than
k standard deviations with probability at most k⁻²:
P(|X − E[X]| ≥ kσ) ≤ 1/k².
6 We define the standardized running sum of X₁, X₂, ...
to have zero mean and unit variance for all n ≥ 1:
S* = (X₁ + X₂ + · · · + Xₙ − nµ) / (σ√n).
7 Central limit theorem: the sequence of standardized
sums of an i.i.d. sequence of finite-variance random variables
converges in distribution to N(0, 1): for any interval [a, b],
we have
P(S* ∈ [a, b]) → ∫_a^b (1/√(2π)) e^(−x²/2) dx.
9 The central limit theorem explains the ubiquity of the normal distribution in statistics: many random quantities may be realized as a sum of a multitude of independent con- tributions.
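A simulation sketch of the central limit theorem in Python, standardizing sums of Uniform(0, 1) draws (which have µ = 1/2 and σ² = 1/12); the standardized sums should look roughly standard normal.

```python
import random
import statistics

random.seed(0)
mu, sigma = 0.5, (1 / 12) ** 0.5
n = 1_000

def standardized_sum():
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (sigma * n ** 0.5)   # S* = (sum - n*mu)/(sigma*sqrt(n))

draws = [standardized_sum() for _ in range(2_000)]
print(abs(statistics.mean(draws)) < 0.1,
      abs(statistics.stdev(draws) - 1) < 0.1)   # True True: mean ~ 0, sd ~ 1
```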
Statistical Learning
1 Statistical learning: Given some samples from a
probability space with an unknown probability measure, we seek
to draw conclusions about the measure.
2 Supervised learning: (X, Y) is drawn from an unknown
probability measure P on a product space X × Y, and we
aim to predict Y given X, based on an i.i.d. collection of
samples from P (the training data).
Example: X = [X₁, X₂], where X₁ is the color of a banana, X₂
is the weight of the banana, and Y is a measure of deliciousness.
Values of X₁, X₂, and Y are recorded for many bananas, and they
are used to predict Y for other bananas whose X values are known.
3 We call the components of X features, predictors, or input
variables, and we call Y the response variable or output variable.
4 A supervised learning problem is a regression problem
if Y is quantitative (Y ⊂ R) and a classification problem if
Y is a set of labels.
5 To choose a prediction function h : X → Y, we specify
(i) a space H of candidate functions, and
(ii) a loss (or risk) functional L from H to R.
The target function is argmin_{h∈H} L(h).
6 If the loss functional for a regression problem is
L(h) = E[(h(X) − Y)²]
and H contains r(x) = E[Y | X = x], then r is the target
function. If the loss functional for a classification problem is
L(h) = E[1_{h(X) ≠ Y}],
and H contains G(x) = argmax_c P(Y = c | X = x), then G is
the target function.
7 Since P is unknown, we must approximate the target
function with a function ĥ whose values can be computed
from the training data. A learner is a function which takes a
set of training data and returns a prediction function ĥ.
8 The empirical probability measure on X × Y is the
measure which assigns a probability mass of 1/n to the location
of each training sample (X₁, Y₁), (X₂, Y₂), ..., (Xₙ, Yₙ). The
empirical risk of a candidate function h is the risk functional
evaluated with respect to the empirical measure of the
training data. The empirical risk minimizer (ERM) is the
function which minimizes empirical risk.
9 Generalization error (or test error) is the difference
between empirical risk and the actual value of the risk
functional.
10 The ERM can overfit, meaning that test error and L(ĥ)
are large despite small empirical risk.
Example: if H is the space of polynomials and no two training
samples have the same x values, then there are functions in H
which have zero empirical risk.
11 Mitigate overfitting with inductive bias:
(i) Use a restrictive class H of candidate functions.
(ii) Regularize: add a term to the loss functional which
penalizes complexity.
12 Inductive bias can lead to underfitting: relevant
relations are missed, so both training and test error are larger
than necessary. The tension between the costs of high
inductive bias and the costs of low inductive bias is called the
bias-complexity (or bias-variance) tradeoff.
13 No-free-lunch theorem: all learners are equal on
average (over all possible problems), so inductive bias
appropriate to a given type of problem is essential to have an
effective learner for that type of problem.
Statistical Learning: Kernel density estimation
1 Given n samples X 1 , , X n from a distribution with density f on R, we can estimate the PDF of the distribu- tion by placing 1/n units of probability mass in a small pile around each sample.
2 We choose a kernel function D for the shape of each pile:
a bump that integrates to 1 (total mass = 1).
3 The width of each pile is specified by a bandwidth λ:
D_λ(u) = (1/λ) D(u/λ).
4 The kernel density estimator with bandwidth λ is the
sum of the piles at each sample:
f̂_λ(x) = (1/n) ∑_{i=1}^n D_λ(x − Xᵢ).
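A Python sketch of this estimator with a triangular kernel D(u) = max(1 − |u|, 0) (a hand-picked kernel and a tiny sample, for illustration):

```python
def D(u):
    # triangular kernel: nonnegative and integrates to 1
    return max(1 - abs(u), 0.0)

def kde(samples, lam):
    n = len(samples)
    # f_hat(x) = (1/n) * sum over samples of (1/lam) * D((x - X_i)/lam)
    return lambda x: sum(D((x - xi) / lam) for xi in samples) / (n * lam)

f_hat = kde([0.0, 0.5, 1.0], lam=1.0)
print(round(f_hat(0.5), 4))   # 0.6667: all three piles contribute at x = 0.5
```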
5 To choose a suitable bandwidth, we seek to minimize the
integrated squared error (ISE): L(f̂) = ∫_R (f − f̂_λ)².
6 We approximate the minimizer of L with the minimizer
of the cross-validation loss estimator
J(f̂) = ∫_R f̂_λ² − (2/n) ∑_{i=1}^n f̂_λ^(−i)(Xᵢ),
where f̂_λ^(−i) is the KDE with the ith sample omitted.
7 If f is a density on R², then we use the KDE
f̂_λ(x, y) = (1/n) ∑_{i=1}^n D_λ(x − Xᵢ) D_λ(y − Yᵢ).
8 Stone's theorem says that the ratio of the CV ISE to the
optimal-λ ISE converges to 1 in probability as n → ∞. Also,
the optimal λ goes to 0 like 1/n^(1/5), and the minimal ISE
goes to 0 like 1/n^(4/5).
9 The Nadaraya-Watson nonparametric regression
estimator r̂(x) computes E[Y | X = x] with respect to the estimated
density f̂_λ. Equivalently, we average the Yᵢ's, weighted
according to horizontal distance from x:
r̂(x) = ∑_{i=1}^n D_λ(x − Xᵢ) Yᵢ / ∑_{i=1}^n D_λ(x − Xᵢ).
Statistical Learning: Parametric regression
1 Parametric regression uses a family H of candidate
functions which is indexed by finitely many parameters.
2 Linear regression uses the set of affine functions:
H = {x ↦ β₀ + [β₁, ..., β_p]·x : β₀, ..., β_p ∈ R}.
3 We choose the parameters to minimize a risk function,
customarily the residual sum of squares:
RSS(β) = ∑_{i=1}^n (yᵢ − β₀ − β·xᵢ)² = |y − Xβ|²,
where y = [y₁, ..., yₙ], β = [β₀, ..., β_p], and X is an
n × (p + 1) matrix whose ith row is a 1 followed by the
components of xᵢ.
4 The RSS minimizer is β̂ = (X′X)⁻¹X′y.
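A Python sketch of the normal equations X′Xβ = X′y for simple linear regression, on data lying exactly on y = 2x + 1 (so the recovered coefficients should be β₀ = 1, β₁ = 2); the 2 × 2 system is solved by Cramer's rule.

```python
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]              # exactly y = 2x + 1

X = [[1.0, x] for x in xs]        # design matrix: a 1, then the regressor

# Entries of X'X and X'y for the 2x2 normal equations.
a = sum(row[0] * row[0] for row in X)
b = sum(row[0] * row[1] for row in X)
d = sum(row[1] * row[1] for row in X)
u = sum(X[i][0] * ys[i] for i in range(len(ys)))
v = sum(X[i][1] * ys[i] for i in range(len(ys)))

det = a * d - b * b               # determinant of X'X
beta0 = (d * u - b * v) / det     # Cramer's rule
beta1 = (a * v - b * u) / det
print(beta0, beta1)               # 1.0 2.0
```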
5 We can use the linear regression framework to do polynomial regression, since a polynomial is linear in its coefficients: we supplement the list of regressors with products of the original regressors.
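The RSS minimizer above can be sketched directly, on hypothetical toy data (the coefficients 3, 2, −1 are illustrative assumptions):

```python
import numpy as np

# Hypothetical toy data: y depends affinely on two regressors, plus small noise
rng = np.random.default_rng(1)
x = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + 0.01 * rng.normal(size=50)

# Design matrix X: each row is a 1 followed by the components of x_i
X = np.column_stack([np.ones(len(x)), x])

# RSS minimizer beta = (X'X)^{-1} X'y; solve the normal equations
# rather than inverting, for numerical stability
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # approximately [3, 2, -1]
```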
Statistical Learning: Optimal classification
1 Consider a classification problem with feature set X and class set Y. For each y ∈ Y, we define py = P(Y = y) and let fy be the conditional PMF or PDF of X given {Y = y} (y's class conditional distribution).
2 Given a prediction function (or classifier) h and an enumeration of the elements of Y as {y1, y2, …, y|Y|}, we define the (normalized) confusion matrix of h to be the |Y| × |Y| matrix whose (i, j)th entry is P(h(X) = yi | Y = yj).
3 If Y = {−, +}, the conditional probability of correct classification given a positive sample is the detection rate (DR), while the conditional probability of incorrect classification given a negative sample is the false alarm rate (FAR).
4 The precision of a classifier is the conditional probability that a sample is positive given that the classifier predicts positive, and recall is a synonym of detection rate.
5 The Bayes classifier G(x) = argmax_y py fy(x) minimizes the misclassification probability but gives equal weight to both types of misclassification.
6 The likelihood ratio test generalizes the Bayes classifier by allowing a variable tradeoff between false alarm rate and detection rate: given t > 0, we say ht(x) = −1 if f+(x)/f−(x) < t and ht(x) = 1 otherwise.
7 The Neyman-Pearson lemma says that no classifier does better on both false alarm rate and detection rate than ht.
8 The receiver operating characteristic of ht is the curve {(FAR(ht), DR(ht)) : t ∈ [0, ∞]}. The AUROC (area under the ROC) is close to 1 for an excellent classifier and close to 1/2 for a worthless one. Neyman-Pearson says that no classifier is above the ROC. We choose a point on the ROC curve based on context-specific considerations.
Statistical Learning: Discriminant analysis
1 Quadratic discriminant analysis (QDA) is a classification algorithm which uses the training data to estimate the mean μy and covariance matrix Σy of each class conditional distribution:
μ̂y = mean({xi : yi = y})
Σ̂y = mean({(xi − μ̂y)(xi − μ̂y)′ : yi = y}).
Each distribution is assumed to be multivariate normal (N(μ̂y, Σ̂y)), and the classifier h(x) = argmax_y p̂y f̂y(x) is proposed (where {p̂y : y ∈ Y} are the class proportions from the training data).
2 Linear discriminant analysis (LDA) is the same as QDA except the class covariance matrices are assumed to be equal and are estimated using all of the data, not just class-specific samples.
3 QDA and LDA are so named because they yield class prediction boundaries which are quadric surfaces and hyperplanes, respectively.
4 A Naive Bayes classifier assumes that the features are conditionally independent given Y:
fy(x1, …, xp) = fy,1(x1) ⋯ fy,p(xp), for some densities fy,1, …, fy,p.
5 Example assumption-satisfying data sets (figure shows one scatter plot each for QDA, LDA, and Naive Bayes).
Statistical Learning: Logistic regression
1 Logistic regression for binary classification estimates r(x) = P(Y = 1 | X = x) as a logistic function of a linear function of x:
r̂(x) = σ(α + β·x), where σ(u) = 1/(1 + e^(−u)).
2 We choose α and β to minimize the cross-entropy loss
L(α, β) = −∑_{i=1}^n [yi log r̂(xi) + (1 − yi) log(1 − r̂(xi))],
which applies a large penalty if yi = 1 and r̂(xi) is close to zero or if yi = 0 and r̂(xi) is close to 1.
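A minimal sketch of fitting this model by gradient descent on the cross-entropy loss, using hypothetical 1-D data generated with true parameters α = −1, β = 2 (both illustrative assumptions):

```python
import numpy as np

def sigmoid(u):
    return 1 / (1 + np.exp(-u))

# Hypothetical data: class 1 becomes likelier as x grows
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=400)
y = (rng.uniform(size=400) < sigmoid(2 * x - 1)).astype(float)

# Minimize the cross-entropy loss by plain gradient descent on (alpha, beta)
alpha, beta, lr = 0.0, 0.0, 0.5
for _ in range(3000):
    p = sigmoid(alpha + beta * x)
    # Mean gradient of -sum[y log p + (1-y) log(1-p)]
    alpha -= lr * np.mean(p - y)
    beta -= lr * np.mean((p - y) * x)

print(alpha, beta)  # near the data-generating values -1 and 2
```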
3 L is convex, so it can be reliably minimized using numerical optimization algorithms.
Statistical Learning: Support vector machines
1 A support vector machine (SVM) chooses a hyperplane H ⊂ R^p and predicts classification (Y = {−1, +1}) based on which side of H the feature vector x lies on.
2 x ↦ sgn(β·x − α) is the prediction function.
3 We choose β and α to minimize
λ|β|² + (1/n) ∑_{i=1}^n [1 − yi(β·xi − α)]+,
where [u]+ denotes max(0, u), the positive part of u.
4 The parameters β and α encode both H and a distance, called the margin, from H to a parallel hyperplane where we begin penalizing for lack of decisively correct classification. The margin is 1/|β| (and can be adjusted without changing H by scaling β and α).
5 If λ is small, then the optimization prioritizes the correctness term and uses a small margin if necessary. If λ is large, the optimization prioritizes a large margin (small |β|), even at the cost of some incorrectness penalty. A value for λ may be chosen by cross-validation.
6 Kernelization: mapping the feature vectors to a higher-dimensional space allows us to find nonlinear separating surfaces in the original feature space.
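A sketch of minimizing the regularized hinge-loss objective by subgradient descent, on hypothetical linearly separable data (the data, λ, learning rate, and iteration count are all illustrative assumptions):

```python
import numpy as np

# Hypothetical separable data in R^2: label = side of the line x0 + x1 = 0
rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=(n, 2))
y = np.where(x[:, 0] + x[:, 1] > 0, 1.0, -1.0)

lam, lr = 0.01, 0.1
beta, alpha = np.zeros(2), 0.0
for _ in range(500):
    margins = y * (x @ beta - alpha)
    active = margins < 1   # only samples inside the margin contribute to the hinge term
    g_beta = 2 * lam * beta - (y[active, None] * x[active]).sum(axis=0) / n
    g_alpha = y[active].sum() / n
    beta -= lr * g_beta
    alpha -= lr * g_alpha

pred = np.sign(x @ beta - alpha)
accuracy = (pred == y).mean()
print(accuracy)
```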
Statistical Learning: Neural networks
1 A neural network function N : R^p → R^q is a composition of affine transformations and componentwise applications of a function K : R → R.
(i) We call K the activation function. Common choices: (a) u ↦ max(0, u) (rectifier, or ReLU); (b) u ↦ 1/(1 + e^(−u)) (logistic).
(ii) Componentwise application of K on R^t refers to the function K.(x1, …, xt) = (K(x1), …, K(xt)).
(iii) An affine transformation from R^t to R^s is a map of the form A(x) = Wx + b, where W is an s × t matrix and b ∈ R^s. Entries of W are called weights and entries of b are called biases.
(Diagram: a small network carrying an input xi ∈ R^p through alternating affine maps and activations to a final cost value.)
2 The architecture of a neural network is the sequence of
dimensions of the domains and codomains of its affine maps.
For example, a neural net with W1 ∈ R^(5×3), W2 ∈ R^(4×5), and W3 ∈ R^(1×4) has architecture [3, 5, 4, 1].
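A forward pass through a network with this [3, 5, 4, 1] architecture can be sketched directly (random weights and ReLU activation are illustrative assumptions):

```python
import numpy as np

# Architecture [3, 5, 4, 1]: weight shapes 5x3, 4x5, 1x4
rng = np.random.default_rng(4)
shapes = [(5, 3), (4, 5), (1, 4)]
Ws = [rng.normal(size=s) for s in shapes]
bs = [rng.normal(size=s[0]) for s in shapes]

relu = lambda u: np.maximum(0, u)  # componentwise activation K

def network(x):
    # Alternate affine maps A(x) = Wx + b with componentwise K;
    # no activation after the final affine map
    for W, b in zip(Ws[:-1], bs[:-1]):
        x = relu(W @ x + b)
    return Ws[-1] @ x + bs[-1]

out = network(np.array([1.0, -2.0, 3.0]))
print(out.shape)  # (1,): the network maps R^3 to R^1
```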
3 Given training samples {(xi, yi)}_{i=1}^n, we obtain a neural net regression function by minimizing L(N) = ∑_{i=1}^n C(N(xi), yi), where C(y, yi) = |y − yi|².
4 For classification, we
(i) let yi = [0, …, 0, 1, 0, …, 0] ∈ R^|Y|, with the location of the nonzero entry indicating class (this is called one-hot encoding),
(ii) replace the identity map in the diagram with the softmax function u ↦ [e^(uj) / ∑_k e^(uk)]_{j=1}^{|Y|}, and
(iii) replace the cost function with C(y, yi) = −log(y · yi).
5 When the weight matrices are large, they have many parameters to tune. We use a custom optimization scheme:
(i) Start with random weights and a training input xi.
(ii) Forward propagation: apply each successive map and store the vectors at each green or purple node. The vectors stored at the green nodes are called activations.
(iii) Backpropagation: starting with the last green node and working left, compute the change in cost per small change in the vector at each green or purple node. By the chain rule, each such gradient is equal to the gradient computed at the right-adjacent node times the derivative of the map between the two nodes. The derivative of Aj is Wj, and the derivative of the componentwise map K. at v is diag(K′(v)).
(iv) Compute the change in cost per small change in the weights and biases at each blue node. Each such gradient is equal to the gradient stored at the next purple node times the derivative of the intervening affine map. We have ∂(Wx + b)/∂b = I, and the gradient with respect to W is the outer product vx′, where v is the gradient stored at the output node.
(v) Stochastic gradient descent: repeat (ii)–(iv) for each sample in a randomly chosen subset of the training set and determine the average desired change in weights and biases to reduce the cost function. Update the weights and biases accordingly and iterate to convergence.
Statistical Learning: Dimension reduction
1 The goal of dimension reduction is to map a set of n points in R^p to a lower-dimensional space R^k while retaining as much of the data's structure as possible.
2 Dimension reduction can be used as a visualization aid
or as a feature pre-processing step in a machine learning model.
3 Structure may be taken to mean variation about the center,
in which case we use principal component analysis:
(i) Store the points' components in an n × p matrix,
(ii) de-mean each column,
(iii) compute the SVD UΣV′ of the resulting matrix, and
(iv) let W be the first k columns of V.
Then WW′ : R^p → R^p is the rank-k projection matrix which minimizes the sum of squared projection distances of the points, and W′ : R^p → R^k maps each point to its coordinates in that k-dimensional subspace (with respect to the columns of W).
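The four steps above can be sketched with numpy, on hypothetical points lying near a 2-D plane in R^3:

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical points in R^3 lying near a 2-D plane, plus small noise
A = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(200, 3))

A0 = A - A.mean(axis=0)                            # (ii) de-mean each column
U, S, Vt = np.linalg.svd(A0, full_matrices=False)  # (iii) SVD = U Sigma V'
k = 2
W = Vt[:k].T                                       # (iv) first k columns of V
coords = A0 @ W      # W' maps each point to its k-dimensional coordinates
recon = coords @ W.T # WW' is the rank-k projection
residual = np.abs(recon - A0).max()
print(residual)      # small: the data are nearly 2-dimensional
```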
(Figure: MNIST handwritten digit images projected onto their first two principal components.)
4 Structure may be taken to mean pairwise proximity of points, which stochastic neighbor embedding attempts to preserve. Given the data points x1, …, xn and a parameter ρ called the perplexity of the model, we define
Pi,j(σ) = e^(−|xi−xj|²/(2σ²)) / ∑_{k≠j} e^(−|xk−xj|²/(2σ²)),
choose each σi to match the given perplexity ρ, and set pi,j = (1/(2n))(Pi,j(σj) + Pj,i(σi)), which describes the similarity of xi and xj. Given y1, …, yn in R^k, we define analogous similarities qi,j and choose the yi's to make the q's match the p's as closely as possible.
(Figure: MNIST handwritten digit images embedded in two dimensions.)
Statistics: Point estimation
1 The central problem of statistics is to make inferences
about a population or data-generating process based on the
information in a finite sample drawn from the population.
2 Parametric estimation involves an assumption that the
distribution of the data-generating process comes from a
family of distributions parameterized by finitely many real
numbers, while nonparametric estimation does not. Examples: QDA is a parametric density estimator, while kernel density estimation is nonparametric.
3 Point estimation is the inference of a single real-valued
feature of the distribution of the data-generating process
(such as its mean, variance, or median).
4 A statistical functional is any function T from the set of distributions to [−∞, ∞]. An estimator θ̂ is a random variable defined in terms of n i.i.d. random variables, the purpose of which is to approximate some statistical functional of the random variables' common distribution. Example: Suppose that T(ν) = the mean of ν, and that θ̂ = (X1 + ⋯ + Xn)/n.
5 The empirical measure ν̂ of X1, …, Xn is the probability measure which assigns mass 1/n to each sample's location. The plug-in estimator of θ = T(ν) is obtained by applying T to the empirical measure: θ̂ = T(ν̂).
6 Given a distribution ν and a statistical functional T, let θ = T(ν). The bias of an estimator of θ is the difference between the estimator's expected value and θ. Example: The expectation of the sample mean θ̂ = (X1 + ⋯ + Xn)/n is E[(X1 + ⋯ + Xn)/n] = the mean of ν, so the bias of the sample mean is zero.
10 MSE is equal to variance plus squared bias. Therefore, MSE converges to zero as the number of samples goes to ∞ if and only if variance and bias both converge to zero.
Statistics: Confidence intervals
1 Consider an unknown probability distribution ν from which we get n independent samples X1, …, Xn, and suppose that θ is the value of some statistical functional of ν. A confidence interval for θ is an interval-valued function of the sample data X1, …, Xn. A confidence interval has confidence level 1 − α if it contains θ with probability at least 1 − α.
2 If θ̂ is unbiased, then (θ̂ − k se(θ̂), θ̂ + k se(θ̂)) is a 1 − 1/k² confidence interval, by Chebyshev's inequality.
3 If θ̂ is unbiased and approximately normally distributed, then (θ̂ − 1.96 se(θ̂), θ̂ + 1.96 se(θ̂)) is an approximate 95% confidence interval, since 95% of the mass of the standard normal distribution is in the interval [−1.96, 1.96].
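A sketch of the normal-approximation 95% interval for a mean, on hypothetical exponential data with true mean 2 (an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.exponential(scale=2.0, size=1000)  # samples from nu; true mean is 2

theta_hat = X.mean()                        # unbiased estimator of the mean
se = X.std(ddof=1) / np.sqrt(len(X))        # estimated standard error of the mean
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(ci)  # an interval that contains the true mean about 95% of the time
```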
4 Let I ⊂ R, and suppose that T is a function from the set of distributions to the set of real-valued functions on I. A 1 − α confidence band for T(ν) is a pair of random functions ymin and ymax from I to R, defined in terms of n independent samples from ν, having ymin ≤ T(ν) ≤ ymax everywhere on I with probability at least 1 − α.
Statistics: Empirical CDF convergence
1 Statistics is predicated on the idea that a distribution is well-approximated by independent samples therefrom: the empirical CDF F̂n converges to the true CDF F in probability.
2 The Dvoretzky-Kiefer-Wolfowitz inequality (DKW) says that the graph of F̂n lies in the ε-band around the graph of F with probability at least 1 − 2e^(−2nε²).
Statistics: Bootstrapping
1 Bootstrapping is the use of simulation to approximate
the value of the plug-in estimator of a statistical functional
which is expressed in terms of independent samples from ν.
Example: if θ = T(ν) is the variance of the median of 3 independent samples from ν, then the bootstrap estimate of θ is obtained as a Monte Carlo approximation of T(ν̂): we sample 3 times (with replacement) from {X1, …, Xn}, record the median, repeat B times for B large, and take the sample variance of the resulting list of B numbers.
2 The bootstrap approximation of T(ν̂) may be made as close to T(ν̂) as desired by choosing B large enough. The difference between T(ν) and T(ν̂) is likely to be small if n is large (that is, if many samples from ν are available).
3 The bootstrap is useful for computing standard errors, since the standard error of an estimator is often infeasible to compute analytically but conducive to Monte Carlo approximation.
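The median-of-3 example above can be sketched directly (standard normal data for ν is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=100)   # the observed samples from nu

# Bootstrap estimate of theta = Var(median of 3 independent samples from nu):
# sample 3 times with replacement from {X_1, ..., X_n}, record the median,
# repeat B times, and take the sample variance of the B medians
B = 20000
medians = np.median(rng.choice(X, size=(B, 3), replace=True), axis=1)
theta_boot = medians.var(ddof=1)
print(theta_boot)
```

For standard normal ν the true value is roughly 0.45, so with n = 100 the bootstrap estimate should land in that vicinity.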
Statistics: Maximum likelihood estimation
1 Maximum likelihood estimation is a general approach for proposing an estimator. Consider a parametric family {fθ(x) : θ ∈ R^d} of PDFs or PMFs. Given x ∈ R^n, the likelihood Lx : R^d → R is defined by
Lx(θ) = fθ(x1)fθ(x2) ⋯ fθ(xn).
If X is a vector of n independent samples drawn from fθ, then LX(θ) is small or zero when θ is not in accordance with the observed data.
Example: Suppose x ↦ f(x; θ) is the density of a uniform random variable on [0, θ]. We observe four samples drawn from this distribution: 1.41, 2.45, 6.12, and 4.9. Then the likelihood of θ = 5 is zero, and the likelihood of θ = 10⁶ is very small.
2 The maximum likelihood estimator is θ̂ = argmax_θ LX(θ). Example: for the family of normal distributions, the MLEs are μ̂ = X̄ and σ̂² = (1/n)((X1 − X̄)² + ⋯ + (Xn − X̄)²). So the maximum likelihood estimators agree with the plug-in estimators.
3 MLE enjoys several nice properties: under certain regularity conditions, we have (stated for θ ∈ R¹):
(i) Consistency: E[(θ̂ − θ)²] → 0 as the number of samples goes to ∞.
(ii) Asymptotic normality: (θ̂ − θ)/√(Var θ̂) converges to N(0, 1) as the number of samples goes to ∞.
(iii) Asymptotic optimality: the MSE of the MLE converges to 0 approximately as fast as the MSE of any other consistent estimator.
4 Potential difficulties with the MLE:
(i) Computational challenges. It might be hard to work out where the maximum of the likelihood occurs, either analytically or numerically.
(ii) Misspecification. The MLE may be inaccurate if the distribution of the samples is not in the specified parametric family.
(iii) Unbounded likelihood. If the likelihood function is not bounded, then θ̂ is not well-defined.
Statistics: Hypothesis testing
1 Hypothesis testing is a disciplined framework for adjudicating whether observed data do not support a given hypothesis.
2 Consider an unknown distribution from which we will observe n samples X1, …, Xn.
(i) We state a hypothesis H0, called the null hypothesis, about the distribution.
(ii) We come up with a test statistic T, which is a function of the data X1, …, Xn, for which we can evaluate the distribution of T assuming the null hypothesis.
(iii) We give an alternative hypothesis Ha under which T is expected to be significantly different from its value under H0.
(iv) We give a significance level α (like 5% or 1%), and based on Ha we determine a set of values for T, called the critical region, which T would be in with probability at most α under the null hypothesis.
(v) After setting H0, Ha, α, T, and the critical region, we run the experiment, evaluate T on the samples we get, and record the result as tobs.
(vi) If tobs falls in the critical region, we reject the null hypothesis. The corresponding p-value is defined to be the minimum α-value which would have resulted in rejecting the null hypothesis, with the critical region chosen in the same way.
Example: Muriel Bristol claims that she can tell by taste whether the tea or the milk was poured into the cup first. She is given eight cups of tea, four poured milk-first and four poured tea-first. We posit a null hypothesis that she isn't able to discern the pouring method, under which the number of cups identified correctly is 4 with probability 1/(8 choose 4) = 1/70 ≈ 1.4% and at least 3 with probability 17/70 ≈ 24%. Therefore, at the 5% significance level, only a correct identification of all the cups would give us grounds to reject the null hypothesis. The p-value in that case would be 1.4%.
3 Failure to reject the null hypothesis is not necessarily evidence for the null hypothesis. The power of a hypothesis test is the conditional probability of rejecting the null hypothesis given that the alternative hypothesis is true. A p-value may be high either because the null hypothesis is true or because the test has low power.
4 The Wald test is based on the normal approximation.
Consider a null hypothesis θ = 0 and the alternative hypothesis θ ≠ 0, and suppose that θ̂ is approximately normally distributed. The Wald test rejects the null hypothesis at the 5% significance level if |θ̂| > 1.96 se(θ̂).
5 The random permutation test is applicable when the null hypothesis is that the mean of a given random variable is equal for two populations.
(i) We compute the difference between the sample means for the two groups.
(ii) We randomly re-assign the group labels and compute the resulting sample mean differences Repeat many times.
(iii) We check where the original difference falls in the sorted list of re-sampled differences.
Example: Suppose the heights of the Romero sons are 72, 69, 68, and 66 inches, and the heights of the Larsen sons are 70, 65, and 64 inches. Consider the null hypothesis that the expected heights are the same for the two families, and the alternative hypothesis that the Romero sons are taller on average (with α = 5%). We find that the sample mean difference of about 2.4 inches is larger than 88.5% of the mean differences obtained by resampling many times. Since 88.5% < 95%, we retain the null hypothesis.
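The steps above can be sketched on the heights example. The text describes random resampling; with only 7 observations we can instead enumerate all C(7, 4) = 35 relabelings exactly, which reproduces the 88.5% figure:

```python
import itertools
import numpy as np

romero = np.array([72, 69, 68, 66.0])
larsen = np.array([70, 65, 64.0])
heights = np.concatenate([romero, larsen])

obs = romero.mean() - larsen.mean()   # observed difference, about 2.42 inches

# Enumerate every way of relabeling 4 of the 7 heights as "Romero"
diffs = []
for idx in itertools.combinations(range(7), 4):
    group = np.zeros(7, dtype=bool)
    group[list(idx)] = True
    diffs.append(heights[group].mean() - heights[~group].mean())

frac = np.mean(np.array(diffs) <= obs)
print(round(obs, 2), frac)  # 2.42 0.8857...: retain the null at the 5% level
```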
6 If we conduct many hypothesis tests, then the probability of obtaining some false rejections is high (xkcd.com/882). This is called the multiple testing problem. The Bonferroni method is to reject the null hypothesis only for those tests whose p-values are less than α divided by the number of hypothesis tests being run. This ensures that the probability of having even one false rejection is less than α, so it is very conservative.
Statistics: dplyr and ggplot2
1 dplyr is an R package for manipulating data frames. Its core verbs filter, arrange, select, mutate, group_by, and summarise respectively filter rows, sort rows, select columns, add columns, group, and aggregate the columns of a grouped data frame; for example, summarise(avgspeed = mean(speed, na.rm = TRUE)) aggregates a column.
2 ggplot2 is an R package for data visualization. Graphics are built as a sum of layers, which consist of a data frame, a geom, a stat, and a mapping from the data to the geom's aesthetics (like x, y, color, or size). The appearance of the plot can be customized with scales, coords, and themes.
Data Science Cheatsheet 2.0
Last Updated February 13, 2021
Statistics
Discrete Distributions
Binomial Bin(n, p) - number of successes in n events, each
with p probability If n = 1, this is the Bernoulli distribution
Geometric Geom(p) - number of failures before success
Negative Binomial NBin(r, p) - number of failures before r
successes
Hypergeometric HGeom(N, k, n) - number of successes in n draws, without replacement, from a size-N population containing k successes
Poisson Pois(λ) - number of successes in a fixed time interval,
where successes occur independently at an average rate λ
Continuous Distributions
Normal/Gaussian N (µ, σ), Standard Normal Z ∼ N (0, 1)
Central Limit Theorem - sample mean of i.i.d data
approaches normal distribution
Exponential Exp(λ) - time between independent events occurring at an average rate λ
Gamma Gamma(n, λ) - time until n independent events occur at an average rate λ
Hypothesis Testing
Significance Level α - probability of Type 1 error
p-value - probability of getting results at least as extreme as
the current test If p-value < α, or if test statistic > critical
value, then reject the null
Type I Error (False Positive) - null true, but reject
Type II Error (False Negative) - null false, but fail to reject
Power - probability of avoiding a Type II Error, and rejecting
the null when it is indeed false
Z-Test - tests whether population means/proportions are
different Assumes test statistic is normally distributed and is
used when n is large and variances are known If not, then use
a t-test Paired tests compare the mean at different points in
time, and two-sample tests compare means for two groups
ANOVA - analysis of variance, used to compare 3+ samples
with a single test
Chi-Square Test - checks relationship between categorical
variables (age vs income) Or, can check goodness-of-fit
between observed data and expected population distribution
Concepts
Learning
– Supervised - labeled data
– Unsupervised - unlabeled data
– Reinforcement - actions, states, and rewards
Cross Validation - estimate test error with a portion of
training data to validate accuracy and model parameters
– k-fold - divide data into k groups, and use one to validate
– leave-p-out - use p samples to validate and the rest to train
Parametric - assume data follows a function form with a
fixed number of parameters
Non-parametric - no assumptions on the data and an
unbounded number of parameters
Model Evaluation
Prediction Error = Bias² + Variance + Irreducible Noise
Bias - wrong assumptions when training → can't capture underlying patterns → underfit
Variance - sensitive to fluctuations when training → can't generalize on unseen data → overfit
Regression
Mean Squared Error (MSE) = (1/n) ∑(yi − ŷ)²
Mean Absolute Error (MAE) = (1/n) ∑|yi − ŷ|
Residual Sum of Squares = ∑(yi − ŷ)²
Total Sum of Squares = ∑(yi − ȳ)²
Classification
– Precision = TP/(TP + FP), percent correct when predicting positive
– Recall, Sensitivity = TP/(TP + FN), percent of actual positives identified correctly (True Positive Rate)
Precision-Recall Curve - focuses on the correct prediction
of class 1, useful when data or FP/FN costs are imbalanced
Linear Regression Assumptions
– Linear relationship and independent observations
– Homoscedasticity - error terms have constant variance
– Errors are uncorrelated and normally distributed
– Low multicollinearity
Regularization
Add a penalty for large coefficients to reduce overfitting
Subset (L0): λ||β̂||₀ = λ(number of non-zero variables)
– Computationally slow, need to fit 2^k models
– Alternatives: forward and backward stepwise selection
LASSO (L1): λ||β̂||₁ = λ∑|β̂|
– Coefficients shrunk to zero
Ridge (L2): λ||β̂||₂ = λ∑(β̂)²
– Reduces effects of multicollinearity
Combining LASSO and Ridge gives Elastic Net. In all cases, as λ grows, bias increases and variance decreases.
Regularization can also be applied to many other algorithms.
Logistic Regression
Predicts probability that Y belongs to a binary class (1 or 0). Fits a logistic (sigmoid) function to the data that maximizes the likelihood that the observations follow the curve. Regularization can be added in the exponent.
P(Y = 1) = 1 / (1 + e^(−(β0 + βx)))
Odds - output probability can be transformed using Odds(Y = 1) = P(Y = 1) / (1 − P(Y = 1)); for example, P = 1/3 gives 1:2 odds
Assumptions
– Linear relationship between X and log-odds of Y
– Independent observations
– Low multicollinearity
Decision Trees
Classification and Regression Tree
CART for regression minimizes SSE by splitting data into sub-regions and predicting the average value at leaf nodes. Trees are prone to high variance, so tune through CV.
Hyperparameters
– Complexity parameter, to only keep splits that improve SSE by at least cp (most influential, small cp → deep tree)
– Minimum number of samples at a leaf node
– Minimum number of samples to consider a split
CART for classification minimizes the sum of region impurity, where p̂i is the probability of a sample being in category i. Possible measures, each with a max impurity of 0.5, include Gini impurity and cross entropy.
Random Forest
Trains an ensemble of trees on bootstrapped samples drawn with replacement. Each bootstrapped sample contains ∼63% of the data, so the out-of-bag 37% can estimate prediction error without resorting to CV.
Additional Hyperparameters (no cp):
– Number of trees to build– Number of variables considered at each splitDeep trees increase accuracy, but at a high computationalcost Model bias is always equal to one of its individual trees.Variable Importance - RF ranks variables by their ability tominimize error when split upon, averaged across all trees
Aaron Wang
Naive Bayes
Classifies data using the label with the highest conditional probability, given data a and classes c. Naive because it assumes variables are independent.
Bayes' Theorem: P(ci|a) = P(a|ci) P(ci) / P(a)
Gaussian Naive Bayes - calculates conditional probability
for continuous data by assuming a normal distribution
Support Vector Machines
Separates data between two classes by maximizing the margin between the hyperplane and the nearest data points of any class. Relies on the following:
Support Vector Classifiers - account for outliers by
allowing misclassifications on the support vectors (points in or
on the margin)
Kernel Functions - solve nonlinear problems by computing
the similarity between points a, b and mapping the data to a
higher dimension Common functions:
– Polynomial: (ab + r)^d
– Radial: e^(−γ(a−b)²)
Hinge Loss - max(0, 1 − yi(wTxi− b)), where w is the margin
width, b is the offset bias, and classes are labeled ±1 Note,
even a correct prediction inside the margin gives loss > 0
k-Nearest Neighbors
Non-parametric method that calculates ˆy using the average
value or most common class of its k-nearest points For
high-dimensional data, information is lost through equidistant
vectors, so dimension reduction is often applied prior to k-NN
Minkowski Distance = (∑|ai − bi|^p)^(1/p)
– p = 1 gives Manhattan distance ∑|ai − bi|
– p = 2 gives Euclidean distance √(∑(ai − bi)²)
Hamming Distance - count of the differences between two
vectors, often used to compare categorical variables
Clustering
k-means - iteratively assigns each point to the nearest centroid and recomputes centroids until convergence; sensitive to noise and outliers
k-means++ - improves selection of initial clusters
1 Pick the first center randomly
2 Compute distance between points and the nearest center
3 Choose new center using a weighted probability distribution proportional to squared distance
4 Repeat until k centers are chosen
Evaluating the number of clusters and performance:
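The four initialization steps above can be sketched as follows (the three-blob data set is an illustrative assumption):

```python
import numpy as np

def kmeans_pp_init(points, k, rng):
    """Choose k initial centers via k-means++: weights proportional to the
    squared distance to the nearest already-chosen center."""
    centers = [points[rng.integers(len(points))]]  # 1. pick first center randomly
    for _ in range(k - 1):
        # 2. squared distance from each point to its nearest chosen center
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                      # 3. weighted distribution
        centers.append(points[rng.choice(len(points), p=probs)])
    return np.array(centers)                       # 4. stop once k are chosen

rng = np.random.default_rng(8)
# Hypothetical data: three well-separated blobs
blobs = np.concatenate(
    [rng.normal(m, 0.1, size=(50, 2)) for m in ([0, 0], [5, 5], [0, 5])])
centers = kmeans_pp_init(blobs, 3, rng)
print(centers.shape)  # (3, 2)
```

The squared-distance weighting makes it likely that the initial centers land in different blobs, which is the point of the method.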
Silhouette Value - measures how similar a data point is toits own cluster compared to other clusters, and ranges from 1(best) to -1 (worst)
Davies-Bouldin Index - ratio of within-cluster scatter to between-cluster separation, where lower values are better
Hierarchical Clustering
Clusters data into groups using a predominant hierarchy
Agglomerative Approach
1 Each observation starts in its own cluster
2 Iteratively combine the most similar cluster pairs
3 Continue until all points are in the same cluster
Divisive Approach - all points start in one cluster and splits are performed recursively down the hierarchy
Linkage Metrics - measure dissimilarity between clusters and combine them using the minimum linkage value over all pairwise points in different clusters by comparing:
– Single - the distance between the closest pair of points
– Complete - the distance between the farthest pair of points
– Ward's - the increase in within-cluster SSE if two clusters were to be combined
Dendrogram - plots the full hierarchy of clusters, where the height of a node indicates the dissimilarity between its children
Dimension Reduction
Principal Component Analysis
Projects data onto orthogonal vectors that maximize variance. Recall that given an n × n matrix A, a nonzero vector x, and a scalar λ, if Ax = λx then x and λ are an eigenvector and eigenvalue of A. In PCA, the eigenvectors are uncorrelated and represent principal components.
1 Start with the covariance matrix of standardized data
2 Calculate eigenvalues and eigenvectors using SVD or eigendecomposition
3 Rank the principal components by their proportion of variance explained = λi / ∑λ
For p-dimensional data, there will be p principal components.
Sparse PCA - constrains the number of non-zero values in each component, reducing susceptibility to noise and improving interpretability
Linear Discriminant Analysis
Maximizes separation between classes and minimizes variance within classes for a labeled dataset
1 Compute the mean and variance of each independent variable for every class
2 Calculate the within-class (σ²_w) and between-class (σ²_b) variance
3 Find the matrix W = (σ²_w)⁻¹(σ²_b) that maximizes Fisher's signal-to-noise ratio
4 Rank the discriminant components by their signal-to-noise ratio λ
Assumptions
– Independent variables are normally distributed
– Homoscedasticity - constant variance of error
– Low multicollinearity
Factor Analysis
Describes data using a linear combination of k latent factors. Given a normalized matrix X, it follows the form X = Lf + ε, with factor loadings L and hidden factors f.
Assumptions
– E(X) = E(f) = E(ε) = 0
– Cov(f) = I → uncorrelated factors
– Cov(f, ε) = 0
Since Cov(X) = Cov(Lf) + Cov(ε), then Cov(Lf) = LL′.
Scree Plot - graphs the eigenvalues of factors (or principal components) and is used to determine the number of factors to retain. The 'elbow' where values level off is often used as the cutoff.
Natural Language Processing
Transforms human language into machine-usable code
Processing Techniques
– Tokenization - splitting text into individual words (tokens)
– Lemmatization - reduces words to its base form based on
dictionary definition (am, are, is → be)
– Stemming - reduces words to its base form without context
(ended → end )
– Stop words - remove common and irrelevant words (the, is)
Markov Chain - stochastic and memoryless process that
predicts future events based only on the current state
n-gram - predicts the next term in a sequence of n terms
based on Markov chains
Bag-of-words - represents text using word frequencies,
without context or order
tf-idf - measures word importance for a document in a
collection (corpus), by multiplying the term frequency
(occurrences of a term in a document) with the inverse
document frequency (penalizes common terms across a corpus)
Cosine Similarity - measures similarity between vectors, calculated as cos(θ) = (A · B) / (||A|| ||B||), which ranges from 0 to 1
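A minimal sketch of this formula on hypothetical bag-of-words count vectors (the vocabulary and counts are illustrative assumptions):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (||A|| ||B||)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical word-count vectors over a shared 4-term vocabulary
doc1 = [2, 1, 0, 1]
doc2 = [1, 1, 0, 0]
doc3 = [0, 0, 3, 0]

print(cosine_similarity(doc1, doc2))  # high: overlapping terms
print(cosine_similarity(doc1, doc3))  # 0.0: no terms in common
```

Because count vectors are nonnegative, the result stays in [0, 1]; the same function applies unchanged to tf-idf weighted vectors.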
Word Embedding
Maps words and phrases to numerical vectors
word2vec - trains iteratively over local word context
windows, places similar words close together, and embeds
sub-relationships directly into vectors, such that
king − man + woman ≈ queen
Relies on one of the following:
– Continuous bag-of-words (CBOW) - predicts the word
given its context
– skip-gram - predicts the context given a word
GloVe - combines both global and local word co-occurrence data to learn word similarity
BERT - accounts for word order and trains on subwords, and
unlike word2vec and GloVe, BERT outputs different vectors
for different uses of words (cell phone vs blood cell)
Sentiment Analysis
Extracts the attitudes and emotions from text
Polarity - measures positive, negative, or neutral opinions
– Valence shifters - capture amplifiers or negators such as
’really fun’ or ’hardly fun’
Sentiment - measures emotional states such as happy or sad
Subject-Object Identification - classifies sentences as
either subjective or objective
Topic Modelling
Captures the underlying themes that appear in documents
Latent Dirichlet Allocation (LDA) - generates k topics by
first assigning each word to a random topic, then iteratively
updating assignments based on parameters α, the mix of topics
per document, and β, the distribution of words per topic
Latent Semantic Analysis (LSA) - identifies patterns using
tf-idf scores and reduces data to k dimensions through SVD
Neural Network
Feeds inputs through layers of weighted connections and nonlinear activation functions, such as the sigmoid 1/(1 + e^(−z)) and tanh (e^z − e^(−z))/(e^z + e^(−z)). Since a system of linear activation functions can be simplified to a single perceptron, nonlinear functions are commonly used for more accurate tuning and meaningful gradients.
Loss Function - measures prediction error using functions such as MSE for regression and binary cross-entropy for probability-based classification
Gradient Descent - minimizes the average loss by moving iteratively in the direction of steepest descent, controlled by the learning rate γ (step size). Note, γ can be updated adaptively for better performance. For neural networks, finding the best set of weights involves:
1 Initialize weights W randomly with near-zero values
2 Loop until convergence:
– Calculate the average network loss J(W)
– Backpropagation - iterate backwards from the last layer, computing the gradient ∂J(W)/∂W and updating the weight W ← W − γ ∂J(W)/∂W
3 Return the minimum loss weight matrix W
To prevent overfitting, regularization can be applied by:
– Stopping training when validation performance drops
– Dropout - randomly drop some nodes during training to prevent over-reliance on a single node
– Embedding weight penalties into the objective function
Stochastic Gradient Descent - only uses a single point to compute gradients, leading to smoother convergence and faster compute speeds. Alternatively, mini-batch gradient descent trains on small subsets of the data, striking a balance between the approaches.
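The gradient descent loop above can be sketched on the simplest possible case, a one-weight model with MSE loss (the data and learning rate are illustrative assumptions):

```python
import numpy as np

# Minimize J(w) = mean((w*x - y)^2) for a 1-parameter model y ~ w*x
rng = np.random.default_rng(9)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

w, gamma = 0.0, 0.1                        # near-zero init, learning rate
for _ in range(100):
    grad = np.mean(2 * (w * x - y) * x)    # dJ/dw
    w -= gamma * grad                      # step in direction of steepest descent
print(w)  # close to the data-generating slope 3
```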
Convolutional Neural Network
Analyzes structural or visual data by extracting local features
Convolutional Layers - iterate over windows of the image, applying weights, bias, and an activation function to create feature maps. Different weights lead to different feature maps
Pooling - downsamples convolution layers to reduce dimensionality and maintain spatial invariance, allowing detection of features even if they have shifted slightly. Common techniques return the max or average value in the pooling window
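Max pooling on a toy 4×4 feature map can be sketched as (2×2 window assumed):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample by keeping the max of each non-overlapping 2x2 window."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    th, tw = trimmed.shape
    return trimmed.reshape(th // 2, 2, tw // 2, 2).max(axis=(1, 3))

fm = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
])
print(max_pool_2x2(fm))
# [[4 2]
#  [2 8]]
```

Each output cell keeps only the strongest activation in its window, so small shifts of a feature within the window leave the output unchanged.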
The general CNN architecture is as follows:
1 Perform a series of convolution, ReLU, and pooling operations, extracting important features from the data
2 Feed output into a fully-connected layer for classification, object detection, or other structural analyses
Recurrent Neural Network
Predicts sequential data using a temporally connected system that captures both new inputs and previous outputs using hidden states
RNNs can model various input-output scenarios, such as many-to-one, one-to-many, and many-to-many. Relies on parameter (weight) sharing for efficiency. To avoid redundant calculations during backpropagation, downstream gradients are found by chaining previous gradients. However, repeatedly multiplying values greater than or less than 1 leads to:
– Exploding gradients - model instability and overflows
– Vanishing gradients - loss of learning ability
This can be solved using:
– Gradient clipping - cap the maximum value of gradients
– ReLU - its derivative prevents gradient shrinkage for x > 0
– Gated cells - regulate the flow of information
Long Short-Term Memory - learns long-term dependencies using gated cells and maintains a separate cell state from what is outputted. Gates in LSTM perform the following:
1 Forget and filter out irrelevant info from previous layers
2 Store relevant info from current input
3 Update the current cell state
4 Output the hidden state, a filtered version of the cell state
LSTMs can be stacked to improve performance
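One LSTM step can be sketched in plain numpy (random weights and hypothetical sizes; a trained network would use learned parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: forget, input, and output gates regulate the cell
    state c; the hidden state h is a filtered version of c.
    W stacks the four gate weight matrices; b stacks the four biases."""
    z = W @ np.concatenate([h_prev, x]) + b
    n = h_prev.size
    f = sigmoid(z[0:n])        # forget gate: filter out irrelevant info
    i = sigmoid(z[n:2 * n])    # input gate: store relevant new info
    g = np.tanh(z[2 * n:3 * n])  # candidate cell update
    o = sigmoid(z[3 * n:4 * n])  # output gate
    c = f * c_prev + i * g     # update the current cell state
    h = o * np.tanh(c)         # output a filtered version of the cell state
    return h, c

rng = np.random.default_rng(0)
n_hidden, n_input = 4, 3
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden + n_input))
b = np.zeros(4 * n_hidden)
h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_input)):  # run a length-5 sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)  # (4,)
```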
Aaron Wang
Boosting
Ensemble method that learns by sequentially fitting many simple models. As opposed to bagging, boosting trains on all the data and combines weak models using the learning rate α
Boosting can be applied to many machine learning problems
AdaBoost - uses sample weighting and decision ’stumps’
(one-level decision trees) to classify samples
1 Build decision stumps for every feature, choosing the one
with the best classification accuracy
2 Assign more weight to misclassified samples and reward
trees that differentiate them, where α = ½ ln((1 − TotalError)/TotalError)
3 Continue training and weighting decision stumps until
convergence
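One round of the weighting scheme can be sketched as (labels and stump predictions are made up for illustration):

```python
import numpy as np

# Hypothetical labels and one stump's predictions on 6 samples
y = np.array([1, 1, 1, -1, -1, -1])
pred = np.array([1, 1, -1, -1, -1, 1])   # two mistakes
w = np.ones(len(y)) / len(y)             # start with uniform weights

err = w[pred != y].sum()                 # weighted total error
alpha = 0.5 * np.log((1 - err) / err)    # stump's vote weight

# Upweight misclassified samples, downweight correct ones, renormalize
w = w * np.exp(-alpha * y * pred)
w = w / w.sum()
print(round(err, 3), round(alpha, 3))
```

After renormalizing, the misclassified samples carry exactly half the total weight, forcing the next stump to focus on them.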
Gradient Boost - trains sequential models by minimizing a
given loss function using gradient descent at each step
1 Start by predicting the average value of the response
2 Build a tree on the errors, constrained by depth or the
number of leaf nodes
3 Scale decision trees by a constant learning rate α
4 Continue training and weighting decision trees until
convergence
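The steps above can be sketched with regression stumps (toy sine data; the learning rate α = 0.5 and round count are chosen arbitrarily):

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (stump) to residuals r: pick the
    threshold minimizing squared error of the two leaf means."""
    best = None
    for t in x:
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda q: np.where(q <= t, lv, rv)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

alpha = 0.5                       # learning rate
pred = np.full_like(y, y.mean())  # 1. start from the average response
for _ in range(50):               # 2-4. repeatedly fit trees to the errors
    stump = fit_stump(x, y - pred)
    pred = pred + alpha * stump(x)

print(round(float(np.mean((y - pred) ** 2)), 3))
```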
XGBoost - fast gradient boosting method that utilizes
regularization and parallelization
Recommender Systems
Suggests relevant items to users by predicting ratings and
preferences, and is divided into two main types:
– Content Filtering - recommends similar items
– Collaborative Filtering - recommends what similar users like
The latter is more common, and includes methods such as:
Memory-based Approaches - finds neighborhoods by using
rating data to compute user and item similarity, measured
using correlation or cosine similarity
– User-User - similar users also liked
– Leads to more diverse recommendations, as opposed to
just recommending popular items
– Suffers from sparsity, as the number of users who rate
items is often low
– Item-Item - similar users who liked this item also liked
– Efficient when there are more users than items, since the
item neighborhoods update less frequently than users
– Similarity between items is often more reliable than
similarity between users
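The similarity computation can be sketched as (hypothetical rating vectors over five shared items):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up user rating vectors over the same five items
u1 = np.array([5.0, 3.0, 0.0, 1.0, 4.0])
u2 = np.array([4.0, 2.0, 1.0, 1.0, 5.0])
u3 = np.array([1.0, 0.0, 5.0, 4.0, 0.0])

# u1 and u2 rate similarly, so u2's other likes are good candidates for u1
print(round(cosine_sim(u1, u2), 2), round(cosine_sim(u1, u3), 2))
```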
Model-based Approaches - predict ratings of unrated items through methods such as Bayesian networks, SVD, and clustering. Handles sparse data better than memory-based approaches
– Matrix Factorization - decomposes the user-item rating
matrix into two lower-dimensional matrices representing the
users and items, each with k latent factors
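A minimal matrix-factorization sketch via gradient descent (toy 4×4 ratings, k = 2, hyperparameters picked purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy user-item ratings; 0 marks an unrated item
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)
mask = R > 0

k = 2                                    # latent factors
U = rng.normal(scale=0.1, size=(4, k))   # user factors
V = rng.normal(scale=0.1, size=(4, k))   # item factors

gamma, lam = 0.01, 0.05                  # step size, L2 penalty
for _ in range(2000):
    E = mask * (R - U @ V.T)             # error on observed ratings only
    U += gamma * (E @ V - lam * U)
    V += gamma * (E.T @ U - lam * V)

E = mask * (R - U @ V.T)
rmse = np.sqrt((E ** 2).sum() / mask.sum())
print(round(float(rmse), 2))
```

The unobserved entries of U @ V.T then serve as the predicted ratings.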
Recommender systems can also be combined through ensemble
methods to improve performance
Reinforcement Learning
Maximizes future rewards by learning through state-action pairs. That is, an agent performs actions in an environment, which updates the state and provides a reward
Multi-armed Bandit Problem - a gambler plays slot machines with unknown probability distributions and must decide the best strategy to maximize reward. This exemplifies the exploration-exploitation tradeoff, as the best long-term strategy may involve short-term sacrifices
RL is divided into two types, with the former being more common:
– Model-free - learn through trial and error in the environment
– Model-based - access to the underlying (approximate) state-reward distribution
Q-Value Q(s, a) - captures the expected discounted total future reward given a state and action
Policy - chooses the best actions for an agent at various states: π(s) = arg maxₐ Q(s, a)
Deep RL algorithms can further be divided into two main types, depending on their learning objective
Value Learning - aims to approximate Q(s, a) for all actions the agent can take, but is restricted to discrete action spaces
Can use the ε-greedy method, where ε measures the probability of exploration. If chosen, the next action is selected uniformly at random
– Q-Learning - simple value iteration model that maximizes the Q-value using a table on states and actions
– Deep Q Network - finds the best action to take by minimizing the Q-loss, the squared error between the target Q-value and the prediction
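Tabular Q-learning can be sketched on a toy 5-state chain (the environment and all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# 5-state chain; action 0 moves left, action 1 moves right.
# Reaching the rightmost state gives reward 1 and ends the episode.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
discount, lr, eps = 0.9, 0.1, 0.3

for _ in range(5000):
    s = int(rng.integers(n_states - 1))  # random non-terminal start
    for _ in range(20):
        # epsilon-greedy: explore with probability eps
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        s2 = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0
        # value-iteration style update toward r + discount * max Q(s', .)
        Q[s, a] += lr * (r + discount * Q[s2].max() - Q[s, a])
        s = s2
        if r == 1.0:
            break

# The greedy policy should move right from every non-terminal state
print(Q.argmax(axis=1)[:4])
```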
Policy Gradient Learning - directly optimizes the policy π(s) through a probability distribution of actions, without the need for a value function, allowing for continuous action spaces
Actor-Critic Model - hybrid algorithm that relies on two neural networks, an actor π(s, a, θ) which controls agent behavior and a critic Q(s, a, w) that measures how good an action is. Both run in parallel to find the optimal weights θ, w to maximize expected reward. At each step:
1 Pass the current state into the actor and critic
2 The critic evaluates the action's Q-value, and the actor updates its weight θ
3 The actor takes the next action leading to a new state, and the critic updates its weight w
Anomaly Detection
Identifies unusual patterns that differ from the majority of the data, and can be applied in supervised, unsupervised, and semi-supervised scenarios. Assumes that anomalies are:
– Rare - the minority class that occurs rarely in the data
– Different - have feature values that are very different from normal observations
Anomaly detection techniques span a wide range, including methods based on:
Statistics - relies on various statistical methods to identify outliers, such as Z-tests, boxplots, interquartile ranges, and variance comparisons
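The Z-score approach can be sketched as (synthetic data with two planted outliers, a threshold of 3 standard deviations assumed):

```python
import numpy as np

def z_score_outliers(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

rng = np.random.default_rng(0)
# 1000 normal points plus two planted extreme values
data = np.concatenate([rng.normal(size=1000), [8.0, -9.0]])
flags = z_score_outliers(data)
print(int(flags.sum()))
```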
Density - useful when data is grouped around dense neighborhoods, measured by distance. Methods include k-nearest neighbors, local outlier factor, and isolation forest
– Isolation Forest - tree-based model that labels outliers based on an anomaly score
1 Select a random feature and split value, dividing the dataset in two
2 Continue splitting randomly until every point is isolated
3 Calculate the anomaly score for each observation, based on how many iterations it took to isolate that point
4 If the anomaly score is greater than a threshold, mark it as an outlier
Intuitively, outliers are easier to isolate and should have shorter path lengths in the tree
Clusters - data points outside of clusters could potentially be marked as anomalies
Autoencoders - unsupervised neural networks that compress data and reconstruct it. The network has two parts: an encoder that embeds data to a lower dimension, and a decoder that produces a reconstruction. Autoencoders do not reconstruct the data perfectly, but rather focus on capturing important features in the data
Upon decoding, the model will accurately reconstruct normal patterns but struggle with anomalous data. The reconstruction error is used as an anomaly score to detect outliers
Autoencoders are applied to many problems, including image processing, dimension reduction, and information retrieval
MACHINE LEARNING CHEATSHEET
Summary of Machine Learning Algorithms descriptions,
Maximum Likelihood Estimation
MLE is used to find the estimators that maximize the likelihood function:
Linear Algorithms
The dimension of the hyperplane of the regression is its complexity
Learning:
Lasso adds an L1 penalty to the residual sum of squares:
Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} β_j x_ij)² + λ Σ_{j=1}^{p} |β_j| = RSS + λ Σ_{j=1}^{p} |β_j|
Ridge instead adds an L2 penalty:
Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} β_j x_ij)² + λ Σ_{j=1}^{p} β_j² = RSS + λ Σ_{j=1}^{p} β_j²
where 𝜆 ≥ 0 is a tuning parameter to be determined
Usecase examples:
- Customer scoring with probability of purchase
- Classification of loan defaults according to profile
Linear Discriminant Analysis
For multiclass classification, LDA is the preferred linear technique
The most common Stopping Criterion for splitting is a minimum number of training observations per node
The simplest form of pruning is Reduced Error Pruning:
Starting at the leaves, each node is replaced with its most popular class If the prediction accuracy is not affected, then the change is kept
Advantages:
+ Easy to interpret and no overfitting with pruning
+ Works for both regression and classification problems
+ Can take any type of variables without modifications, and
Usecase examples:
- Fraudulent transaction classification
- Predict human resource allocation in companies
Naive Bayes Classifier
Naive Bayes is a classification algorithm interested in selecting the best hypothesis given the data, assuming there is no interaction between features
Bagging and Random Forest
Random Forest is part of a bigger type of ensemble methods called Bootstrap Aggregation or Bagging. Bagging can reduce the variance of high-variance models
It uses the Bootstrap statistical procedure: estimate a quantity from a sample by creating many random subsamples with replacement, and computing the mean of each subsample
Boosting and AdaBoost