Lecture 2: Linear Regression

Regression
• Given:
  – Data X = {x^(1), ..., x^(n)}, where each x^(i) ∈ R^d
  – Corresponding labels y = {y^(1), ..., y^(n)}, where each y^(i) ∈ R
[Figure: example of regression data plotted over the years 1975–2005]
Prostate Cancer Dataset
• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features):
  – 6 continuous (4 log transforms), 1 binary, 1 ordinal
• Continuous outcome variable:
  – lpsa: log(prostate specific antigen level)
Based on slide by Jeff Howbert
Least Squares Linear Regression
• Model: h_θ(x) = θ_0 + θ_1 x_1 + ... + θ_d x_d = θ^T x
• Least squares cost function: J(θ) = 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))²
• Fit the model by choosing θ to minimize J(θ)
Intuition Behind Cost Function
[Series of figure slides illustrating the cost function J(θ) for different choices of θ; based on examples and slides by Andrew Ng]
Basic Search Procedure
• Choose an initial value for θ
• Until we reach a minimum: choose a new value for θ that reduces J(θ)
[Figures by Andrew Ng: the surface of J(θ_0, θ_1) and the sequence of points visited by the search]
• Since the least squares objective function is convex, we don't need to worry about local minima
Gradient Descent
• Update rule (simultaneous update for j = 0 ... d):
  θ_j ← θ_j − α ∂/∂θ_j J(θ)
• α is the learning rate (small), e.g., α = 0.05

For Linear Regression:
  ∂/∂θ_j J(θ) = ∂/∂θ_j [ 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))² ]
             = ∂/∂θ_j [ 1/(2n) Σ_{i=1}^n (θ^T x^(i) − y^(i))² ]
             = 1/n Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) · x_j^(i)
Gradient Descent for Linear Regression
• Initialize θ
• Repeat until convergence (simultaneous update for j = 0 ... d):
  θ_j ← θ_j − α · 1/n Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) · x_j^(i)
• To achieve the simultaneous update:
  – At the start of each GD iteration, compute h_θ(x^(i)) for every training instance
  – Use this stored value in the update step loop
• Assess convergence by checking that the change in θ, e.g., √(Σ_j (θ_j^new − θ_j^old)²), falls below a small threshold (see the sketch below)
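A minimal NumPy sketch of this procedure (not part of the original slides; the function name, learning rate, and tolerance are illustrative choices):

import numpy as np

def gradient_descent(X, y, alpha=0.05, tol=1e-6, max_iters=10000):
    """Batch gradient descent for least squares linear regression.

    Assumes X already contains a leading column of 1s (bias term)
    and y is a length-n vector of targets.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_iters):
        h = X @ theta                     # compute h_theta(x^(i)) once per iteration
        grad = (X.T @ (h - y)) / n        # 1/n * sum_i (h - y) * x_j
        theta_new = theta - alpha * grad  # simultaneous update of all theta_j
        if np.sqrt(np.sum((theta_new - theta) ** 2)) < tol:
            theta = theta_new
            break
        theta = theta_new
    return theta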
Gradient Descent
[Series of figure slides by Andrew Ng showing successive gradient descent iterations on the housing-price example, starting from the initial hypothesis h(x) = −900 − 0.1x]
• To see if gradient descent is working, print out J(θ) each iteration
  – The value should decrease on every iteration
  – An increasing value for J(θ) means the learning rate α is too large (the updates overshoot the minimum), so decrease α
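A small helper for this check (hypothetical names; assumes the cost history is recorded during training):

import numpy as np

def cost(X, y, theta):
    """Least squares cost J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    n = len(y)
    return np.sum((X @ theta - y) ** 2) / (2 * n)

def check_descent(cost_history):
    """Return True if J(theta) decreased on every recorded iteration."""
    if np.any(np.diff(cost_history) > 0):
        print("J(theta) increased at some iteration -- consider decreasing alpha")
        return False
    return True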
Extending Linear Regression to More Complex Models
• We can also run linear regression on transformed or combined inputs:
  – transformations of single features, e.g., log, exp, square root, square, etc.
  – products of features, example: x_3 = x_1 · x_2
• This allows use of linear regression techniques to fit non-linear datasets (a sketch follows below).
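An illustrative NumPy sketch of such a feature expansion (which transforms are sensible depends on the data; the column choices here are arbitrary):

import numpy as np

def expand_features(X):
    """Augment two raw features with a few example transformations."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([
        x1, x2,
        np.log(x1),    # log transform (assumes x1 > 0)
        np.sqrt(x2),   # square-root transform (assumes x2 >= 0)
        x1 ** 2,       # square
        x1 * x2,       # interaction feature: x3 = x1 * x2
    ])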
Linear Basis Function Models
• Generally, h_θ(x) = Σ_{j=0}^d θ_j φ_j(x)
• In the simplest case, we use linear basis functions: φ_j(x) = x_j
• Polynomial basis functions: φ_j(x) = x^j
  – These are global: a small change in x affects all basis functions
• Gaussian basis functions: φ_j(x) = exp( −(x − μ_j)² / (2s²) )
  – These are local: a small change in x only affects nearby basis functions; μ_j and s control location and scale (width)
• Sigmoidal basis functions: φ_j(x) = σ( (x − μ_j) / s ), where σ(a) = 1 / (1 + exp(−a))
  – These are also local: a small change in x only affects nearby basis functions; μ_j and s control location and scale (slope)
Based on slides by Christopher Bishop (PRML)
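A brief NumPy sketch of the Gaussian and sigmoidal basis functions (the centers and the scale s are user-chosen; the function names are illustrative):

import numpy as np

def gaussian_basis(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) for each center mu_j.

    x: shape (n,); centers: shape (m,); s: scalar width.
    Returns an (n, m) matrix of basis function outputs.
    """
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * s ** 2))

def sigmoidal_basis(x, centers, s):
    """phi_j(x) = sigma((x - mu_j) / s), with sigma the logistic function."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))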
Example of Fitting a Polynomial Curve with a Linear Model
Linear Basis Function Models
• Basic linear model: h_θ(x) = Σ_{j=0}^d θ_j x_j
• Generalized linear model: h_θ(x) = Σ_{j=0}^d θ_j φ_j(x)
• Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
Based on slide by Geoff Hinton
Linear Algebra Concepts
• Transpose: reflect a vector/matrix across its diagonal, so rows become columns: (A^T)_{ij} = A_{ji}
• Vector dot product: u · v = Σ_j u_j v_j
Based on slides by Joseph Bradley

Linear Algebra Concepts
• Write linear regression in matrix form by stacking the parameters and the data:
  θ = [θ_0, θ_1, ..., θ_d]^T ∈ R^{(d+1)×1}
  X = the matrix whose i-th row is [1, x_1^(i), ..., x_d^(i)], so X ∈ R^{n×(d+1)}
  y = [y^(1), y^(2), ..., y^(n)]^T
• Cost function in matrix form: J(θ) = 1/(2n) ‖Xθ − y‖²
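A short NumPy illustration of these shapes (the numbers are made up for the example):

import numpy as np

# Hypothetical raw data: n = 3 examples, d = 2 features each.
X_raw = np.array([[2.0, 3.0],
                  [1.0, 0.5],
                  [4.0, 1.0]])
y = np.array([5.0, 2.0, 6.0])

# Prepend a column of 1s so theta_0 acts as the intercept: X is n x (d+1).
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

theta = np.zeros(X.shape[1])              # theta in R^(d+1)
h = X @ theta                             # all predictions h_theta(x^(i)) at once
J = np.sum((h - y) ** 2) / (2 * len(y))   # J(theta) = 1/(2n) ||X theta - y||^2
print(X.shape, theta.shape, J)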
Closed Form Solution
• Instead of using GD, solve for the optimal θ analytically by setting the gradient of J(θ) to zero:
  θ = (X^T X)^(−1) X^T y
  with X ∈ R^{n×(d+1)} and y ∈ R^n defined as above
• If X^T X is not invertible (i.e., singular), may need to use the pseudo-inverse (or drop redundant, linearly dependent features)
• In Python: numpy.linalg.pinv(a)
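A sketch of the closed form in NumPy, assuming X already includes the leading column of 1s:

import numpy as np

def fit_closed_form(X, y):
    """theta = (X^T X)^{-1} X^T y; pinv handles the singular case gracefully."""
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

# theta = np.linalg.pinv(X) @ y is an equivalent, numerically friendlier one-liner.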
Gradient Descent vs Closed Form
• Gradient descent: requires choosing α and running many iterations, but each step is cheap and it scales to a large number of features d
• Closed form: no need to choose α or iterate, but it requires computing (X^T X)^(−1), which is expensive when d is large
Improving Learning: Feature Scaling
• Makes gradient descent converge much faster
[Figure: contour plots of J over θ_1 and θ_2 (axes 0–20) illustrating the effect of feature scaling on the descent path]
Feature Standardization
• Rescale each feature to have zero mean and unit variance:
  – Let μ_j be the mean of feature j: μ_j = 1/n Σ_{i=1}^n x_j^(i)
  – Replace each value with (x_j^(i) − μ_j) / s_j
• s_j is the standard deviation of feature j
  – the range of feature j (max − min) could also be used for s_j
• Must apply the same transformation to instances for both training and prediction
• Outliers can cause problems
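A minimal sketch in NumPy (applied to the raw features only, before adding the bias column of 1s; the function names are illustrative):

import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation on the training set."""
    mu = X_train.mean(axis=0)
    s = X_train.std(axis=0)
    s[s == 0] = 1.0        # guard against constant features
    return mu, s

def standardize(X, mu, s):
    """Apply the same (mu, s) transformation at training and prediction time."""
    return (X - mu) / s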
Quality of Fit
• Underfitting (high bias)
• Correct fit
• Overfitting (high variance):
  – The learned hypothesis may fit the training set very well ( J(θ) ≈ 0 )
  – but fails to generalize to new examples
Based on example by Andrew Ng
• Can also address overfitting by eliminating features (either manually or via model selection)
• Regularization: keep all the features, but penalize large parameter values
  J(θ) = 1/(2n) [ Σ_{i=1}^n (h_θ(x^(i)) − y^(i))² + λ Σ_{j=1}^d θ_j² ]
  – λ is the regularization parameter (λ ≥ 0)
Understanding Regularization
• Note that Σ_{j=1}^d θ_j² = ‖θ_[1:d]‖₂²
  – This is the magnitude of the feature coefficient vector!
• We can also think of the penalty λ Σ_{j=1}^d θ_j² as shrinking each coefficient θ_j toward 0 (the intercept θ_0 is not penalized); larger λ means stronger shrinkage
Regularized Linear Regression
• Gradient descent update (θ_0 is not regularized):
  θ_0 ← θ_0 − α · 1/n Σ_{i=1}^n (h_θ(x^(i)) − y^(i))
  θ_j ← θ_j − α [ 1/n Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) · x_j^(i) + (λ/n) θ_j ]   for j = 1 ... d
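One regularized update step as a NumPy sketch (illustrative function name; assumes the bias column is the first column of X):

import numpy as np

def ridge_gradient_step(X, y, theta, alpha, lam):
    """One gradient descent step for regularized linear regression."""
    n = len(y)
    grad = (X.T @ (X @ theta - y)) / n    # 1/n * sum_i (h - y) * x_j
    reg = (lam / n) * theta
    reg[0] = 0.0                          # do not regularize theta_0
    return theta - alpha * (grad + reg)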
Regularized Linear Regression
• To incorporate regularization into the closed form solution:
  θ = ( X^T X + λ · diag(0, 1, ..., 1) )^(−1) X^T y
  where diag(0, 1, ..., 1) is the (d+1)×(d+1) identity matrix with its top-left entry set to 0, so that θ_0 is not regularized
• Can derive this the same way, by solving ∂J(θ)/∂θ = 0
• Can prove that for λ > 0, the inverse in this equation always exists
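A corresponding NumPy sketch (E and the function name are illustrative):

import numpy as np

def fit_ridge_closed_form(X, y, lam):
    """theta = (X^T X + lam * E)^{-1} X^T y, with E = diag(0, 1, ..., 1)."""
    E = np.eye(X.shape[1])
    E[0, 0] = 0.0                        # leave the intercept unregularized
    return np.linalg.solve(X.T @ X + lam * E, X.T @ y)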
Logistic Regression
Classification Based on Probability
• Instead of just predicting the class, give the probability of the instance being that class
• Comparison to perceptron:
  – the perceptron outputs only a hard class label, with no probability estimate
Logistic / Sigmoid Function
• g(z) = 1 / (1 + e^(−z))
• Logistic regression model: h_θ(x) = g(θ^T x) = 1 / (1 + e^(−θ^T x))
Interpretation of Hypothesis Output
• h_θ(x) is the estimated p(y = 1 | x; θ)
• Therefore, p(y = 0 | x; θ) = 1 − p(y = 1 | x; θ)
Based on example by Andrew Ng
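In code, a minimal sketch of the hypothesis (function names are illustrative):

import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta):
    """h_theta(x) = g(theta^T x), interpreted as p(y = 1 | x; theta)."""
    return sigmoid(X @ theta)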
Another Interpretation
• Equivalently, logistic regression assumes that
  log [ p(y = 1 | x; θ) / p(y = 0 | x; θ) ] = θ_0 + θ_1 x_1 + ... + θ_d x_d
• In other words, logistic regression assumes that the log odds of y = 1 is a linear function of x
• Side note: the odds in favor of an event is the quantity p / (1 − p), where p is the probability of the event
  – E.g., if I toss a fair die, what are the odds that I will roll a 6? (1/6) / (5/6) = 1/5
Based on slide by Xiaoli Fern
• θ^T x should be a large negative value for negative instances (y = 0)
• θ^T x should be a large positive value for positive instances (y = 1)
Based on slide by Andrew Ng
Non-Linear Decision Boundary
• Can apply basis function expansion to the features, same as with linear regression
• e.g., expand x = [x_1, x_2]^T into [1, x_1, x_2, x_1 x_2, x_1², x_2², ...]^T
Logistic Regression Objective Function
• Can't just use squared loss as in linear regression: with the sigmoid hypothesis h_θ, it results in a non-convex optimization problem
Deriving the Cost Function via Maximum Likelihood Estimation
• Likelihood of the data is given by:
  L(θ) = Π_{i=1}^n p(y^(i) | x^(i); θ)
• Substitute in the model p(y = 1 | x; θ) = h_θ(x) and take the negative log to yield the logistic regression objective:
  J(θ) = − Σ_{i=1}^n [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
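A direct NumPy sketch of this objective (clipping is added to avoid log(0); names are illustrative):

import numpy as np

def logistic_cost(X, y, theta, eps=1e-12):
    """Negative log-likelihood J(theta) for 0/1 labels y."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    h = np.clip(h, eps, 1 - eps)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))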
Intuition Behind the Objective
• Can re-write the objective function as J(θ) = Σ_{i=1}^n cost( h_θ(x^(i)), y^(i) ), where
  cost( h_θ(x), y ) = − log( h_θ(x) )       if y = 1
                      − log( 1 − h_θ(x) )   if y = 0
[Figures: the per-example cost plotted against h_θ(x) ∈ [0, 1], one panel for y = 1 and one for y = 0; based on example by Andrew Ng]
Regularized Logistic Regression
• We can regularize logistic regression exactly as before:
  J_regularized(θ) = J(θ) + λ ‖θ_[1:d]‖₂²
Gradient Descent for Logistic Regression
• Initialize θ
• Repeat until convergence (simultaneously update all θ_j):
  θ_j ← θ_j − α ∂/∂θ_j J(θ),  where  ∂/∂θ_j J(θ) = Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) · x_j^(i)
• The update has the same form as for linear regression, but here h_θ(x) = 1 / (1 + e^(−θ^T x))
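A compact NumPy sketch of this loop (unregularized; function name, learning rate, and iteration count are illustrative):

import numpy as np

def fit_logistic_gd(X, y, alpha=0.01, iters=5000):
    """Gradient descent for logistic regression with 0/1 labels y."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x) for all examples
        theta -= alpha * (X.T @ (h - y))          # same form as linear regression
    return theta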
Multi-Class Classification
• Disease diagnosis: healthy / cold / flu / pneumonia
• Object classification: desk / chair / monitor / bookcase
[Figures: example multi-class data plotted over features x_1 and x_2]
Multi-Class Logistic Regression
• Train a logistic regression classifier for each class c to estimate the probability that y = c
Implementing Multi-Class Logistic Regression
• Model the probability of each class c with the softmax:
  h_c(x) = exp(θ_c^T x) / Σ_{c'=1}^C exp(θ_{c'}^T x)
• Gradient descent simultaneously updates all parameters for all models
  – Same derivative as before, just with the above h_c(x)
• Predict the class label as the most probable label: argmax_c h_c(x)
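A NumPy sketch of the softmax probabilities and the resulting prediction (Theta stores one parameter column per class; names are illustrative):

import numpy as np

def softmax_probs(X, Theta):
    """h_c(x) = exp(theta_c^T x) / sum_c' exp(theta_c'^T x).

    X: (n, d+1) design matrix; Theta: (d+1, C) parameters, one column per class.
    Returns an (n, C) matrix of class probabilities.
    """
    scores = X @ Theta
    scores -= scores.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def predict(X, Theta):
    """Predict the most probable class label for each example."""
    return np.argmax(softmax_probs(X, Theta), axis=1)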