Lecture 2: Linear Regression

Regression
• Given:
  – Data X = {x^(1), ..., x^(n)}, where each x^(i) ∈ R^d
  – Corresponding labels y = {y^(1), ..., y^(n)}, where each y^(i) ∈ R
[Figure: example of regression data plotted over the years 1975–2005]
Prostate Cancer Dataset
• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features):
  – 6 continuous (4 log transforms), 1 binary, 1 ordinal
• Continuous outcome variable:
  – lpsa: log(prostate specific antigen level)
Based on slide by Jeff Howbert
Least Squares Linear Regression
• Model: h_θ(x) = θ_0 + θ_1 x_1 + ... + θ_d x_d = θ^T x
• Least squares cost function: J(θ) = 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))²
• Fit the model by choosing θ to minimize J(θ)
Intuition Behind Cost Function
[Series of figure slides illustrating the cost function J(θ) for different choices of θ; based on examples and slides by Andrew Ng]
Basic Search Procedure
• Choose an initial value for θ
• Until we reach a minimum: choose a new value for θ that reduces J(θ)
[Figures by Andrew Ng: the surface of J(θ_0, θ_1) and the sequence of points visited by the search]
• Since the least squares objective function is convex, we don't need to worry about local minima
Gradient Descent
• Update rule (simultaneous update for j = 0 ... d):
  θ_j ← θ_j − α ∂/∂θ_j J(θ)
• α is the learning rate (small), e.g., α = 0.05

For Linear Regression:
  ∂/∂θ_j J(θ) = ∂/∂θ_j [ 1/(2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))² ]
             = ∂/∂θ_j [ 1/(2n) Σ_{i=1}^n (θ^T x^(i) − y^(i))² ]
             = 1/n Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) · x_j^(i)
Gradient Descent for Linear Regression
• Initialize θ
• Repeat until convergence (simultaneous update for j = 0 ... d):
  θ_j ← θ_j − α · 1/n Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) · x_j^(i)
• To achieve the simultaneous update:
  – At the start of each GD iteration, compute h_θ(x^(i)) for every training instance
  – Use this stored value in the update step loop
• Assess convergence by checking that the change in θ, e.g., √(Σ_j (θ_j^new − θ_j^old)²), falls below a small threshold (see the sketch below)
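A minimal NumPy sketch of this procedure (not part of the original slides; the function name, learning rate, and tolerance are illustrative choices):

import numpy as np

def gradient_descent(X, y, alpha=0.05, tol=1e-6, max_iters=10000):
    """Batch gradient descent for least squares linear regression.

    Assumes X already contains a leading column of 1s (bias term)
    and y is a length-n vector of targets.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_iters):
        h = X @ theta                     # compute h_theta(x^(i)) once per iteration
        grad = (X.T @ (h - y)) / n        # 1/n * sum_i (h - y) * x_j
        theta_new = theta - alpha * grad  # simultaneous update of all theta_j
        if np.sqrt(np.sum((theta_new - theta) ** 2)) < tol:
            theta = theta_new
            break
        theta = theta_new
    return theta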
Gradient Descent
[Series of figure slides by Andrew Ng showing successive gradient descent iterations on the housing-price example, starting from the initial hypothesis h(x) = −900 − 0.1x]
• To see if gradient descent is working, print out J(θ) each iteration
  – The value should decrease on every iteration
  – An increasing value for J(θ) means the learning rate α is too large (the updates overshoot the minimum), so decrease α
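A small helper for this check (hypothetical names; assumes the cost history is recorded during training):

import numpy as np

def cost(X, y, theta):
    """Least squares cost J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    n = len(y)
    return np.sum((X @ theta - y) ** 2) / (2 * n)

def check_descent(cost_history):
    """Return True if J(theta) decreased on every recorded iteration."""
    if np.any(np.diff(cost_history) > 0):
        print("J(theta) increased at some iteration -- consider decreasing alpha")
        return False
    return True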
Extending Linear Regression to More Complex Models
• We can also run linear regression on transformed or combined inputs:
  – transformations of single features, e.g., log, exp, square root, square, etc.
  – products of features, example: x_3 = x_1 · x_2
• This allows use of linear regression techniques to fit non-linear datasets (a sketch follows below).
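An illustrative NumPy sketch of such a feature expansion (which transforms are sensible depends on the data; the column choices here are arbitrary):

import numpy as np

def expand_features(X):
    """Augment two raw features with a few example transformations."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([
        x1, x2,
        np.log(x1),    # log transform (assumes x1 > 0)
        np.sqrt(x2),   # square-root transform (assumes x2 >= 0)
        x1 ** 2,       # square
        x1 * x2,       # interaction feature: x3 = x1 * x2
    ])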
Linear Basis Function Models
• Generally, h_θ(x) = Σ_{j=0}^d θ_j φ_j(x)
• In the simplest case, we use linear basis functions: φ_j(x) = x_j
• Polynomial basis functions: φ_j(x) = x^j
  – These are global: a small change in x affects all basis functions
• Gaussian basis functions: φ_j(x) = exp( −(x − μ_j)² / (2s²) )
  – These are local: a small change in x only affects nearby basis functions; μ_j and s control location and scale (width)
• Sigmoidal basis functions: φ_j(x) = σ( (x − μ_j) / s ), where σ(a) = 1 / (1 + exp(−a))
  – These are also local: a small change in x only affects nearby basis functions; μ_j and s control location and scale (slope)
Based on slides by Christopher Bishop (PRML)
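A brief NumPy sketch of the Gaussian and sigmoidal basis functions (the centers and the scale s are user-chosen; the function names are illustrative):

import numpy as np

def gaussian_basis(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) for each center mu_j.

    x: shape (n,); centers: shape (m,); s: scalar width.
    Returns an (n, m) matrix of basis function outputs.
    """
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * s ** 2))

def sigmoidal_basis(x, centers, s):
    """phi_j(x) = sigma((x - mu_j) / s), with sigma the logistic function."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))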
Example of Fitting a Polynomial Curve with a Linear Model
Linear Basis Function Models
• Basic linear model: h_θ(x) = Σ_{j=0}^d θ_j x_j
• Generalized linear model: h_θ(x) = Σ_{j=0}^d θ_j φ_j(x)
• Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
Based on slide by Geoff Hinton
Linear Algebra Concepts
• Transpose: reflect a vector/matrix across its diagonal, so rows become columns: (A^T)_{ij} = A_{ji}
• Vector dot product: u · v = Σ_j u_j v_j
Based on slides by Joseph Bradley

Linear Algebra Concepts
• Write linear regression in matrix form by stacking the parameters and the data:
  θ = [θ_0, θ_1, ..., θ_d]^T ∈ R^{(d+1)×1}
  X = the matrix whose i-th row is [1, x_1^(i), ..., x_d^(i)], so X ∈ R^{n×(d+1)}
  y = [y^(1), y^(2), ..., y^(n)]^T
• Cost function in matrix form: J(θ) = 1/(2n) ‖Xθ − y‖²
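A short NumPy illustration of these shapes (the numbers are made up for the example):

import numpy as np

# Hypothetical raw data: n = 3 examples, d = 2 features each.
X_raw = np.array([[2.0, 3.0],
                  [1.0, 0.5],
                  [4.0, 1.0]])
y = np.array([5.0, 2.0, 6.0])

# Prepend a column of 1s so theta_0 acts as the intercept: X is n x (d+1).
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

theta = np.zeros(X.shape[1])              # theta in R^(d+1)
h = X @ theta                             # all predictions h_theta(x^(i)) at once
J = np.sum((h - y) ** 2) / (2 * len(y))   # J(theta) = 1/(2n) ||X theta - y||^2
print(X.shape, theta.shape, J)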
Closed Form Solution
• Instead of using GD, solve for the optimal θ analytically by setting the gradient of J(θ) to zero:
  θ = (X^T X)^(−1) X^T y
  with X ∈ R^{n×(d+1)} and y ∈ R^n defined as above
• If X^T X is not invertible (i.e., singular), may need to use the pseudo-inverse (or drop redundant, linearly dependent features)
• In Python: numpy.linalg.pinv(a)
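A sketch of the closed form in NumPy, assuming X already includes the leading column of 1s:

import numpy as np

def fit_closed_form(X, y):
    """theta = (X^T X)^{-1} X^T y; pinv handles the singular case gracefully."""
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

# theta = np.linalg.pinv(X) @ y is an equivalent, numerically friendlier one-liner.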
Gradient Descent vs Closed Form
• Gradient descent: requires choosing α and running many iterations, but each step is cheap and it scales to a large number of features d
• Closed form: no need to choose α or iterate, but it requires computing (X^T X)^(−1), which is expensive when d is large
Improving Learning: Feature Scaling
• Makes gradient descent converge much faster
[Figure: contour plots of J over θ_1 and θ_2 (axes 0–20) illustrating the effect of feature scaling on the descent path]
Feature Standardization
• Rescale each feature to have zero mean and unit variance:
  – Let μ_j be the mean of feature j: μ_j = 1/n Σ_{i=1}^n x_j^(i)
  – Replace each value with (x_j^(i) − μ_j) / s_j
• s_j is the standard deviation of feature j
  – the range of feature j (max − min) could also be used for s_j
• Must apply the same transformation to instances for both training and prediction
• Outliers can cause problems
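A minimal sketch in NumPy (applied to the raw features only, before adding the bias column of 1s; the function names are illustrative):

import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation on the training set."""
    mu = X_train.mean(axis=0)
    s = X_train.std(axis=0)
    s[s == 0] = 1.0        # guard against constant features
    return mu, s

def standardize(X, mu, s):
    """Apply the same (mu, s) transformation at training and prediction time."""
    return (X - mu) / s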
Quality of Fit
• Underfitting (high bias)
• Correct fit
• Overfitting (high variance):
  – The learned hypothesis may fit the training set very well ( J(θ) ≈ 0 )
  – but fails to generalize to new examples
Based on example by Andrew Ng
• Can also address overfitting by eliminating features (either manually or via model selection)
• Regularization: keep all the features, but penalize large parameter values
  J(θ) = 1/(2n) [ Σ_{i=1}^n (h_θ(x^(i)) − y^(i))² + λ Σ_{j=1}^d θ_j² ]
  – λ is the regularization parameter (λ ≥ 0)
Understanding Regularization
• Note that Σ_{j=1}^d θ_j² = ‖θ_[1:d]‖₂²
  – This is the magnitude of the feature coefficient vector!
• We can also think of the penalty λ Σ_{j=1}^d θ_j² as shrinking each coefficient θ_j toward 0 (the intercept θ_0 is not penalized); larger λ means stronger shrinkage
Regularized Linear Regression
• Gradient descent update (θ_0 is not regularized):
  θ_0 ← θ_0 − α · 1/n Σ_{i=1}^n (h_θ(x^(i)) − y^(i))
  θ_j ← θ_j − α [ 1/n Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) · x_j^(i) + (λ/n) θ_j ]   for j = 1 ... d
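One regularized update step as a NumPy sketch (illustrative function name; assumes the bias column is the first column of X):

import numpy as np

def ridge_gradient_step(X, y, theta, alpha, lam):
    """One gradient descent step for regularized linear regression."""
    n = len(y)
    grad = (X.T @ (X @ theta - y)) / n    # 1/n * sum_i (h - y) * x_j
    reg = (lam / n) * theta
    reg[0] = 0.0                          # do not regularize theta_0
    return theta - alpha * (grad + reg)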
Regularized Linear Regression
• To incorporate regularization into the closed form solution:
  θ = ( X^T X + λ · diag(0, 1, ..., 1) )^(−1) X^T y
  where diag(0, 1, ..., 1) is the (d+1)×(d+1) identity matrix with its top-left entry set to 0, so that θ_0 is not regularized
• Can derive this the same way, by solving ∂J(θ)/∂θ = 0
• Can prove that for λ > 0, the inverse in this equation always exists
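A corresponding NumPy sketch (E and the function name are illustrative):

import numpy as np

def fit_ridge_closed_form(X, y, lam):
    """theta = (X^T X + lam * E)^{-1} X^T y, with E = diag(0, 1, ..., 1)."""
    E = np.eye(X.shape[1])
    E[0, 0] = 0.0                        # leave the intercept unregularized
    return np.linalg.solve(X.T @ X + lam * E, X.T @ y)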
Logistic Regression
Classification Based on Probability
• Instead of just predicting the class, give the probability of the instance being that class
• Comparison to perceptron:
  – the perceptron outputs only a hard class label, with no probability estimate
Logistic / Sigmoid Function
• g(z) = 1 / (1 + e^(−z))
• Logistic regression model: h_θ(x) = g(θ^T x) = 1 / (1 + e^(−θ^T x))
Interpretation of Hypothesis Output
• h_θ(x) is the estimated p(y = 1 | x; θ)
• Therefore, p(y = 0 | x; θ) = 1 − p(y = 1 | x; θ)
Based on example by Andrew Ng
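In code, a minimal sketch of the hypothesis (function names are illustrative):

import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta):
    """h_theta(x) = g(theta^T x), interpreted as p(y = 1 | x; theta)."""
    return sigmoid(X @ theta)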
Another Interpretation
• Equivalently, logistic regression assumes that
  log [ p(y = 1 | x; θ) / p(y = 0 | x; θ) ] = θ_0 + θ_1 x_1 + ... + θ_d x_d
• In other words, logistic regression assumes that the log odds of y = 1 is a linear function of x
• Side note: the odds in favor of an event is the quantity p / (1 − p), where p is the probability of the event
  – E.g., if I toss a fair die, what are the odds that I will roll a 6? (1/6) / (5/6) = 1/5
Based on slide by Xiaoli Fern
• θ^T x should be a large negative value for negative instances (y = 0)
• θ^T x should be a large positive value for positive instances (y = 1)
Based on slide by Andrew Ng
Non-Linear Decision Boundary
• Can apply basis function expansion to the features, same as with linear regression
• e.g., expand x = [x_1, x_2]^T into [1, x_1, x_2, x_1 x_2, x_1², x_2², ...]^T
Logistic Regression Objective Function
• Can't just use squared loss as in linear regression: with the sigmoid hypothesis h_θ, it results in a non-convex optimization problem
Deriving the Cost Function via Maximum Likelihood Estimation
• Likelihood of the data is given by:
  L(θ) = Π_{i=1}^n p(y^(i) | x^(i); θ)
• Substitute in the model p(y = 1 | x; θ) = h_θ(x) and take the negative log to yield the logistic regression objective:
  J(θ) = − Σ_{i=1}^n [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
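A direct NumPy sketch of this objective (clipping is added to avoid log(0); names are illustrative):

import numpy as np

def logistic_cost(X, y, theta, eps=1e-12):
    """Negative log-likelihood J(theta) for 0/1 labels y."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    h = np.clip(h, eps, 1 - eps)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))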
Intuition Behind the Objective
• Can re-write the objective function as J(θ) = Σ_{i=1}^n cost( h_θ(x^(i)), y^(i) ), where
  cost( h_θ(x), y ) = − log( h_θ(x) )       if y = 1
                      − log( 1 − h_θ(x) )   if y = 0
[Figures: the per-example cost plotted against h_θ(x) ∈ [0, 1], one panel for y = 1 and one for y = 0; based on example by Andrew Ng]
Regularized Logistic Regression
• We can regularize logistic regression exactly as before:
  J_regularized(θ) = J(θ) + λ ‖θ_[1:d]‖₂²
Gradient Descent for Logistic Regression
• Initialize θ
• Repeat until convergence (simultaneously update all θ_j):
  θ_j ← θ_j − α ∂/∂θ_j J(θ),  where  ∂/∂θ_j J(θ) = Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) · x_j^(i)
• The update has the same form as for linear regression, but here h_θ(x) = 1 / (1 + e^(−θ^T x))
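A compact NumPy sketch of this loop (unregularized; function name, learning rate, and iteration count are illustrative):

import numpy as np

def fit_logistic_gd(X, y, alpha=0.01, iters=5000):
    """Gradient descent for logistic regression with 0/1 labels y."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x) for all examples
        theta -= alpha * (X.T @ (h - y))          # same form as linear regression
    return theta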
Multi-Class Classification
• Disease diagnosis: healthy / cold / flu / pneumonia
• Object classification: desk / chair / monitor / bookcase
[Figures: example multi-class data plotted over features x_1 and x_2]
Multi-Class Logistic Regression
• Train a logistic regression classifier for each class c to estimate the probability that y = c
Implementing Multi-Class Logistic Regression
• Model the probability of each class c with the softmax:
  h_c(x) = exp(θ_c^T x) / Σ_{c'=1}^C exp(θ_{c'}^T x)
• Gradient descent simultaneously updates all parameters for all models
  – Same derivative as before, just with the above h_c(x)
• Predict the class label as the most probable label: argmax_c h_c(x)
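A NumPy sketch of the softmax probabilities and the resulting prediction (Theta stores one parameter column per class; names are illustrative):

import numpy as np

def softmax_probs(X, Theta):
    """h_c(x) = exp(theta_c^T x) / sum_c' exp(theta_c'^T x).

    X: (n, d+1) design matrix; Theta: (d+1, C) parameters, one column per class.
    Returns an (n, C) matrix of class probabilities.
    """
    scores = X @ Theta
    scores -= scores.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def predict(X, Theta):
    """Predict the most probable class label for each example."""
    return np.argmax(softmax_probs(X, Theta), axis=1)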