Principal Component Analysis (PCA)
Trịnh Tấn Đạt
Khoa CNTT – Đại Học Sài Gòn
Email: trinhtandat@sgu.edu.vn
Website: https://sites.google.com/site/ttdat88/
Introduction: dimensionality reduction and feature selection
Dimensionality Reduction
Principal Component Analysis (PCA)
Fisher’s linear discriminant analysis (LDA)
Example: Eigenface
Feature Selection
Homework
High-dimensional data often contain redundant features, which can:
reduce the accuracy of data classification algorithms
slow down the classification process
be a problem for storage and retrieval
be hard to interpret (visualize)
Why do we need dimensionality reduction?
To avoid “curse of dimensionality”
To reduce feature measurement cost
To reduce computational cost
https://en.wikipedia.org/wiki/Curse_of_dimensionality
Dimensionality reduction is one of the most popular techniques for removing noisy (i.e., irrelevant) and redundant features.
Dimensionality reduction techniques: feature extraction vs. feature selection
Feature extraction: given N features (set X), extract M new features (set Y) by a linear or non-linear combination of all N features (e.g., PCA, LDA).
Feature selection: choose the best subset of highly discriminant features of size M from the available N features (e.g., Information Gain, ReliefF, Fisher Score).
Dimensionality Reduction
Principal component analysis (PCA)
❖ Variance vs. Covariance
Variance: the variance of a random variable is a measure of its statistical dispersion; it indicates how far the values of the variable typically lie from its expected value.
Covariance: the covariance is a measure of how much two random variables vary together (as opposed to the variance, which measures the variation of a single variable).
Principal component analysis (PCA)
Mean (expected value): the "expected" value, i.e. the average value of a variable.
Standard deviation: a measure of the statistical variability of the values; it shows how far the value at each observation deviates from the mean.
Principal component analysis (PCA)
Covariance between dimensions can be represented as a matrix, e.g. for 3 dimensions (written out below):
cov(x,y) = cov(y,x), hence the matrix is symmetric about the diagonal.
N-dimensional data result in an N×N covariance matrix.
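For reference, the 3-dimensional covariance matrix written out in full is

C = \begin{pmatrix}
\operatorname{cov}(x,x) & \operatorname{cov}(x,y) & \operatorname{cov}(x,z)\\
\operatorname{cov}(y,x) & \operatorname{cov}(y,y) & \operatorname{cov}(y,z)\\
\operatorname{cov}(z,x) & \operatorname{cov}(z,y) & \operatorname{cov}(z,z)
\end{pmatrix}

where the diagonal entries cov(x,x) = var(x), etc., are the variances of the individual dimensions.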
Principal component analysis (PCA)
What is the interpretation of covariance calculations?
e.g.: 2-dimensional data
x: number of hours spent studying a subject
y: marks obtained in that subject
covariance value ≈ 104.53
What does this value mean?
→ as the number of hours studied increases, the marks obtained increase as well.
Principal component analysis (PCA)
The exact value is not as important as its sign.
A positive value of covariance indicates that both dimensions increase or decrease together (e.g. as the number of hours studied increases, the marks in that subject increase).
A negative value indicates that while one increases the other decreases, or vice versa (e.g. an active social life vs. performance in class).
If the covariance is zero, the two dimensions are uncorrelated, i.e. they show no linear relationship with each other (e.g. heights of students vs. the marks obtained in a subject).
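A minimal NumPy sketch (with made-up numbers, not taken from the slides) showing how the sign of the covariance reflects these cases:

import numpy as np

# Hypothetical data: hours studied vs. marks obtained (a positive relationship).
hours = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
marks = np.array([50.0, 62.0, 65.0, 81.0, 90.0])

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y).
print(np.cov(hours, marks)[0, 1])    # positive: the two variables increase together

# Hypothetical data with a negative relationship (e.g. hours of social life vs. marks).
social = np.array([10.0, 8.0, 6.0, 4.0, 2.0])
print(np.cov(social, marks)[0, 1])   # negative: one increases while the other decreases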
Principal component analysis (PCA)
Principal component analysis (PCA) is a method for simplifying a dataset, for example by reducing the number of dimensions of the data.
"It is a linear transformation that chooses a new coordinate system for the data set such that
the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component),
the second greatest variance on the second axis,
and so on."
PCA can be used to reduce dimensionality by discarding the less important principal components.
Principal component analysis (PCA)
Example:
When performing multivariate analyses in which the variables are correlated with one another, the correlation causes many difficulties.
This correlation can be removed by rotating the axes (changing the basis).
On the new axes the correlation in the data is greatly reduced (the variables Y1 and Y2 are nearly uncorrelated).
The variation of the data depends mostly on Y1, so we can reduce the number of dimensions without losing too much of the "variance" of the data.
Principal component analysis (PCA)
Note:
PCA helps reduce the dimensionality of the data.
Instead of keeping the coordinate axes of the old space, PCA constructs a new space with fewer dimensions that can nonetheless represent the data about as well as the old space.
Principal component analysis (PCA)
Example:
Uncovering hidden relationships by changing the coordinate system: different views of the same data.
Principal component analysis (PCA)
Example:
Notice that "the maximum variance" and "the minimum error" are reached at the same time, namely when the line points to the magenta ticks
Principal component analysis (PCA)
How do we find the optimal linear transformation A (where y = Ax)?
1. Place the origin of the PCA coordinate system at the mean of the samples.
2. Maximize the variance of the projected data.
3. Equivalently, minimize the projection (reconstruction) error min ‖x − x̃‖, where x̃ is the reconstruction of x from its projection.
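For a single unit-length projection direction w these two views coincide; in the standard formulation (stated here for reference, with x̄ the sample mean and S the sample covariance matrix):

\mathbf{w}_1 \;=\; \arg\max_{\|\mathbf{w}\|=1} \mathbf{w}^{T}\mathbf{S}\,\mathbf{w}
\;=\; \arg\min_{\|\mathbf{w}\|=1} \frac{1}{N}\sum_{n=1}^{N}
\bigl\|\,\mathbf{x}_n-\bar{\mathbf{x}}-\mathbf{w}\mathbf{w}^{T}(\mathbf{x}_n-\bar{\mathbf{x}})\,\bigr\|^{2}

i.e. maximizing the projected variance is equivalent to minimizing the mean squared reconstruction error.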
Principal component analysis (PCA)
https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.12-Example-Principal-Components-Analysis/
Note:
The eigenvectors of the covariance matrix define a new coordinate system
The eigenvector with the largest eigenvalue captures the most variation among the training vectors x;
the eigenvector with the smallest eigenvalue captures the least variation.
The eigenvectors are known as principal components
https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.12-Example-Principal-Components-Analysis/
Principal component analysis (PCA)
Algorithm: 7 steps
Input: N input samples
1. Compute the mean vector of the whole dataset:
2. Subtract the mean vector from every data point:
Note: subtracting the mean is equivalent to translating the coordinate system to the location of the mean
Principal component analysis (PCA)
3. Compute the covariance matrix.
4. Perform the eigendecomposition: find the eigenvectors and eigenvalues of S and sort them in decreasing order of the eigenvalues.
5. Choose the K eigenvectors with the K largest eigenvalues to build the projection matrix U_K, whose columns form an orthonormal set.
These K vectors, also called the principal components, span a subspace that lies close to the distribution of the normalized original data.
Principal component analysis (PCA)
6. Project the normalized original data onto the subspace found above.
7. The new data are the coordinates of the data points in the new space.
The original data can be approximately reconstructed from the new data as follows:
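In symbols, the reconstruction is x ≈ x̄ + U_K U_K^T (x − x̄). A compact NumPy sketch of these seven steps, including the approximate reconstruction at the end (function and variable names are illustrative, not from the slides):

import numpy as np

def pca(X, K):
    """PCA on a data matrix X of shape (N, D): N samples, D features."""
    # Steps 1-2: mean vector of the whole dataset, then center every data point.
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    # Step 3: covariance matrix (D x D).
    S = np.cov(Xc, rowvar=False)
    # Step 4: eigendecomposition, sorted by decreasing eigenvalue (eigh: S is symmetric).
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: projection matrix U_K built from the K leading eigenvectors (orthonormal columns).
    U_K = eigvecs[:, :K]
    # Steps 6-7: project the centered data; Z holds the coordinates in the new space.
    Z = Xc @ U_K                       # shape (N, K)
    # Approximate reconstruction of the original data from the new coordinates.
    X_approx = Z @ U_K.T + x_bar
    return Z, U_K, x_bar, X_approx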
Principal component analysis (PCA)
Example:
Principal component analysis (PCA)
Problem:
In the first case, the data of the two classes remain well separated after projection onto the first principal eigenvector.
In the second case, the data of the two classes are not well separated after projection onto the first principal eigenvector, because the means of the two classes are too close.
Principal component analysis (PCA)
Disadvantage of PCA
Principal Component Analysis:
• higher variance
• bad for discriminability
Fisher Linear Discriminant (Linear Discriminant Analysis):
• smaller variance
• good discriminability
Fisher’s linear discriminant analysis
The objective of LDA is to perform dimensionality reduction while preserving as much of the class-discriminatory information as possible.
Assume we have a set of l-dimensional samples {x_1, x_2, ..., x_N}, N_1 of which belong to class ω_1 and N_2 to class ω_2. We seek to obtain a scalar y by projecting the samples x onto a line:

y = \mathbf{w}^{T}\mathbf{x}
Fisher’s linear discriminant analysis
❖ Case: TWO CLASSES
Of all the possible lines we would like to select the one that maximizes the separability of the projected scalars. This is illustrated for the two-dimensional case in the following figures.
One line fails: the two classes are not well separated when projected onto it.
The other line succeeds in separating the two classes while reducing the dimensionality of our problem from two features (x1, x2) to a single scalar value y.
Fisher’s linear discriminant analysis
The Fisher linear discriminant is defined as the linear function y = \mathbf{w}^{T}\mathbf{x} that maximizes the criterion function

FDR(\mathbf{w}) = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}

where \mu_1, \mu_2 are the mean values of y, and \sigma_1^2, \sigma_2^2 are the variances of y in the two classes, respectively.
LDA two classes
In order to find the optimum projection w*, we need to express FDR as an explicit function of w.
For two equiprobable classes the within-class scatter matrix is

S_W = \tfrac{1}{2}(S_1 + S_2)

➢ S_W is known as the within-class scatter matrix, with S_1, S_2 being the respective covariance matrices.
Now, the scatter of the projection y can be expressed as a function of the scatter matrix in feature space x:

\sigma_1^2 + \sigma_2^2 = \mathbf{w}^{T}(S_1 + S_2)\,\mathbf{w} = 2\,\mathbf{w}^{T} S_W\, \mathbf{w}
LDA two classes
➢ S_b is known as the between-class scatter matrix:

S_b = \tfrac{1}{2}\bigl[(\mathbf{m}_1-\mathbf{m}_0)(\mathbf{m}_1-\mathbf{m}_0)^{T} + (\mathbf{m}_2-\mathbf{m}_0)(\mathbf{m}_2-\mathbf{m}_0)^{T}\bigr]

➢ m_0 is the overall mean of the data x in the original l-dimensional space, and m_1, m_2 are the mean values in the two classes, respectively.
Similarly, the difference between the projected means (in y-space) can be expressed in terms of the means in the original feature space (x-space):

(\mu_1 - \mu_2)^2 \propto \mathbf{w}^{T} S_b\, \mathbf{w}
LDA two classes
We can finally express the Fisher criterion in terms of S_W and S_b as:

FDR(\mathbf{w}) = \frac{\mathbf{w}^{T} S_b\, \mathbf{w}}{\mathbf{w}^{T} S_W\, \mathbf{w}}
LDA two classes
The optimal projection maximizes this criterion:

\mathbf{w}^{*} = \arg\max_{\mathbf{w}} FDR(\mathbf{w}) = \arg\max_{\mathbf{w}} \frac{\mathbf{w}^{T} S_b\,\mathbf{w}}{\mathbf{w}^{T} S_W\,\mathbf{w}}

First method (eigendecomposition): w* is the eigenvector of S_W^{-1} S_b associated with the largest eigenvalue, i.e. it solves

S_W^{-1} S_b\, \mathbf{w} = \lambda\, \mathbf{w}, \qquad \lambda = FDR(\mathbf{w}) \text{ (a scalar)}

Second method: set the derivative of the criterion with respect to w to zero,

\frac{d}{d\mathbf{w}}\left(\frac{\mathbf{w}^{T} S_b\,\mathbf{w}}{\mathbf{w}^{T} S_W\,\mathbf{w}}\right) = 0,

which leads to the same condition S_W^{-1} S_b \mathbf{w} = \lambda \mathbf{w}.
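For the two-class case, S_b w always points in the direction of (m_1 − m_2), so the solution can be taken as w* ∝ S_W^{-1}(m_1 − m_2). A minimal NumPy sketch under that observation (names are illustrative):

import numpy as np

def fisher_lda_two_class(X1, X2):
    """X1: (N1, l) samples of class 1, X2: (N2, l) samples of class 2."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter for two equiprobable classes: S_W = (S1 + S2) / 2.
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    Sw = 0.5 * (S1 + S2)
    # Optimal direction (up to scale): w* proportional to Sw^{-1} (m1 - m2).
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)

# Projection to a scalar for each sample: y = w^T x, e.g. y1 = X1 @ w, y2 = X2 @ w.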
Fisher’s linear discriminant analysis
Case: C-classes
PCA vs LDA
Linear Discriminant Analysis (LDA)
Improvements of LDA
Probabilistic LDA
Kernel Fisher Discriminant Analysis
It is worth noting that a face image of size about 200 × 200 already has dimensionality 40k, which is a very large number, whereas a feature vector typically has only a few hundred dimensions.
The eigenfaces are precisely the eigenvectors corresponding to the largest eigenvalues of the covariance matrix.
Face Identification using PCA
An image is a point in a high dimensional space
An N x M image is a point in R^(N×M)
We can define vectors in this space as we did in the 2D case
Eigenface: Training
Get M training samples (face images) with variations.
All images are of the same size.
Eigenface: Training
Step 0: Convert all the images into vector form.
Step 1: Calculate the mean (Average Face)
Eigenface: Training
Step 2: Normalize vectors
Step 3: Form the covariance matrix
Eigenface: Training
Step 4: Calculate the eigenvectors of the covariance matrix.
Step 5: Choose the k eigenvectors corresponding to the k largest eigenvalues to construct a projection matrix.
The covariance matrix has a very high dimension, so in practice we compute the eigenvectors of the much smaller matrix AᵀA (where A is the matrix whose columns are the mean-subtracted image vectors) and map them back to eigenvectors of the covariance matrix A Aᵀ.
Eigenface: Training
Steps 6-7: Project the training images into face space.
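A NumPy sketch of training steps 0–7, using the small-matrix trick above (eigenvectors of AᵀA instead of the huge covariance A Aᵀ); the function signature and names are assumptions for illustration:

import numpy as np

def eigenface_train(images, k):
    """images: array of shape (M, H, W) holding M equally sized face images."""
    M = images.shape[0]
    # Step 0: convert every image into a column vector (D = H * W).
    X = images.reshape(M, -1).astype(float).T           # shape (D, M)
    # Step 1: mean (average) face.
    mean_face = X.mean(axis=1, keepdims=True)
    # Step 2: normalize the vectors by subtracting the mean face.
    A = X - mean_face                                    # shape (D, M)
    # Steps 3-5: the covariance matrix A A^T is D x D (huge), so take the
    # eigenvectors of the small M x M matrix A^T A and map them back.
    L = A.T @ A
    eigvals, V = np.linalg.eigh(L)
    order = np.argsort(eigvals)[::-1]
    V = V[:, order]
    U = A @ V                                            # eigenvectors of A A^T (the eigenfaces)
    U /= np.linalg.norm(U, axis=0, keepdims=True) + 1e-12
    U_k = U[:, :k]                                       # keep the k largest-eigenvalue eigenfaces
    # Steps 6-7: project the training faces into face space.
    W_train = U_k.T @ A                                  # shape (k, M)
    return mean_face, U_k, W_train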
Eigenface: Test
Take a query image t.
Subtract the mean image: y = t − mean face.
Project y into the eigenface space and compute its projection coefficients.
Compare this projection with all the training projections.
Simple comparison metric: Euclidean distance.
Classifier: multiclass logistic regression, SVM, …
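A matching sketch of the test phase, reusing mean_face, U_k and W_train from the training sketch above (again, names are illustrative):

import numpy as np

def eigenface_project(t, mean_face, U_k):
    """Project a query image t (same size as the training images) into face space."""
    y = t.reshape(-1, 1).astype(float) - mean_face       # subtract the mean face
    return U_k.T @ y                                      # projection coefficients

def identify(t, mean_face, U_k, W_train):
    """Index of the closest training face, using Euclidean distance in face space."""
    w = eigenface_project(t, mean_face, U_k)
    dists = np.linalg.norm(W_train - w, axis=0)
    return int(np.argmin(dists))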
Eigenface
Pros
Ease of implementation
No knowledge of geometry or specific features of the face required
Cons
Sensitive to head scale
Applicable only to frontal views
Good performance only under controlled background (not including natural scenes)
State-of-the-art face recognition performance
OpenFace
FaceNet embedding
InsightFace
Feature Selection
Introduction: Feature Selection
Need for a small number of discriminative features
To avoid “curse of dimensionality”
To reduce feature measurement cost
To reduce computational burden
Feature selection algorithms can be categorized as:
Supervised: filter models, wrapper models, and embedded models
Unsupervised
Semi-supervised
Feature Selection for Classification
Feature selection for classification attempts to select the minimally sized subset of features according to the following criteria:
the classification accuracy does not significantly decrease; and
the resulting class distribution, given only the values for the selected features,
is as close as possible to the original class distribution, given all features
Feature selection is an optimization problem.
Search the space of possible feature subsets.
Pick the subset that is optimal or near-optimal with respect to a certain criterion.
Search strategies: heuristic, randomized, ...
Evaluation strategies: wrapper methods, ...
Introduction
Algorithms for Flat Features
1. Filter Models
Filter models evaluate features without utilizing any classification algorithm.
The objective function evaluates feature subsets by their information content, typically inter-class distance, statistical dependence, or information-theoretic measures.
Performance criteria: Fisher score, mutual information, ReliefF.
Algorithms for Flat Features
❖ Mutual Information-based Methods
Information gain is used to measure the dependence between features and labels.
It calculates the information gain between the i-th feature f_i and the class labels C (a standard form is given below).
A feature is relevant if it has a high information gain.
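In the standard entropy-based form (stated here for reference, with H denoting Shannon entropy):

IG(f_i, C) \;=\; H(f_i) - H(f_i \mid C) \;=\; H(C) - H(C \mid f_i)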
Algorithms for Flat Features
2. Wrapper Models
Wrapper models utilize a specific classifier to evaluate the quality of the selected features, and offer a simple and powerful way to address the problem of feature selection, regardless of the chosen learning machine.
A typical wrapper model performs the following steps (a sketch follows this list):
Step 1: search for a subset of features;
Step 2: evaluate the selected subset of features by the performance of the classifier;
Step 3: repeat Step 1 and Step 2 until the desired quality is reached.
Search strategies: hill-climbing, best-first, GA, …
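A minimal greedy (hill-climbing) sketch of this loop, using scikit-learn's cross_val_score as the evaluator; the function name and interface are illustrative assumptions:

import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, clf, n_features):
    """Greedy wrapper: add one feature at a time, keeping the best-scoring subset."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        scores = []
        for f in remaining:                       # Step 1: search - try adding each remaining feature.
            subset = selected + [f]
            # Step 2: evaluate the subset by the classifier's cross-validated accuracy.
            score = cross_val_score(clf, X[:, subset], y, cv=5).mean()
            scores.append((score, f))
        best_score, best_f = max(scores)          # keep the best candidate
        selected.append(best_f)
        remaining.remove(best_f)                  # Step 3: repeat until the desired size is reached.
    return selected

# Example use (illustrative): forward_selection(X, y, LogisticRegression(max_iter=1000), 5)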
Algorithms for Flat Features
3. Embedded Models
Embedded models embed feature selection within classifier construction, and thus have the advantages of (1) wrapper models, since they include the interaction with the classification model, and (2) filter models, since they are far less computationally intensive than wrapper methods.
Three types of embedded methods:
Pruning methods first utilize all features to train a model and then attempt to eliminate some features by setting the corresponding coefficients to 0 while maintaining model performance, such as recursive feature elimination using a support vector machine.
The second type are models with a built-in mechanism for feature selection, such as ID3 and C4.5.
The third type are regularization models, whose objective functions minimize fitting errors and at the same time force the coefficients to be small or exactly zero. Features with coefficients close to 0 are then eliminated. Methods: Lasso regularization, adaptive Lasso, bridge regularization, elastic net regularization.
Algorithms for Flat Features
❖ Regularization methods
Classifier induction and feature selection are achieved simultaneously by estimating w with properly tuned penalties.
Feature selection is achieved because only the features with nonzero coefficients in w are used in the classifier.
Lasso regularization: Lasso regularization is based on the l1-norm of the coefficient vector w, i.e. the fitting error is minimized together with the penalty λ‖w‖₁.
l1 regularization can generate an estimation of w with exactly zero coefficients. In other words, there are zero entries in w, which means that the corresponding features are eliminated during the classifier learning process. Therefore, it can be used for feature selection.
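A minimal sketch of Lasso-based feature selection with scikit-learn on synthetic data (the value of alpha and the data are purely illustrative):

import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first three of twenty features are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=100)

# l1-penalized least squares; a larger alpha drives more coefficients to exactly zero.
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Features whose coefficients are (essentially) zero are eliminated.
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-8)
print("selected feature indices:", selected)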