Bài giảng Khai phá dữ liệu (Data mining) Dimensionality reduction and feature selection

Principal component analysis PCA Mean expected value: giá trị “mong muốn”, biểu diễn giá trị trung bình của một biến.. Principal component analysis PCA Representing Covariance between

Trang 1

Trịnh Tấn Đạt

Khoa CNTT – Đại Học Sài Gòn

Email: trinhtandat@sgu.edu.vn

Website: https://sites.google.com/site/ttdat88/

Trang 2

 Introduction: dimensionality reduction and feature selection

 Dimensionality Reduction

 Principal Component Analysis (PCA)

 Fisher’s linear discriminant analysis (LDA)

 Example: Eigenface

 Feature Selection

 Homework

Trang 3

 High-dimensional data often contain redundant features

 reduce the accuracy of data classification algorithms

 slow down the classification process

 be a problem in storage and retrieval

 hard to interpret (visualize)

 Why we need dimensionality reduction???

 To avoid “curse of dimensionality”

 To reduce feature measurement cost

 To reduce computational cost

https://en.wikipedia.org/wiki/Curse_of_dimensionality

Trang 5

 Dimensionality reduction is one of the most popular techniques to remove noisy (i.e., irrelevant) and redundant features.

 Dimensionality reduction techniques: feature extraction v.s feature selection

 feature extraction: given N features (set X), extract M new features (set Y) by linear or linear combination of all the N features (i.e PCA, LDA)

non- feature selection: choose a best subset of highly discriminant features of size M from the

available N features (i.e Information Gain, ReliefF, Fisher Score)

Trang 6

Dimensionality Reduction

Trang 7

Principal component analysis (PCA)

❖ Variance v.s Covariance

 Variance : phương sai của một biến ngẫu nhiên là thước đo sự phân tán thống kê của

biến đó, nó hàm ý các giá trị của biến đó thường ở cách giá trị kỳ vọng bao xa

 Covariance: hiệp phương sai là độ đo sự biến thiên cùng nhau của hai biến ngẫu nhiên(phân biệt với phương sai - đo mức độ biến thiên của một biến)

Trang 8

 Mean (expected value): giá trị “mong muốn”,

biểu diễn giá trị trung bình của một biến

 Standard Deviation: Độ lệch chuẩn đo tính

biến động của giá trị mang tính thống kê Nó

cho thấy sự chênh lệch về giá trị của từng thời

điểm đánh giá so với giá trị trung bình

Trang 9

 Representing Covariance between dimensions as a matrix e.g for 3 dimensions:

 cov(x,y) = cov(y,x) hence matrix is symmetrical about the diagonal

 N-dimensional data will result in NxN covariance matrix

Trang 10

 What is the interpretation of covariance calculations?

e.g.: dữ liệu 2 chiều

x: số lượng giờ học một môn học

y: điểm số của một môn học

covariance value ~ 104.53

what does this value mean?

-> số lượng giờ học tăng  , điểm số 

Trang 11

 Exact value is not as important as it’s sign.

 A positive value of covariance indicates both dimensions increase or

decrease together (e.g as the number of hours studied increases, the marks

in that subject increase.)

 A negative value indicates while one increases the other decreases, or

vice-versa (e.g active social life v.s performance in class.)

 If covariance is zero: the two dimensions are independent of each other

(e.g heights of students vs the marks obtained in a subject.)

Trang 12

Trang 13

 Principal components analysis (PCA) là một phương pháp để đơn giản hóa một tập dữ liệu (simplify a dataset) , chằng hạn giảm số chiều của dữ liệu

“It is a linear transformation that chooses a new coordinate system for the data set such that

 the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component),

 the second greatest variance on the second axis

 and so on ”

 PCA có thể được dùng để giảm số chiều bằng cách loại bỏ những thành phần chính không quan trọng.

Trang 14

 Ví dụ:

khi thực hiện các phân tích đa biến mà

trong đó các biến có tương quan với nhau

gây nhiều khó khăn

loại bỏ sự tương quan này bằng cách xoay trục (cơ sở)

dữ liệu trên trục mới đã giảm

sự tương quan đáng kể (biến Y1 và Y2 gần như không tương quan)

sự thay đổi của dữ liệu phụ thuộc phần lớn vào biến Y1

giảm số chiều dữ liệu mà không làm giàm quá nhiều

“phương sai” của dữ liệu

Trang 15

 Note:

 Giúp giảm số chiều của dữ liệu;

 Thay vì giữ lại các trục tọa độ của không gian cũ, PCA xây dựng một không gian mới ít chiều hơn, nhưng lại có khả năng biểu diễn dữ liệu tốt tương đương không gian cũ

Trang 16

 Ví dụ:

Khám phá liên kết tiềm ẩn nhờ đổi hệ trục tọa độ, cách nhìn khác nhau về cùng một dữ liệu

Trang 17

 Ví dụ:

Notice that "the maximum variance" and "the minimum error" are reached at the same time, namely when the line points to the magenta ticks

Trang 18

 How to find the optimal linear transformation A ( where y = Ax)

-1 Origin of PCA coordinate  mean of samples

-2 Maximize projected variance

-3 Minimize projection cost min x y −

Trang 19

Analysis/

Trang 20

https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.12-Example-Principal-Components-Principal component analysis (PCA)

 Note:

 The eigenvectors of the covariance matrix define a new coordinate system

 Eigenvector with largest eigenvalue captures the most variation amongtraining vectors x

 eigenvector with smallest eigenvalue has least variation

 The eigenvectors are known as principal components

https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.12-Example-Principal-Components-Analysis/

Trang 21

 Algorithm: 7 steps

 Input : N mẫu input

1 Tính vector trung bình của toàn bộ dữ liệu:

2 Trừ mỗi điểm dữ liệu đi vector trung bình của toàn bộ dữ liệu:

Note: subtracting the mean is equivalent to translating the coordinate system to the location of the mean

Trang 22

3 Tính covaricance matrix

4 Perform the eigendecomposition : tìm eigenvectors và eigenvalues của S và sắp xếp

theo thứ từ giảm dần của eigenvalues

5 Chọn K eigenvectors với K trị riêng lớn nhất để xây dựng ma trận UK (projection

matrix) có các cột tạo thành một hệ trực giao

K vectors này, còn được gọi là các thành phần chính, tạo thành một không giancon gần với phân bố của dữ liệu ban đầu đã chuẩn hoá.

Trang 23

6 Chiếu dữ liệu ban đầu đã chuẩn hoá xuống không gian con tìm được.

7 Dữ liệu mới chính là toạ độ của các điểm dữ liệu trên không gian mới.

Dữ liệu ban đầu có thể tính được xấp xỉ theo dữ liệu mới như sau:

Trang 24

PCA

Trang 25

Trang 31

 Ví dụ:

Trang 32

 Problem:

In the case of the data in the two classes, after projection

along the first principal eigenvector, remain well separated

In the case of the data in the two classes, after projection along the first principal eigenvector, is not well separated because the mean of the two classes are too close

Trang 33

 Disadvantage of PCA Principal Component Analysis

•higher variance

•bad for discriminability

Fisher Linear Discriminant Linear Discriminant Analysis

•smaller variance

•good discriminability

Trang 34

Fisher’s linear discriminant analysis

 The objective of LDA is to perform dimensionality reduction while preserving asmuch of the class discriminatory information as possible

 Assume we have a set of l-dimensional samples , N1 of whichbelong to class 1 , and N2 to class 2 We seek to obtain a scalar y by projecting the samples x onto a line:

} x

, ,

x , {x 1 2 N



x w

Trang 35

❖ Case: TWO CLASSES

 Of all the possible lines we would like to select the one that maximizes the separability of the scalars This is illustrated for the two-dimensional case in the following figures

The two classes are not well separated when projected onto this line

This line succeeded in separating the two classes and in the meantime reducing the dimensionality

of our problem from two features (x1,x2) to only a

scalar value y.

Trang 36

 The Fisher linear discriminant is defined as the linear function that maximizes the criterion function:

x

wT

2 2

2 1

2 2

where  1,  2 are the mean values of y,

and 1 ,  2 are the variances of y in the

two classes, respectively

Trang 37

LDA two classes

 In order to find the optimum projection w*, we need to express FDR as an

explicit function of w.

 For two equiprobable classes

➢ SW is known as the within-class scatter matrix, with S1 , S2 being the respective covariance matrices.

 Now, the scatter of the projection y can then be expressed as a function of the scatter matrix in feature space x.

)

(2

1

2 1

T i i

( 1

w T

w

2 2

2

1 +  =



Trang 38

LDA two classes

 Sb is known as the between-class scatter matrix

➢ m0 is the overall mean of the data x in the original l-dimensional space and m1 , m2 are the mean values in the two classes, respectively

 Similarly, the difference between the projected means (in y-space) can be

expressed in terms of the means in the original feature space (x-space)

T

m m

2

1 )

)(

( 2

1

0 2

0 1

w w

） -

（ 1 2 2 = T Sb

Trang 39

LDA two classes

 We can finally express the Fisher criterion in terms of SW and Sb as:

w S

w

S FDR

w T

b T

w

） -

（

2 2

2 1

2 2

Trang 40

LDA two classes

 First method：eigen-decompostition

 Second method：

0 w

w dw

w S

w T

b T

w w

S

Sw−1 b = 

scalar FDR =

=



)

( w

w max arg

max

w S

w

S FDR

w T b T

Trang 41

 Case: C-classes

Trang 42

 PCA vs LDA

Trang 43

 PCA vs LDA

Trang 44

Trang 45

Linear Discriminant Analysis (LDA)

 Improvements of LDA

 Probabilistic LDA

 Kernel Fisher Discriminant Analysis

Trang 46

 Điều đáng nói là một bức ảnh khuôn mặt có kích thước khoảng 200 × 200 sẽ có sốchiều là 40k - là một số cực lớn, trong khi đó, feature vector thường chỉ có số chiềubằng vài trăm.

 Các Eigenfaces chính là các eigenvectors ứng với các trị riêng lớn nhất của ma trậnhiệp phương sai

Trang 47

 Face Identification using PCA

 An image is a point in a high dimensional space

 An N x M image is a point in RNxM

 We can define vectors in this space as we did in the 2D case

Trang 48

Eigenface: Training

 Get 𝑀 training samples with variances

Images are in same size

Trang 49

Eigenface: Training

 Step 0: Convert all the images in vector form.

 Step 1: Calculate the mean (Average Face)

Trang 50

Eigenface: Training

 Step 2: Normalize vectors

 Step 3: Form the covariance matrix

Trang 51

Eigenface: Training

 Step 4: We calculate the Eigen vectors of Covariance Matrix

 Step 5: So, we choose k eigenvectors corresponding to k largest eigenvalues to

construct a projection matrix

Very high dimension

So, we compute

Trang 52

Eigenface: Training

 Step 6-7: Training Images projected to face space

Trang 53

Eigenface: Training

Trang 54

Eigenface: Test

 Take query image t

 Substact mean image : y = t – 

 Project y into eigenface space and compute projection

 Compare projection I with all N training projections

 Simple comparison metric: Euclidean

 Classifier: multiclass-logistic regression, SVM, …

Trang 55

Eigenface

Trang 56

 Pros

 Ease of implementation

 No knowledge of geometry or specific feature of the face required

 Sensitive to head scale

 Applicable only to front view

 Good performance only under controlled background (not including natural scenes)

Trang 57

State-of-the-art face recognition performance

 OpenFace

 FaceNet embedding

 InsightFace

Trang 58

Feature Selection

Trang 59

Introduction: Feature Selection

 Need for a small number of discriminative features

 To avoid “curse of dimensionality”

 To reduce feature measurement cost

 To reduce computational burden

 Feature selection algorithms can be categorized

 Supervised: filter models, wrapper models, and embedded models

 Unsupervised

 Semi-supervised

Trang 61

Feature Selection for Classification

Trang 62

 Feature selection for classification attempts to select the minimally sized subset of features according to the following criteria:

 the classification accuracy does not significantly decrease; and

 the resulting class distribution, given only the values for the selected features,

is as close as possible to the original class distribution, given all features

Trang 63

 Feature selection is an optimization problem.

 Search the space of possible feature subsets

 Pick the subset that is optimal or near-optimal with respect to a certain criterion

Search strategies Evaluation strategies

- Heuristic - Wrapper methods

- Randomized

Trang 64

Introduction

Trang 65

Algorithms for Flat Features

Trang 66

1 Filter Models

 Filter models evaluate features without utilizing any classification algorithms

 The objective function evaluates feature subsets

by their information content, typically interclassdistance, statistical dependence or information-theoretic measures

 Performance criteria : Fisher score, Mutual

information, ReliefF

Trang 67

Trang 68

❖ Mutual Information based on Methods

 Information gain is used to measure the dependence between features and labels

 Calculates the information gain between the i-th feature fiand the class labels C as

 A feature is relevant if it has a high information gain

Trang 69

Trang 70

2 Wrapper Models

 It utilize a specific classifier to evaluate the quality of selected features, and offer

a simple and powerful way to address the problem of feature selection,regardless of the chosen learning machine

 A typical wrapper model will perform the following steps:

 Step 1: searching a subset of features,

 Step 2: evaluating the selected subset of features by the performance of theclassifier,

 Step 3: repeating Step 1 and Step 2 until the desired quality is reached

 Search strategies: hill-climbing, best-first, GA, …

Trang 71

Trang 73

3 Embedded Models

 Embedded Models embedding feature selection with classifier construction, have theadvantages of (1) wrapper models — they include the interaction with theclassification model and (2) filter models—and they are far less computationallyintensive than wrapper methods

 Three types of embedded methods

 Pruning methods: that first utilize all features to train a model and then attempt to eliminate some features by setting the corresponding coefficients to 0, while maintaining model performance such as recursive feature elimination using a support vector machine.

 The second are models with a built-in mechanism for feature selection such as ID3 and C4.5.

 The third are regularization models with objective functions that minimize fitting errors and in the meantime

force the coefficients to be small or to be exactly zero Features with coefficients that are close to 0 are then eliminated Methods: Lasso Regularization, Adaptive Lasso, Bridge regularization, Elastic net regularization.

Trang 74

❖ Regularization methods

 Classifier induction and feature selection are achieved simultaneously by estimating w with properly tuned

penalties.

 Feature selection is achieved and only features with nonzero coefficients in w will be used in the classifier

 Lasso Regularization: Lasso regularization is based on l1-norm of the coefficient of w and defined as

l1 regularization can generate an estimation of w with exact zero coefficients In other words, there are zero

entities in w, which denotes that the corresponding features are eliminated during the classifier learning process Therefore, it can be used for feature selection.

Tiêu đề	Dimensionality Reduction and Feature Selection
Tác giả	Trịnh Tấn Đạt
Người hướng dẫn	TAN DAT TRINH, Ph.D.
Trường học	Saigon University
Chuyên ngành	Information Technology
Thể loại	Lecture
Năm xuất bản	2024
Thành phố	Ho Chi Minh City

Định dạng
Số trang	81
Dung lượng	2,74 MB