Ensemble Learning
Trịnh Tấn Đạt
Faculty of Information Technology, Saigon University
Email: trinhtandat@sgu.edu.vn
Website: https://sites.google.com/site/ttdat88/
Introduction
An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way (typically, by weighted or unweighted voting) to classify new examples.
Ensembles are often much more accurate than the individual classifiers that make them up.
[Diagram: Learner 1, Learner 2, ..., Learner K each produce Model 1, Model 2, ..., Model K; a Model Combiner merges them into the Final Model.]
Necessary and Sufficient Condition
For the idea to work, the classifiers should be:
Accurate: each has an error rate better than random guessing on new instances.
Diverse: they make different errors on new data points.
Why They Work
Suppose there are 25 base classifiers, each with an error rate ε = 0.35, and assume the classifiers are independent.
The majority-vote ensemble is wrong only when 13 or more of the 25 classifiers are wrong, so the probability that the ensemble makes a wrong prediction is
P(\text{ensemble wrong}) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06
Marquis de Condorcet (1785): a majority vote of independent, better-than-chance voters is wrong with a probability that shrinks as the number of voters grows.
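A quick numerical check of this calculation (a minimal sketch; the 25 classifiers and ε = 0.35 come from the example above, everything else is illustrative):

from math import comb

eps = 0.35   # error rate of each base classifier
n = 25       # number of independent base classifiers

# The majority vote is wrong when 13 or more of the 25 classifiers err.
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(f"P(ensemble wrong) = {p_wrong:.3f}")   # ~0.06, far below 0.35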
Value of Ensembles
When combining multiple independent and diverse decisions, each of which is at least more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced.
Human ensembles are demonstrably better.
How many jelly beans are in the jar? Individual estimates vs. the group average.
A Motivating Example
Suppose that you are a patient with a set of symptoms.
Instead of taking the opinion of just one doctor (classifier), you decide to take the opinion of a few doctors.
Is this a good idea? Indeed it is.
By consulting many doctors and combining their diagnoses, you can get a fairly accurate idea of the true diagnosis.
The Wisdom of Crowds
The collective knowledge of a diverse and independent body of people typically exceeds the knowledge of any single individual, and it can be harnessed by voting.
When Do Ensembles Work?
Ensemble methods work better with 'unstable classifiers': classifiers that are sensitive to minor perturbations in the training set.
There are different ways to obtain the individual models:
Different methods for changing the training data (covered in the following slides).
Heterogeneous ensembles: individual models are obtained with different algorithms; a typical combining mechanism is that the outputs of the classifiers (level-0 classifiers) are used as training data for another classifier (the level-1 classifier).
Methods of Constructing Ensembles
1. Manipulate the training data set
2. Cross-validated committees
3. Weighted training examples
4. Manipulating input features
5. Manipulating output targets
6. Injecting randomness
Methods of Constructing Ensembles - 1
1. Manipulate the training data set
Bagging (bootstrap aggregation)
On each run, bagging presents the learning algorithm with a training set drawn randomly, with replacement, from the original training data. This process is called bootstrapping.
Each bootstrap sample contains, on average, 63.2% of the original training data, with several examples appearing multiple times.
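A minimal sketch illustrating the 63.2% figure (the dataset size and random seed are arbitrary assumptions):

import numpy as np

rng = np.random.default_rng(0)
n = 10_000                              # size of a hypothetical training set
sample = rng.integers(0, n, size=n)     # bootstrap sample: n indices drawn with replacement
unique_fraction = len(np.unique(sample)) / n
print(f"Fraction of unique examples: {unique_fraction:.3f}")   # close to 1 - 1/e ~ 0.632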
Methods of Constructing Ensembles - 2
2. Cross-validated committees
Construct training sets by leaving out disjoint subsets of the training data.
The idea is similar to k-fold cross-validation.
3. Weighted training examples: maintain a set of weights over the training examples. At each iteration, the weights are changed to place more emphasis on misclassified examples (AdaBoost).
Methods of Constructing Ensembles - 3
4. Manipulating input features
Works if the input features are highly redundant (e.g., down-sampling FFT bins).
5. Manipulating output targets
6. Injecting randomness
Variance and Bias
Bias is due to differences between the model and the true function.
Variance represents the sensitivity of the model to individual data points.
Bias-Variance Tradeoff
Making a model more flexible typically lowers its bias but raises its variance, and vice versa.
Voting
Simple Ensemble Techniques
Max voting: multiple models are used to make predictions for each data point. The prediction from each model is treated as a 'vote', and the prediction given by the majority of the models is used as the final prediction.

from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import VotingClassifier

# x_train, y_train, x_test, y_test are assumed to be defined already
model1 = LogisticRegression(random_state=1)
model2 = tree.DecisionTreeClassifier(random_state=1)

# voting='hard' takes the majority class label across the base models
model = VotingClassifier(estimators=[('lr', model1), ('dt', model2)], voting='hard')
model.fit(x_train, y_train)
model.score(x_test, y_test)
Simple Ensemble Techniques
Averaging: multiple predictions are made for each data point, and the average of the predictions from all models is used as the final prediction. Averaging can be used for regression problems, or for estimating class probabilities in classification problems.
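A minimal scikit-learn sketch of probability averaging via soft voting (the toy dataset and the particular base models are assumptions, not from the slides):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

X, y = make_classification(n_samples=500, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

# voting='soft' averages the predicted class probabilities of the base models
model = VotingClassifier(
    estimators=[('lr', LogisticRegression(random_state=1)),
                ('dt', DecisionTreeClassifier(random_state=1))],
    voting='soft')
model.fit(x_train, y_train)
print(model.score(x_test, y_test))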
Simple Ensemble Techniques
Weighted average: each model is assigned a different weight reflecting its importance for the prediction, e.g.:
final_pred = pred1*0.3 + pred2*0.3 + pred3*0.4
Bagging and Boosting
Bagging and boosting aggregate multiple hypotheses generated by the same learning algorithm invoked over different distributions of the training data.
Bagging and boosting generate a classifier with a smaller error on the training data, because they combine multiple hypotheses which individually have a larger error.
Bagging: reduces variance.
Boosting: reduces bias.
Bagging replicates training sets by sampling with replacement from the training instances.
Boosting uses all instances but weights them, and therefore produces different classifiers.
The classifiers are then combined by voting to create a composite classifier.
Bagging: classifiers have an equal vote; the majority wins.
Boosting: each classifier's vote depends on its accuracy, giving extra weight to the opinions of the more accurate classifiers.
Bagging
Bagging draws bootstrap samples from the original data, with replacement.
Each bootstrap sample will on average contain 63.2% of the unique training examples; the rest are replicates.
Bagging helps most with learning algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed.
• Also known as bootstrap aggregation
• Sampling uniformly with replacement
• Build a classifier on each bootstrap sample
• Each bootstrap sample Di contains approximately 63.2% of the original training data
• The remaining examples (about 36.8%) can be used as a test set (the 'out-of-bag' examples)
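A minimal scikit-learn sketch of bagging (the default base learner of BaggingClassifier is a decision tree; the dataset and parameters are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

# 50 base classifiers, each trained on a bootstrap sample of the training data;
# oob_score=True evaluates each one on the ~36.8% of examples it did not see.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, oob_score=True, random_state=1)
bag.fit(x_train, y_train)
print(bag.oob_score_, bag.score(x_test, y_test))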
Example: decision stump as the base classifier
• A decision stump is a single-level binary decision tree
• On the example data set, its accuracy is at most 70%
Accuracy of the bagged ensemble classifier: 100% ☺
Bagging - Final Points
Works well if the 'base classifiers' are unstable.
Increases accuracy because it reduces the variance of the individual classifiers.
Does not focus on any particular instance of the training data.
Therefore, it is less susceptible to model over-fitting when applied to noisy data.
Bagging Algorithm: Training Phase
1. Initialize the parameters:
   D = {}, the ensemble
   K = the number of classifiers to train
2. For k = 1 to K:
   Take a bootstrap sample Sk from the training data
   Build a classifier Dk using Sk as the training set
   Add the classifier to the ensemble: D = D ∪ {Dk}
3. Return D
Bagging Algorithm: Classification Phase
4. Run D1, ..., DK on the input data x
5. The class with the maximum number of votes is chosen as the label for x
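A minimal from-scratch sketch of the two phases above (the function names and the choice of decision trees as base classifiers are assumptions):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, K=25, seed=0):
    """Training phase: fit K classifiers, each on a bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    ensemble = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)                        # bootstrap sample S_k
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])      # classifier D_k
        ensemble.append(clf)                                    # D = D ∪ {D_k}
    return ensemble

def bagging_predict(ensemble, X):
    """Classification phase: majority vote of D_1, ..., D_K."""
    votes = np.stack([clf.predict(X) for clf in ensemble])      # shape (K, n_samples)
    # assumes non-negative integer class labels
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)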
Why Bagging Works
The main sources of error in learning are bias and variance.
Bias is due to differences between the model and the true function.
Variance represents the sensitivity of the model to individual data points.
Does bagging minimize these errors? Yes: averaging over bootstrap samples reduces the error from variance, especially for unstable classifiers.
When Is Bagging Useful?
Bagging is bad if the models are very similar (not independent enough).
This happens if the learning algorithm is stable, that is, the model does not usually change much after changing a few instances.
Bagging is useful when the base model is relatively unstable (high variance).
The aggregated model is then usually better than the original model trained on the full dataset.
Boosting
Instances are given weights. At each iteration, a new hypothesis is learned and the instances are reweighted to focus on the instances that the most recently learned classifier got wrong.
Initially, all N instances are assigned equal weights.
Unlike bagging, the weights may change at the end of each boosting round.
Equal weights are assigned to each training instance (1/N for round 1).
After a classifier Mi is learned, the weights are adjusted so that the subsequent classifier Mi+1 "pays more attention" to instances that were misclassified by Mi.
The final boosted classifier M* combines the votes of each individual classifier.
The weight of each classifier's vote is a function of its accuracy.
AdaBoost – a popular boosting algorithm
AdaBoost (adaptive boosting) is an ensemble learning algorithm that can be used for classification or regression.
AdaBoost creates the strong learner by iteratively adding weak learners.
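A minimal scikit-learn sketch of AdaBoost (its default weak learner is a decision stump; the dataset and number of rounds are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Each boosting round reweights the training instances and adds one more weak learner.
boost = AdaBoostClassifier(n_estimators=100, random_state=1)
boost.fit(x_train, y_train)
print(boost.score(x_test, y_test))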
Toy Example – taken from Antonio Torralba @ MIT
Each data point has a class label yt ∈ {+1, -1} and an initial weight wt = 1.
Weak learners are drawn from a simple family of classifiers; in the first round, the one that seems to be the best is chosen. This is a 'weak classifier': it performs only slightly better than chance.
We then update the weights, which sets a new problem for which the previous weak classifier performs at chance again, and a new weak classifier is chosen on the reweighted data.
This reweight-and-refit step is repeated for several rounds, each round adding one more weak classifier.
AdaBoost Strategy
At each stage of the algorithm, AdaBoost trains a new classifier using a data set in which the weighting coefficients have been adjusted according to the performance of the previously trained classifier, so as to give greater weight to misclassified instances.
Finally, when the desired number of base classifiers has been trained, their results are combined to form a committee, with different weights given to different classifiers.
AdaBoost: Initialization
Given a set of input vectors {x1, x2, ..., xN} along with binary target values {t1, t2, ..., tN}, that is, tn ∈ {-1, +1}.
Each instance is given a weight wn; initially, set wn = 1/N for all n.
Assume that we have a procedure to train the base (weak) classifier (say, a perceptron).
Boosting Framework
Indicator Function
I denotes the indicator function: I = 1 when an instance is misclassified, and 0 otherwise.
Jm is the "error" function of the m-th classifier: it picks out the weights associated with each misclassified training instance and adds them up.
The quantity εm can be thought of as the "error rate" of each base classifier on the data set.
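For reference, in the standard AdaBoost formulation (Bishop-style notation, an assumption consistent with the symbols used here) these quantities are

J_m = \sum_{n=1}^{N} w_n^{(m)} \, I\big(y_m(\mathbf{x}_n) \neq t_n\big),
\qquad
\epsilon_m = \frac{\sum_{n=1}^{N} w_n^{(m)} \, I\big(y_m(\mathbf{x}_n) \neq t_n\big)}{\sum_{n=1}^{N} w_n^{(m)}}

where y_m is the m-th base classifier, trained by minimizing the weighted error J_m.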
Epsilon & Alpha
(b) Evaluate the quantities εm and αm.
Weight Update & Prediction
(c) Update the data weighting coefficients.
3. Make predictions using the final model.
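The corresponding standard AdaBoost expressions (again a Bishop-style reference sketch rather than formulas taken verbatim from the slides) are

\alpha_m = \ln\frac{1-\epsilon_m}{\epsilon_m},
\qquad
w_n^{(m+1)} = w_n^{(m)} \exp\big(\alpha_m \, I(y_m(\mathbf{x}_n) \neq t_n)\big),
\qquad
Y_M(\mathbf{x}) = \operatorname{sign}\Big(\sum_{m=1}^{M} \alpha_m \, y_m(\mathbf{x})\Big)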
Note that the first base classifier is trained using weighting coefficients wn(1) that are all equal.
In subsequent iterations these weights are increased for data instances that are misclassified, and decreased or left unchanged for those classified correctly.
The alphas eventually give greater weight to the more accurate classifiers.
Experimental Results on Ensembles (Freund & Schapire, 1996; Quinlan, 1996)
Ensembles have been used to improve generalization accuracy on a wide variety of problems.
On average, Boosting provides a larger increase in accuracy than Bagging.
Boosting on rare occasions can degrade accuracy.
Boosting is particularly subject to over-fitting when there is significant noise in the training data.
Issues in Ensembles
Parallelism in Ensembles: Bagging is easily parallelized, Boosting is not.
Variants of Boosting to handle noisy data.
How “weak” should a base-learner for Boosting be?
Combining Boosting and Bagging
AdaBoost.M1 and AdaBoost.M2 – the original algorithms for binary and multiclass classification
LogitBoost – binary classification (for poorly separable classes)
Gentle AdaBoost or GentleBoost – binary classification (for use with multilevel categorical predictors)
RobustBoost – binary classification (robust against label noise)
LSBoost – least squares boosting (for regression ensembles)
Gradient boosting (GBoosting)
Stochastic Gradient Boosting
Penalized Gradient Boosting
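A minimal scikit-learn sketch of (stochastic) gradient boosting; setting subsample < 1.0 is what makes it stochastic. The dataset and hyperparameters are illustrative assumptions:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, noise=10.0, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Each tree is fitted to the residuals of the current ensemble;
# learning_rate shrinks each tree's contribution, and subsample=0.8 fits each
# tree on a random 80% of the training data (stochastic gradient boosting).
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                subsample=0.8, random_state=1)
gbr.fit(x_train, y_train)
print(gbr.score(x_test, y_test))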
Can combine many weak classifiers/regressors into a stronger classifier by voting, averaging, or bagging:
if the weak classifiers/regressors are better than random;
if there is sufficient de-correlation (independence) amongst the weak classifiers/regressors.
Can combine many (high-bias) weak classifiers/regressors into a strong classifier by boosting:
if the weak classifiers/regressors are chosen and combined using knowledge of how well they and others performed on the task on the training data.
The selection and combination encourage the weak classifiers to be complementary, diverse, and de-correlated.
Stacking and Blending
Both bagging and boosting assume we have a single "base learning" algorithm.
But what if we want to ensemble an arbitrary set of classifiers?
E.g., combine the outputs of an SVM, naive Bayes, and a nearest-neighbor model?
Stacking
Stacking trains a meta-model on the outputs of the base models.
When Does Stacking Work?
Stacking works best when the base models have complementary strengths and weaknesses.
For example: combining nearest-neighbor models with different values of k, naive Bayes, and logistic regression. Each of these models has different underlying assumptions, so (hopefully) they will be complementary.
k-Stacked learners: first attempt
Example:
Step 1: The train set is split into 10 parts.
Step 2: A base model (for example, a decision tree) is fitted on 9 parts and predictions are made for the 10th part; this is done for each of the 10 parts.
Steps 3-4: Repeating this (and then fitting the base model on the whole train set) yields predictions for the train set and the test set.
Step 5: The predictions from the train set are used as features to build a new model (logistic regression can be used).
Step 6: This model is used to make the final predictions on the test prediction set.
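A minimal sketch of these steps using out-of-fold predictions (the dataset, the two base models, and the logistic-regression meta-model are assumptions mirroring the example):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

base_models = [DecisionTreeClassifier(random_state=1), KNeighborsClassifier()]

# Steps 1-4: 10-fold out-of-fold predictions on the train set,
# plus predictions on the test set, for each base model.
train_meta = np.column_stack(
    [cross_val_predict(m, x_train, y_train, cv=10) for m in base_models])
test_meta = np.column_stack(
    [m.fit(x_train, y_train).predict(x_test) for m in base_models])

# Steps 5-6: a logistic-regression meta-model on the stacked predictions.
meta = LogisticRegression().fit(train_meta, y_train)
print(meta.score(test_meta, y_test))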
Blending:
Step 1: The train set is split into training and validation sets.
Step 2: Model(s) are fitted on the training set.
Step 3: Predictions are made on the validation set and the test set.
Step 4: The validation set and its predictions are used as features to build a new model.
Step 5: This model is used to make the final predictions on the test set and its meta-features.
from sklearn.linear_model import LogisticRegression

# df_val / df_test are assumed to hold the validation-set and test-set meta-features
model = LogisticRegression()
model.fit(df_val, y_val)
model.score(df_test, y_test)
Netflix Challenge - 1 million USD (2006-2009)
Netflix is an online DVD-rental and video-streaming service.
Task: predict users' ratings of films from the ratings given by other users.
Goal: improve the existing method by 10%.
Winner's solution: an ensemble of over 500 heterogeneous models, aggregated with gradient-boosted decision trees.
Ensembles based on blending/stacking were key approaches used in the Netflix competition.
Ensemble methods combine several hypotheses into one prediction.
They work better than the best individual hypothesis from the same class because they reduce bias or variance (or both).
Bagging is mainly a variance-reduction technique, useful for complex hypotheses.
Boosting focuses on harder examples and gives a weighted vote to the hypotheses.
Boosting works by reducing bias and increasing the classification margin.
Stacking is a generic approach to ensembling various models and performs very well in practice.