Ensemble Learning
Trịnh Tấn Đạt
Faculty of Information Technology, Saigon University
Email: trinhtandat@sgu.edu.vn
Website: https://sites.google.com/site/ttdat88/
Introduction
An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way (typically, by weighted or unweighted voting) to classify new examples.
Ensembles are often much more accurate than the individual classifiers that make them up.
[Diagram: Learner 1, Learner 2, ..., Learner K each produce Model 1, Model 2, ..., Model K; a Model Combiner merges them into the Final Model.]
Necessary and Sufficient Condition
For the idea to work, the classifiers should be:
Accurate: each has an error rate better than random guessing on new instances.
Diverse: they make different errors on new data points.
Why They Work
Suppose there are 25 base classifiers, each with an error rate ε = 0.35, and assume the classifiers are independent.
The majority-vote ensemble is wrong only when 13 or more of the 25 classifiers are wrong, so the probability that the ensemble makes a wrong prediction is
P(\text{ensemble wrong}) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06
Marquis de Condorcet (1785): a majority vote of independent, better-than-chance voters is wrong with a probability that shrinks as the number of voters grows.
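A quick numerical check of this calculation (a minimal sketch; the 25 classifiers and ε = 0.35 come from the example above, everything else is illustrative):

from math import comb

eps = 0.35   # error rate of each base classifier
n = 25       # number of independent base classifiers

# The majority vote is wrong when 13 or more of the 25 classifiers err.
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(f"P(ensemble wrong) = {p_wrong:.3f}")   # ~0.06, far below 0.35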
Value of Ensembles
When combining multiple independent and diverse decisions, each of which is at least more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced.
Human ensembles are demonstrably better.
How many jelly beans are in the jar? Individual estimates vs. the group average.
A Motivating Example
Suppose that you are a patient with a set of symptoms.
Instead of taking the opinion of just one doctor (classifier), you decide to take the opinion of a few doctors.
Is this a good idea? Indeed it is.
By consulting many doctors and combining their diagnoses, you can get a fairly accurate idea of the true diagnosis.
The Wisdom of Crowds
The collective knowledge of a diverse and independent body of people typically exceeds the knowledge of any single individual, and it can be harnessed by voting.
When Do Ensembles Work?
Ensemble methods work better with 'unstable classifiers': classifiers that are sensitive to minor perturbations in the training set.
There are different ways to obtain the individual models:
Different methods for changing the training data (covered in the following slides).
Heterogeneous ensembles: individual models are obtained with different algorithms; a typical combining mechanism is that the outputs of the classifiers (level-0 classifiers) are used as training data for another classifier (the level-1 classifier).
Methods of Constructing Ensembles
1. Manipulate the training data set
2. Cross-validated committees
3. Weighted training examples
4. Manipulating input features
5. Manipulating output targets
6. Injecting randomness
Methods of Constructing Ensembles - 1
1. Manipulate the training data set
Bagging (bootstrap aggregation)
On each run, bagging presents the learning algorithm with a training set drawn randomly, with replacement, from the original training data. This process is called bootstrapping.
Each bootstrap sample contains, on average, 63.2% of the original training data, with several examples appearing multiple times.
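A minimal sketch illustrating the 63.2% figure (the dataset size and random seed are arbitrary assumptions):

import numpy as np

rng = np.random.default_rng(0)
n = 10_000                              # size of a hypothetical training set
sample = rng.integers(0, n, size=n)     # bootstrap sample: n indices drawn with replacement
unique_fraction = len(np.unique(sample)) / n
print(f"Fraction of unique examples: {unique_fraction:.3f}")   # close to 1 - 1/e ~ 0.632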
Methods of Constructing Ensembles - 2
2. Cross-validated committees
Construct training sets by leaving out disjoint subsets of the training data.
The idea is similar to k-fold cross-validation.
3. Weighted training examples: maintain a set of weights over the training examples. At each iteration, the weights are changed to place more emphasis on misclassified examples (AdaBoost).
Methods of Constructing Ensembles - 3
4. Manipulating input features
Works if the input features are highly redundant (e.g., down-sampling FFT bins).
5. Manipulating output targets
6. Injecting randomness
Variance and Bias
Bias is due to differences between the model and the true function.
Variance represents the sensitivity of the model to individual data points.
Bias-Variance Tradeoff
Making a model more flexible typically lowers its bias but raises its variance, and vice versa.
Voting
Simple Ensemble Techniques
Max voting: multiple models are used to make predictions for each data point. The prediction from each model is treated as a 'vote', and the prediction given by the majority of the models is used as the final prediction.

from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import VotingClassifier

# x_train, y_train, x_test, y_test are assumed to be defined already
model1 = LogisticRegression(random_state=1)
model2 = tree.DecisionTreeClassifier(random_state=1)

# voting='hard' takes the majority class label across the base models
model = VotingClassifier(estimators=[('lr', model1), ('dt', model2)], voting='hard')
model.fit(x_train, y_train)
model.score(x_test, y_test)
Simple Ensemble Techniques
Averaging: multiple predictions are made for each data point, and the average of the predictions from all models is used as the final prediction. Averaging can be used for regression problems, or for estimating class probabilities in classification problems.
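A minimal scikit-learn sketch of probability averaging via soft voting (the toy dataset and the particular base models are assumptions, not from the slides):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

X, y = make_classification(n_samples=500, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

# voting='soft' averages the predicted class probabilities of the base models
model = VotingClassifier(
    estimators=[('lr', LogisticRegression(random_state=1)),
                ('dt', DecisionTreeClassifier(random_state=1))],
    voting='soft')
model.fit(x_train, y_train)
print(model.score(x_test, y_test))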
Simple Ensemble Techniques
Weighted average: each model is assigned a different weight reflecting its importance for the prediction, e.g.:
final_pred = pred1*0.3 + pred2*0.3 + pred3*0.4
Bagging and Boosting
Bagging and boosting aggregate multiple hypotheses generated by the same learning algorithm invoked over different distributions of the training data.
Bagging and boosting generate a classifier with a smaller error on the training data, because they combine multiple hypotheses which individually have a larger error.
Bagging: reduces variance.
Boosting: reduces bias.
Bagging replicates training sets by sampling with replacement from the training instances.
Boosting uses all instances but weights them, and therefore produces different classifiers.
The classifiers are then combined by voting to create a composite classifier.
Bagging: classifiers have an equal vote; the majority wins.
Boosting: each classifier's vote depends on its accuracy, giving extra weight to the opinions of the more accurate classifiers.
Bagging
Bagging draws bootstrap samples from the original data, with replacement.
Each bootstrap sample will on average contain 63.2% of the unique training examples; the rest are replicates.
Bagging helps most with learning algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed.
• Also known as bootstrap aggregation
• Sampling uniformly with replacement
• Build a classifier on each bootstrap sample
• Each bootstrap sample Di contains approximately 63.2% of the original training data
• The remaining examples (about 36.8%) can be used as a test set (the 'out-of-bag' examples)
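A minimal scikit-learn sketch of bagging (the default base learner of BaggingClassifier is a decision tree; the dataset and parameters are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

# 50 base classifiers, each trained on a bootstrap sample of the training data;
# oob_score=True evaluates each one on the ~36.8% of examples it did not see.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, oob_score=True, random_state=1)
bag.fit(x_train, y_train)
print(bag.oob_score_, bag.score(x_test, y_test))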
Example: decision stump as the base classifier
• A decision stump is a single-level binary decision tree
• On the example data set, its accuracy is at most 70%
Accuracy of the bagged ensemble classifier: 100% ☺
Bagging - Final Points
Works well if the 'base classifiers' are unstable.
Increases accuracy because it reduces the variance of the individual classifiers.
Does not focus on any particular instance of the training data.
Therefore, it is less susceptible to model over-fitting when applied to noisy data.
Bagging Algorithm: Training Phase
1. Initialize the parameters:
   D = {}, the ensemble
   K = the number of classifiers to train
2. For k = 1 to K:
   Take a bootstrap sample Sk from the training data
   Build a classifier Dk using Sk as the training set
   Add the classifier to the ensemble: D = D ∪ {Dk}
3. Return D
Bagging Algorithm: Classification Phase
4. Run D1, ..., DK on the input data x
5. The class with the maximum number of votes is chosen as the label for x
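A minimal from-scratch sketch of the two phases above (the function names and the choice of decision trees as base classifiers are assumptions):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, K=25, seed=0):
    """Training phase: fit K classifiers, each on a bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    ensemble = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)                        # bootstrap sample S_k
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])      # classifier D_k
        ensemble.append(clf)                                    # D = D ∪ {D_k}
    return ensemble

def bagging_predict(ensemble, X):
    """Classification phase: majority vote of D_1, ..., D_K."""
    votes = np.stack([clf.predict(X) for clf in ensemble])      # shape (K, n_samples)
    # assumes non-negative integer class labels
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)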
Why Bagging Works
The main sources of error in learning are bias and variance.
Bias is due to differences between the model and the true function.
Variance represents the sensitivity of the model to individual data points.
Does bagging minimize these errors? Yes: averaging over bootstrap samples reduces the error from variance, especially for unstable classifiers.
When Is Bagging Useful?
Bagging is bad if the models are very similar (not independent enough).
This happens if the learning algorithm is stable, that is, the model does not usually change much after changing a few instances.
Bagging is useful when the base model is relatively unstable (high variance).
The aggregated model is then usually better than the original model trained on the full dataset.
Boosting
Instances are given weights. At each iteration, a new hypothesis is learned and the instances are reweighted to focus on the instances that the most recently learned classifier got wrong.
Initially, all N instances are assigned equal weights.
Unlike bagging, the weights may change at the end of each boosting round.
Equal weights are assigned to each training instance (1/N for round 1).
After a classifier Mi is learned, the weights are adjusted so that the subsequent classifier Mi+1 "pays more attention" to instances that were misclassified by Mi.
The final boosted classifier M* combines the votes of each individual classifier.
The weight of each classifier's vote is a function of its accuracy.
AdaBoost – a popular boosting algorithm
AdaBoost (adaptive boosting) is an ensemble learning algorithm that can be used for classification or regression.
AdaBoost creates the strong learner by iteratively adding weak learners.
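A minimal scikit-learn sketch of AdaBoost (its default weak learner is a decision stump; the dataset and number of rounds are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Each boosting round reweights the training instances and adds one more weak learner.
boost = AdaBoostClassifier(n_estimators=100, random_state=1)
boost.fit(x_train, y_train)
print(boost.score(x_test, y_test))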
Toy Example – taken from Antonio Torralba @ MIT
Each data point has a class label yt ∈ {+1, -1} and an initial weight wt = 1.
Weak learners are drawn from a simple family of classifiers; in the first round, the one that seems to be the best is chosen. This is a 'weak classifier': it performs only slightly better than chance.
We then update the weights, which sets a new problem for which the previous weak classifier performs at chance again, and a new weak classifier is chosen on the reweighted data.
This reweight-and-refit step is repeated for several rounds, each round adding one more weak classifier.
AdaBoost Strategy
At each stage of the algorithm, AdaBoost trains a new classifier using a data set in which the weighting coefficients have been adjusted according to the performance of the previously trained classifier, so as to give greater weight to misclassified instances.
Finally, when the desired number of base classifiers has been trained, their results are combined to form a committee, with different weights given to different classifiers.
AdaBoost: Initialization
Given a set of input vectors {x1, x2, ..., xN} along with binary target values {t1, t2, ..., tN}, that is, tn ∈ {-1, +1}.
Each instance is given a weight wn; initially, set wn = 1/N for all n.
Assume that we have a procedure to train the base (weak) classifier (say, a perceptron).
Boosting Framework
Indicator Function
I denotes the indicator function: I = 1 when an instance is misclassified, and 0 otherwise.
Jm is the "error" function of the m-th classifier: it picks out the weights associated with each misclassified training instance and adds them up.
The quantity εm can be thought of as the "error rate" of each base classifier on the data set.
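For reference, in the standard AdaBoost formulation (Bishop-style notation, an assumption consistent with the symbols used here) these quantities are

J_m = \sum_{n=1}^{N} w_n^{(m)} \, I\big(y_m(\mathbf{x}_n) \neq t_n\big),
\qquad
\epsilon_m = \frac{\sum_{n=1}^{N} w_n^{(m)} \, I\big(y_m(\mathbf{x}_n) \neq t_n\big)}{\sum_{n=1}^{N} w_n^{(m)}}

where y_m is the m-th base classifier, trained by minimizing the weighted error J_m.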
Epsilon & Alpha
(b) Evaluate the quantities εm and αm.
Weight Update & Prediction
(c) Update the data weighting coefficients.
3. Make predictions using the final model.
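The corresponding standard AdaBoost expressions (again a Bishop-style reference sketch rather than formulas taken verbatim from the slides) are

\alpha_m = \ln\frac{1-\epsilon_m}{\epsilon_m},
\qquad
w_n^{(m+1)} = w_n^{(m)} \exp\big(\alpha_m \, I(y_m(\mathbf{x}_n) \neq t_n)\big),
\qquad
Y_M(\mathbf{x}) = \operatorname{sign}\Big(\sum_{m=1}^{M} \alpha_m \, y_m(\mathbf{x})\Big)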
Note that the first base classifier is trained using weighting coefficients wn(1) that are all equal.
In subsequent iterations these weights are increased for data instances that are misclassified, and decreased or left unchanged for those classified correctly.
The alphas eventually give greater weight to the more accurate classifiers.
Experimental Results on Ensembles (Freund & Schapire, 1996; Quinlan, 1996)
Ensembles have been used to improve generalization accuracy on a wide variety of problems.
On average, Boosting provides a larger increase in accuracy than Bagging.
Boosting on rare occasions can degrade accuracy.
Boosting is particularly subject to over-fitting when there is significant noise in the training data.
Issues in Ensembles
Parallelism in Ensembles: Bagging is easily parallelized, Boosting is not.
Variants of Boosting to handle noisy data.
How “weak” should a base-learner for Boosting be?
Combining Boosting and Bagging
AdaBoost.M1 and AdaBoost.M2 – the original algorithms for binary and multiclass classification
LogitBoost – binary classification (for poorly separable classes)
Gentle AdaBoost or GentleBoost – binary classification (for use with multilevel categorical predictors)
RobustBoost – binary classification (robust against label noise)
LSBoost – least squares boosting (for regression ensembles)
Gradient boosting (GBoosting)
Stochastic Gradient Boosting
Penalized Gradient Boosting
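A minimal scikit-learn sketch of (stochastic) gradient boosting; setting subsample < 1.0 is what makes it stochastic. The dataset and hyperparameters are illustrative assumptions:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, noise=10.0, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Each tree is fitted to the residuals of the current ensemble;
# learning_rate shrinks each tree's contribution, and subsample=0.8 fits each
# tree on a random 80% of the training data (stochastic gradient boosting).
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                subsample=0.8, random_state=1)
gbr.fit(x_train, y_train)
print(gbr.score(x_test, y_test))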
Can combine many weak classifiers/regressors into a stronger classifier by voting, averaging, or bagging:
if the weak classifiers/regressors are better than random;
if there is sufficient de-correlation (independence) amongst the weak classifiers/regressors.
Can combine many (high-bias) weak classifiers/regressors into a strong classifier by boosting:
if the weak classifiers/regressors are chosen and combined using knowledge of how well they and others performed on the task on the training data.
The selection and combination encourage the weak classifiers to be complementary, diverse, and de-correlated.
Stacking and Blending
Both bagging and boosting assume we have a single "base learning" algorithm.
But what if we want to ensemble an arbitrary set of classifiers?
E.g., combine the outputs of an SVM, naive Bayes, and a nearest-neighbor model?
Stacking
Stacking trains a meta-model on the outputs of the base models.
When Does Stacking Work?
Stacking works best when the base models have complementary strengths and weaknesses.
For example: combining nearest-neighbor models with different values of k, naive Bayes, and logistic regression. Each of these models has different underlying assumptions, so (hopefully) they will be complementary.
k-Stacked learners: first attempt
Example:
Step 1: The train set is split into 10 parts.
Step 2: A base model (for example, a decision tree) is fitted on 9 parts and predictions are made for the 10th part; this is done for each of the 10 parts.
Steps 3-4: Repeating this (and then fitting the base model on the whole train set) yields predictions for the train set and the test set.
Step 5: The predictions from the train set are used as features to build a new model (logistic regression can be used).
Step 6: This model is used to make the final predictions on the test prediction set.
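A minimal sketch of these steps using out-of-fold predictions (the dataset, the two base models, and the logistic-regression meta-model are assumptions mirroring the example):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

base_models = [DecisionTreeClassifier(random_state=1), KNeighborsClassifier()]

# Steps 1-4: 10-fold out-of-fold predictions on the train set,
# plus predictions on the test set, for each base model.
train_meta = np.column_stack(
    [cross_val_predict(m, x_train, y_train, cv=10) for m in base_models])
test_meta = np.column_stack(
    [m.fit(x_train, y_train).predict(x_test) for m in base_models])

# Steps 5-6: a logistic-regression meta-model on the stacked predictions.
meta = LogisticRegression().fit(train_meta, y_train)
print(meta.score(test_meta, y_test))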
Blending:
Step 1: The train set is split into training and validation sets.
Step 2: Model(s) are fitted on the training set.
Step 3: Predictions are made on the validation set and the test set.
Step 4: The validation set and its predictions are used as features to build a new model.
Step 5: This model is used to make the final predictions on the test set and its meta-features.
from sklearn.linear_model import LogisticRegression

# df_val / df_test are assumed to hold the validation-set and test-set meta-features
model = LogisticRegression()
model.fit(df_val, y_val)
model.score(df_test, y_test)
Netflix Challenge - 1 million USD (2006-2009)
Netflix is an online DVD-rental and video-streaming service.
Task: predict users' ratings of films from the ratings given by other users.
Goal: improve the existing method by 10%.
Winner's solution: an ensemble of over 500 heterogeneous models, aggregated with gradient-boosted decision trees.
Ensembles based on blending/stacking were key approaches used in the Netflix competition.
Ensemble methods combine several hypotheses into one prediction.
They work better than the best individual hypothesis from the same class because they reduce bias or variance (or both).
Bagging is mainly a variance-reduction technique, useful for complex hypotheses.
Boosting focuses on harder examples and gives a weighted vote to the hypotheses.
Boosting works by reducing bias and increasing the classification margin.
Stacking is a generic approach to ensembling various models and performs very well in practice.