Boosting: Foundations and Algorithms
Rob Schapire
Example: Spam Filtering
• problem: filter out spam (junk email)
• gather large collection of examples of spam and non-spam:
   From: yoav@ucsd.edu       “Rob, can you review a paper”          non-spam
   From: xa412@hotmail.com   “Earn money without working!!!!”       spam
• goal: have computer learn from examples to distinguish spam from non-spam
[Diagram: labeled training examples → machine learning algorithm → classification rule → predicted classification]
Examples of Classification Problems
• text categorization (e.g., spam filtering)
Back to Spam
• main observation:
• easy to find “rules of thumb” that are “often” correct
• If ‘viagra’ occurs in message, then predict ‘spam’
• hard to find single rule that is very highly accurate
The Boosting Approach
• devise computer program for deriving rough rules of thumb
• apply procedure to subset of examples
• obtain rule of thumb
• apply to 2nd subset of examples
• obtain 2nd rule of thumb
• repeat T times
Key Details
• how to choose examples on each round?
• concentrate on “hardest” examples
(those most often misclassified by previous rules of thumb)
• how to combine rules of thumb into single prediction rule?
• take (weighted) majority vote of rules of thumb
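As a tiny illustration of the combination step (my own sketch, not from the slides; the full procedure appears later as AdaBoost), a weighted majority vote over rules of thumb can be written as:

    import numpy as np

    # Each "rule of thumb" maps an example x to +1 or -1; the combined
    # prediction is the sign of the weighted sum of their votes.
    def weighted_majority(rules, weights, x):
        return int(np.sign(sum(w * r(x) for r, w in zip(rules, weights))))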
• boosting = general method of converting rough rules of thumb into highly accurate prediction rule
• technically:
• assume given “weak” learning algorithm that can consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55%
  (in two-class setting)   [ “weak learning assumption” ]
• given sufficient data, a boosting algorithm can provably construct a single classifier with very high accuracy, say, 99%
Early History
• [Valiant ’84]:
• introduced theoretical (“PAC”) model for studying machine learning
• [Kearns & Valiant ’88]:
• open problem of finding a boosting algorithm
• if boosting possible, then
• can use (fairly) wild guesses to produce highly accurate predictions
• if can learn “part way” then can learn “all the way”
• should be able to improve any learning algorithm
• for any learning problem:
• either can always learn with nearly perfect accuracy
• or there exist cases where cannot learn even slightly better than random guessing
First Boosting Algorithms
• [Schapire ’89]:
• first provable boosting algorithm
• [Freund ’90]:
• “optimal” algorithm that “boosts by majority”
• [Drucker, Schapire & Simard ’92]:
• first experiments using boosting
• limited by practical drawbacks
• [Freund & Schapire ’95]:
• introduced “AdaBoost” algorithm
• strong practical advantages over previous boosting algorithms
A Formal Description of Boosting
• given training set (x1, y1), ..., (xm, ym)
• yi ∈ {−1, +1} correct label of instance xi ∈ X
• for t = 1, ..., T:
  • construct distribution Dt on {1, ..., m}
  • find weak classifier (“rule of thumb”) ht : X → {−1, +1} with small error εt = Pr_i∼Dt[ht(xi) ≠ yi]
• output final classifier Hfinal
AdaBoost
• constructing Dt: D1(i) = 1/m, and given Dt and ht,
  Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt
  where Zt = normalization factor and αt = (1/2) ln((1 − εt)/εt) > 0
• final classifier: Hfinal(x) = sign( Σt αt ht(x) )
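A minimal sketch of this procedure in Python (my own illustration, not the authors' code). It assumes a caller-supplied weak_learn(X, y, D) that trains on the D-weighted examples and returns a classifier h mapping an array of instances to {−1, +1}; a concrete decision-stump weak_learn is sketched later with the UCI experiments.

    import numpy as np

    def adaboost(X, y, weak_learn, T=100):
        # y is an array of labels in {-1, +1}
        m = len(y)
        D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
        hs, alphas = [], []
        for t in range(T):
            h = weak_learn(X, y, D)                # weak classifier for distribution D_t
            pred = h(X)                            # predictions in {-1, +1}
            eps = float(np.sum(D * (pred != y)))   # weighted error epsilon_t
            eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard against log(0)
            alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
            D = D * np.exp(-alpha * y * pred)      # up-weight misclassified examples
            D = D / D.sum()                        # divide by normalization factor Z_t
            hs.append(h)
            alphas.append(alpha)
        return hs, alphas

    def H_final(hs, alphas, X):
        # H_final(x) = sign( sum_t alpha_t h_t(x) )
        return np.sign(sum(a * h(X) for h, a in zip(hs, alphas)))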
Toy Example
• weak classifiers = vertical or horizontal half-planes
[Figure: 2-D training points under the initial uniform distribution D1]
Round 2
[Figure: weak classifier h2 (a half-plane) and the reweighted distribution it produces]
ε2 = 0.21
α2 = 0.65
Round 3
[Figure: weak classifier h3 and the distribution D3 it is trained on]
ε3 = 0.14
α3 = 0.92
Final Classifier
[Figure: Hfinal, the weighted vote of the half-plane classifiers from the rounds above]
AdaBoost (recap)
• given training set (x1, y1), ..., (xm, ym)
• for t = 1, ..., T: train weak classifier ht on distribution Dt, set αt = (1/2) ln((1 − εt)/εt), and update
  Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt, where Zt = normalization factor
• Hfinal(x) = sign( Σt αt ht(x) )
Analyzing the Training Error
• write εt = 1/2 − γt, where γt is the “edge” of ht over random guessing
• Theorem: training error(Hfinal) ≤ ∏t √(1 − 4γt²) ≤ exp(−2 Σt γt²)
• so if every γt ≥ γ > 0, the training error drops exponentially fast: ≤ e^(−2γ²T)
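As a worked instance of this bound (my own arithmetic, under an assumed edge): if every round satisfies the weak learning assumption with γt ≥ γ = 0.05, i.e. accuracy at least 55%, then after T = 1000 rounds
    training error(Hfinal) ≤ e^(−2γ²T) = e^(−5) ≈ 0.007,
so at most about 0.7% of the training examples can remain misclassified.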
How Will Test Error Behave? (A First Guess)
[Plot: expected train and test error as a function of the number of rounds T]
expect:
• training error to continue to drop (or reach zero)
• test error to increase when Hfinal becomes “too complex”
• “Occam’s razor”
• overfitting
• hard to know when to stop training
• with high probability:
  generalization error ≤ training error + Õ( √( Td / m ) )
  where T = number of rounds, m = number of training examples, and d measures the complexity of the weak classifiers
Overfitting Can Happen
• but often doesn’t
Actual Typical Run
[Plot: train and test error versus number of rounds of boosting]
• test error does not increase, even after 1000 rounds
• (total size > 2,000,000 nodes)
• test error continues to drop even after training error is zero!
# rounds (increasing) →
train error (%):  0.0   0.0   0.0
test error (%):   8.4   3.3   3.1
• Occam’s razor wrongly predicts “simpler” rule is better
A Better Story: The Margins Explanation
[with Freund, Bartlett & Lee]
• key idea:
• training error only measures whether classifications are right or wrong
• should also consider confidence of classifications
• recall: Hfinal is weighted majority vote of weak classifiers
• measure confidence by margin = strength of the vote
  = (weighted fraction voting correctly) − (weighted fraction voting incorrectly)
[Figure: the margin scale, running from incorrect (negative) to correct (positive)]
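A sketch of how these margins can be computed for the output of the adaboost() sketch given earlier (my own illustration; it assumes all αt are nonnegative, which holds whenever εt < 1/2):

    import numpy as np

    def margins(hs, alphas, X, y):
        # margin(x_i) = y_i * (sum_t alpha_t h_t(x_i)) / (sum_t alpha_t)
        #             = (weighted fraction voting correctly)
        #               - (weighted fraction voting incorrectly)
        f = sum(a * h(X) for h, a in zip(hs, alphas)) / np.sum(alphas)
        return y * f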
Empirical Evidence: The Margin Distribution
[Plot: cumulative distribution of training-example margins after 100 and 1000 rounds of boosting]
Theoretical Evidence: Analyzing Boosting Using Margins
• Theorem: large margins ⇒ better bound on generalization error (independent of number of rounds)
• Theorem: boosting tends to increase margins of training examples (given weak learning assumption)
• moreover, larger edges ⇒ larger margins
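For reference, the margin bound from this line of work has roughly the following form (quoted from memory, so treat constants and the Õ(·) as indicative): for any θ > 0, with high probability,
    generalization error ≤ (fraction of training examples with margin ≤ θ) + Õ( √( d / (m θ²) ) )
where m is the number of training examples and d measures the complexity of the weak classifiers; the number of rounds T does not appear.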
Consequences of Margins Theory
• predicts good generalization with no overfitting if:
• weak classifiers have large edges (implying large margins)
• weak classifiers not too complex relative to size of training set
• e.g., boosting decision trees resistant to overfitting since trees often have large edges and limited complexity
• overfitting may occur if:
• small edges (underfitting), or
• overly complex weak classifiers
• e.g., heart-disease dataset:
• stumps yield small edges
• also, small dataset
More Theory
• many other ways of understanding AdaBoost:
• as playing a repeated two-person matrix game
• weak learning assumption and optimal margin have natural game-theoretic interpretations
• special case of more general game-playing algorithm
• as a method for minimizing a particular loss function via numerical techniques, such as coordinate descent (the loss is sketched after this list)
• using convex analysis in an “information-geometric” framework that includes logistic regression and maximum entropy
• as a universally consistent statistical method
• can also derive optimal boosting algorithm, and extend to continuous time
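For concreteness, the loss referred to in the coordinate-descent view above is the exponential loss (a standard interpretation of AdaBoost, stated here as a sketch):
    L(α1, ..., αT) = (1/m) Σi exp( −yi f(xi) ),   where f(x) = Σt αt ht(x)
each round of AdaBoost can be read as greedily choosing the coordinate (weak classifier ht) and step size αt that most decrease this loss.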
Practical Advantages of AdaBoost
• fast
• simpleand easy to program
• no parameters to tune (except T)
• flexible — can combine with anylearning algorithm
• no prior knowledge needed about weak learner
• provably effective, provided can consistently find rough rules
of thumb
→ shift in mind set: goal now is merely to find classifiers barely better than random guessing
• versatile
• can use with data that is textual, numeric, discrete, etc
• has been extended to learning problems well beyond binary classification
• performance of AdaBoost depends on data and weak learner
• consistent with theory, AdaBoost can fail if
• weak classifiers too complex
→ overfitting
• weak classifiers too weak (γt → 0 too quickly)
→ underfitting
→ low margins → overfitting
• empirically, AdaBoost seems especially susceptible to uniform noise
UCI Experiments
[with Freund]
• tested AdaBoost on UCI benchmarks
• used:
• C4.5 (Quinlan’s decision tree algorithm)
• “decision stumps”: very simple rules of thumb that test
on single attributes
[Figure: two example decision stumps, one testing “eye color = brown ?” and one testing “height > 5 feet ?”, each predicting +1 or −1 depending on the answer]
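A sketch of such a single-attribute decision stump as a weak learner, compatible with the adaboost() sketch given earlier (my own illustration; attributes are assumed numeric or numerically encoded, and the threshold search is brute force over observed values):

    import numpy as np

    def stump_learn(X, y, D):
        # Return the stump (feature j, threshold, sign) with lowest D-weighted error.
        m, n = X.shape
        best, best_err = None, np.inf
        for j in range(n):
            for thr in np.unique(X[:, j]):
                for sign in (+1, -1):
                    pred = np.where(X[:, j] > thr, sign, -sign)
                    err = np.sum(D * (pred != y))
                    if err < best_err:
                        best_err, best = err, (j, thr, sign)
        j, thr, sign = best
        return lambda Xq: np.where(Xq[:, j] > thr, sign, -sign)

    # usage with the earlier sketch:  hs, alphas = adaboost(X, y, stump_learn, T=100)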
[Plot: test errors on the UCI benchmark datasets]
Application: Detecting Faces
[Viola & Jones]
• problem: find facesin photograph or movie
• weak classifiers: detect light/dark rectangles in image
• many clever tricks to make extremely fast and accurate
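A sketch of the kind of light/dark rectangle feature mentioned above, computed with an integral image so that any rectangle sum takes only a few array lookups (my own illustration; the full detector combines many such features and further speed tricks not shown here):

    import numpy as np

    def integral_image(img):
        # ii[r, c] = sum of img[:r+1, :c+1]
        return img.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, r0, c0, r1, c1):
        # sum of img[r0:r1, c0:c1] from four integral-image lookups
        s = ii[r1 - 1, c1 - 1]
        if r0 > 0:
            s -= ii[r0 - 1, c1 - 1]
        if c0 > 0:
            s -= ii[r1 - 1, c0 - 1]
        if r0 > 0 and c0 > 0:
            s += ii[r0 - 1, c0 - 1]
        return s

    def two_rectangle_feature(ii, r0, c0, r1, c1):
        # light-vs-dark contrast: left half minus right half of the window
        cm = (c0 + c1) // 2
        return rect_sum(ii, r0, c0, r1, cm) - rect_sum(ii, r0, cm, r1, c1)

    # a weak classifier thresholds such a feature value to predict face / non-face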
Application: Human-computer Spoken Dialogue
[with Rahim, Di Fabbrizio, Dutton, Gupta, Hollister & Riccardi]
• application: automatic “store front” or “help desk” for AT&T Labs’ Natural Voices business
• caller can request demo, pricing information, technical
support, sales agent, etc
• interactive dialogue
[Diagram: caller’s raw speech → automatic speech recognizer → text → NLU (predicted category) → dialogue manager → text response → text-to-speech]
• NLU’s job: classify caller utterances into 24 categories (demo, sales rep, pricing info, yes, no, etc.)
• weak classifiers: test for presence of word or phrase
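A sketch of such a word/phrase-presence weak classifier (my own illustration; the example phrase and the reduction to ±1 labels for a single category are assumptions):

    import numpy as np

    def make_phrase_rule(phrase):
        # h(x) = +1 if the phrase occurs in the utterance text, else -1
        phrase = phrase.lower()
        return lambda texts: np.array([+1 if phrase in t.lower() else -1 for t in texts])

    # e.g. h = make_phrase_rule("sales rep"); h(["I would like to talk to a sales rep"]) -> [+1]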