Boosting: Foundations and Algorithms
Rob Schapire
Example: Spam Filtering
• problem: filter out spam (junk email)
• gather large collection of examples of spam and non-spam:
   From: yoav@ucsd.edu       “Rob, can you review a paper”          non-spam
   From: xa412@hotmail.com   “Earn money without working!!!!”       spam
• goal: have computer learn from examples to distinguish spam from non-spam
[Diagram: labeled training examples → machine learning algorithm → classification rule → predicted classification]
Examples of Classification Problems
• text categorization (e.g., spam filtering)
Back to Spam
• main observation:
• easy to find “rules of thumb” that are “often” correct
• If ‘viagra’ occurs in message, then predict ‘spam’
• hard to find single rule that is very highly accurate
The Boosting Approach
• devise computer program for deriving rough rules of thumb
• apply procedure to subset of examples
• obtain rule of thumb
• apply to 2nd subset of examples
• obtain 2nd rule of thumb
• repeat T times
Key Details
• how to choose examples on each round?
• concentrate on “hardest” examples
(those most often misclassified by previous rules of thumb)
• how to combine rules of thumb into single prediction rule?
• take (weighted) majority vote of rules of thumb
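As a tiny illustration of the combination step (my own sketch, not from the slides; the full procedure appears later as AdaBoost), a weighted majority vote over rules of thumb can be written as:

    import numpy as np

    # Each "rule of thumb" maps an example x to +1 or -1; the combined
    # prediction is the sign of the weighted sum of their votes.
    def weighted_majority(rules, weights, x):
        return int(np.sign(sum(w * r(x) for r, w in zip(rules, weights))))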
• boosting = general method of converting rough rules of thumb into highly accurate prediction rule
• technically:
• assume given “weak” learning algorithm that can consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55%
  (in two-class setting)   [ “weak learning assumption” ]
• given sufficient data, a boosting algorithm can provably construct a single classifier with very high accuracy, say, 99%
Early History
• [Valiant ’84]:
• introduced theoretical (“PAC”) model for studying machine learning
• [Kearns & Valiant ’88]:
• open problem of finding a boosting algorithm
• if boosting possible, then
• can use (fairly) wild guesses to produce highly accurate predictions
• if can learn “part way” then can learn “all the way”
• should be able to improve any learning algorithm
• for any learning problem:
• either can always learn with nearly perfect accuracy
• or there exist cases where cannot learn even slightly better than random guessing
First Boosting Algorithms
• [Schapire ’89]:
• first provable boosting algorithm
• [Freund ’90]:
• “optimal” algorithm that “boosts by majority”
• [Drucker, Schapire & Simard ’92]:
• first experiments using boosting
• limited by practical drawbacks
• [Freund & Schapire ’95]:
• introduced “AdaBoost” algorithm
• strong practical advantages over previous boosting algorithms
A Formal Description of Boosting
• given training set (x1, y1), ..., (xm, ym)
• yi ∈ {−1, +1} correct label of instance xi ∈ X
• for t = 1, ..., T:
  • construct distribution Dt on {1, ..., m}
  • find weak classifier (“rule of thumb”) ht : X → {−1, +1} with small error εt = Pr_i∼Dt[ht(xi) ≠ yi]
• output final classifier Hfinal
AdaBoost
• constructing Dt: D1(i) = 1/m, and given Dt and ht,
  Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt
  where Zt = normalization factor and αt = (1/2) ln((1 − εt)/εt) > 0
• final classifier: Hfinal(x) = sign( Σt αt ht(x) )
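A minimal sketch of this procedure in Python (my own illustration, not the authors' code). It assumes a caller-supplied weak_learn(X, y, D) that trains on the D-weighted examples and returns a classifier h mapping an array of instances to {−1, +1}; a concrete decision-stump weak_learn is sketched later with the UCI experiments.

    import numpy as np

    def adaboost(X, y, weak_learn, T=100):
        # y is an array of labels in {-1, +1}
        m = len(y)
        D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
        hs, alphas = [], []
        for t in range(T):
            h = weak_learn(X, y, D)                # weak classifier for distribution D_t
            pred = h(X)                            # predictions in {-1, +1}
            eps = float(np.sum(D * (pred != y)))   # weighted error epsilon_t
            eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard against log(0)
            alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
            D = D * np.exp(-alpha * y * pred)      # up-weight misclassified examples
            D = D / D.sum()                        # divide by normalization factor Z_t
            hs.append(h)
            alphas.append(alpha)
        return hs, alphas

    def H_final(hs, alphas, X):
        # H_final(x) = sign( sum_t alpha_t h_t(x) )
        return np.sign(sum(a * h(X) for h, a in zip(hs, alphas)))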
Toy Example
• weak classifiers = vertical or horizontal half-planes
[Figure: 2-D training points under the initial uniform distribution D1]
Round 2
[Figure: weak classifier h2 (a half-plane) and the reweighted distribution it produces]
ε2 = 0.21
α2 = 0.65
Round 3
[Figure: weak classifier h3 and the distribution D3 it is trained on]
ε3 = 0.14
α3 = 0.92
Final Classifier
[Figure: Hfinal, the weighted vote of the half-plane classifiers from the rounds above]
AdaBoost (recap)
• given training set (x1, y1), ..., (xm, ym)
• for t = 1, ..., T: train weak classifier ht on distribution Dt, set αt = (1/2) ln((1 − εt)/εt), and update
  Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt, where Zt = normalization factor
• Hfinal(x) = sign( Σt αt ht(x) )
Analyzing the Training Error
• write εt = 1/2 − γt, where γt is the “edge” of ht over random guessing
• Theorem: training error(Hfinal) ≤ ∏t √(1 − 4γt²) ≤ exp(−2 Σt γt²)
• so if every γt ≥ γ > 0, the training error drops exponentially fast: ≤ e^(−2γ²T)
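As a worked instance of this bound (my own arithmetic, under an assumed edge): if every round satisfies the weak learning assumption with γt ≥ γ = 0.05, i.e. accuracy at least 55%, then after T = 1000 rounds
    training error(Hfinal) ≤ e^(−2γ²T) = e^(−5) ≈ 0.007,
so at most about 0.7% of the training examples can remain misclassified.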
How Will Test Error Behave? (A First Guess)
[Plot: expected train and test error as a function of the number of rounds T]
expect:
• training error to continue to drop (or reach zero)
• test error to increase when Hfinal becomes “too complex”
• “Occam’s razor”
• overfitting
• hard to know when to stop training
• with high probability:
  generalization error ≤ training error + Õ( √( Td / m ) )
  where T = number of rounds, m = number of training examples, and d measures the complexity of the weak classifiers
Overfitting Can Happen
• but often doesn’t
Actual Typical Run
[Plot: train and test error versus number of rounds of boosting]
• test error does not increase, even after 1000 rounds
• (total size > 2,000,000 nodes)
• test error continues to drop even after training error is zero!
# rounds (increasing) →
train error (%):  0.0   0.0   0.0
test error (%):   8.4   3.3   3.1
• Occam’s razor wrongly predicts “simpler” rule is better
A Better Story: The Margins Explanation
[with Freund, Bartlett & Lee]
• key idea:
• training error only measures whether classifications are right or wrong
• should also consider confidence of classifications
• recall: Hfinal is weighted majority vote of weak classifiers
• measure confidence by margin = strength of the vote
  = (weighted fraction voting correctly) − (weighted fraction voting incorrectly)
[Figure: the margin scale, running from incorrect (negative) to correct (positive)]
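A sketch of how these margins can be computed for the output of the adaboost() sketch given earlier (my own illustration; it assumes all αt are nonnegative, which holds whenever εt < 1/2):

    import numpy as np

    def margins(hs, alphas, X, y):
        # margin(x_i) = y_i * (sum_t alpha_t h_t(x_i)) / (sum_t alpha_t)
        #             = (weighted fraction voting correctly)
        #               - (weighted fraction voting incorrectly)
        f = sum(a * h(X) for h, a in zip(hs, alphas)) / np.sum(alphas)
        return y * f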
Empirical Evidence: The Margin Distribution
[Plot: cumulative distribution of training-example margins after 100 and 1000 rounds of boosting]
Theoretical Evidence: Analyzing Boosting Using Margins
• Theorem: large margins ⇒ better bound on generalization error (independent of number of rounds)
• Theorem: boosting tends to increase margins of training examples (given weak learning assumption)
• moreover, larger edges ⇒ larger margins
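For reference, the margin bound from this line of work has roughly the following form (quoted from memory, so treat constants and the Õ(·) as indicative): for any θ > 0, with high probability,
    generalization error ≤ (fraction of training examples with margin ≤ θ) + Õ( √( d / (m θ²) ) )
where m is the number of training examples and d measures the complexity of the weak classifiers; the number of rounds T does not appear.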
Consequences of Margins Theory
• predicts good generalization with no overfitting if:
• weak classifiers have large edges (implying large margins)
• weak classifiers not too complex relative to size of training set
• e.g., boosting decision trees resistant to overfitting since trees often have large edges and limited complexity
• overfitting may occur if:
• small edges (underfitting), or
• overly complex weak classifiers
• e.g., heart-disease dataset:
• stumps yield small edges
• also, small dataset
More Theory
• many other ways of understanding AdaBoost:
• as playing a repeated two-person matrix game
• weak learning assumption and optimal margin have natural game-theoretic interpretations
• special case of more general game-playing algorithm
• as a method for minimizing a particular loss function via numerical techniques, such as coordinate descent (the loss is sketched after this list)
• using convex analysis in an “information-geometric” framework that includes logistic regression and maximum entropy
• as a universally consistent statistical method
• can also derive optimal boosting algorithm, and extend to continuous time
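For concreteness, the loss referred to in the coordinate-descent view above is the exponential loss (a standard interpretation of AdaBoost, stated here as a sketch):
    L(α1, ..., αT) = (1/m) Σi exp( −yi f(xi) ),   where f(x) = Σt αt ht(x)
each round of AdaBoost can be read as greedily choosing the coordinate (weak classifier ht) and step size αt that most decrease this loss.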
Practical Advantages of AdaBoost
• fast
• simpleand easy to program
• no parameters to tune (except T)
• flexible — can combine with anylearning algorithm
• no prior knowledge needed about weak learner
• provably effective, provided can consistently find rough rules
of thumb
→ shift in mind set: goal now is merely to find classifiers barely better than random guessing
• versatile
• can use with data that is textual, numeric, discrete, etc
• has been extended to learning problems well beyond binary classification
• performance of AdaBoost depends on data and weak learner
• consistent with theory, AdaBoost can fail if
• weak classifiers too complex
→ overfitting
• weak classifiers too weak (γt → 0 too quickly)
→ underfitting
→ low margins → overfitting
• empirically, AdaBoost seems especially susceptible to uniform noise
UCI Experiments
[with Freund]
• tested AdaBoost on UCI benchmarks
• used:
• C4.5 (Quinlan’s decision tree algorithm)
• “decision stumps”: very simple rules of thumb that test
on single attributes
[Figure: two example decision stumps, one testing “eye color = brown ?” and one testing “height > 5 feet ?”, each predicting +1 or −1 depending on the answer]
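A sketch of such a single-attribute decision stump as a weak learner, compatible with the adaboost() sketch given earlier (my own illustration; attributes are assumed numeric or numerically encoded, and the threshold search is brute force over observed values):

    import numpy as np

    def stump_learn(X, y, D):
        # Return the stump (feature j, threshold, sign) with lowest D-weighted error.
        m, n = X.shape
        best, best_err = None, np.inf
        for j in range(n):
            for thr in np.unique(X[:, j]):
                for sign in (+1, -1):
                    pred = np.where(X[:, j] > thr, sign, -sign)
                    err = np.sum(D * (pred != y))
                    if err < best_err:
                        best_err, best = err, (j, thr, sign)
        j, thr, sign = best
        return lambda Xq: np.where(Xq[:, j] > thr, sign, -sign)

    # usage with the earlier sketch:  hs, alphas = adaboost(X, y, stump_learn, T=100)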
[Plot: test errors on the UCI benchmark datasets]
Application: Detecting Faces
[Viola & Jones]
• problem: find facesin photograph or movie
• weak classifiers: detect light/dark rectangles in image
• many clever tricks to make extremely fast and accurate
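A sketch of the kind of light/dark rectangle feature mentioned above, computed with an integral image so that any rectangle sum takes only a few array lookups (my own illustration; the full detector combines many such features and further speed tricks not shown here):

    import numpy as np

    def integral_image(img):
        # ii[r, c] = sum of img[:r+1, :c+1]
        return img.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, r0, c0, r1, c1):
        # sum of img[r0:r1, c0:c1] from four integral-image lookups
        s = ii[r1 - 1, c1 - 1]
        if r0 > 0:
            s -= ii[r0 - 1, c1 - 1]
        if c0 > 0:
            s -= ii[r1 - 1, c0 - 1]
        if r0 > 0 and c0 > 0:
            s += ii[r0 - 1, c0 - 1]
        return s

    def two_rectangle_feature(ii, r0, c0, r1, c1):
        # light-vs-dark contrast: left half minus right half of the window
        cm = (c0 + c1) // 2
        return rect_sum(ii, r0, c0, r1, cm) - rect_sum(ii, r0, cm, r1, c1)

    # a weak classifier thresholds such a feature value to predict face / non-face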
Application: Human-computer Spoken Dialogue
[with Rahim, Di Fabbrizio, Dutton, Gupta, Hollister & Riccardi]
• application: automatic “store front” or “help desk” for AT&T Labs’ Natural Voices business
• caller can request demo, pricing information, technical
support, sales agent, etc
• interactive dialogue
[Diagram: caller’s raw speech → automatic speech recognizer → text → NLU (predicted category) → dialogue manager → text response → text-to-speech]
• NLU’s job: classify caller utterances into 24 categories (demo, sales rep, pricing info, yes, no, etc.)
• weak classifiers: test for presence of word or phrase
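A sketch of such a word/phrase-presence weak classifier (my own illustration; the example phrase and the reduction to ±1 labels for a single category are assumptions):

    import numpy as np

    def make_phrase_rule(phrase):
        # h(x) = +1 if the phrase occurs in the utterance text, else -1
        phrase = phrase.lower()
        return lambda texts: np.array([+1 if phrase in t.lower() else -1 for t in texts])

    # e.g. h = make_phrase_rule("sales rep"); h(["I would like to talk to a sales rep"]) -> [+1]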