Slide 1: Learning II
Introduction to Artificial Intelligence
CS440/ECE448
Lecture 21
Slide 2: Last lecture
• The (in)efficiency of exact inference with Bayes nets
• The learning problem
Slides 6-9: Inductive learning method
• Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples).
• E.g., curve fitting: several different curves, from a straight line to a high-degree polynomial, can be made to fit the same training points.
• Ockham’s razor: prefer the simplest consistent hypothesis.
Slide 10: Inductive Learning
• Given examples of some concepts, together with descriptions (features) of those examples as training data, learn how to classify subsequent descriptions into one of the concepts.
Slide 11: Decision Trees
“Should I play tennis today?”
Note: a decision tree can be expressed as a disjunction of conjunctions, e.g.:
((Outlook = Sunny) ∧ (Humidity = Normal)) ∨ ((Outlook = Overcast) ∧ (Wind = Weak))
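As a quick illustration, here is a minimal sketch of that rule as code. The full tree in the slide’s figure is not recoverable, so this encodes only the disjunction shown above; the function name and argument names are mine.

```python
# The decision tree above, written directly as a disjunction of conjunctions.
def play_tennis(outlook: str, humidity: str, wind: str) -> bool:
    return ((outlook == "Sunny" and humidity == "Normal")
            or (outlook == "Overcast" and wind == "Weak"))

print(play_tennis("Sunny", "Normal", "Strong"))  # True: first conjunct holds
print(play_tennis("Rain", "High", "Weak"))       # False: neither conjunct holds
```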
Slide 12: Learning Decision Trees
• Inductive learning: given a set of positive and negative training examples of a concept, can we learn a decision tree that can be used to appropriately classify other examples?
• Identification trees: ID3 [Quinlan, 1979]
Slide 13: What on Earth causes people to get sunburned?
I don’t know, so let’s go to the beach and collect some data.
Slide 14: Sunburn data
Name   Hair   Height   Swim Suit Color   Lotion   Result
Sarah  Blond  Average  Yellow            No       Sunburned
Annie  Blond  Short    Red               No       Sunburned
Emily  Red    Average  Blue              No       Sunburned
John   Brown  Average  Blue              No       Fine
Katie  Blond  Short    Yellow            Yes      Fine
Dana   Blond  ?        ?                 Yes      Fine
Pete   Brown  ?        ?                 ?        Fine
Alex   Brown  ?        ?                 ?        Fine
[Cells marked ? were lost in extraction; the values shown for Dana, Pete, and Alex are recovered from the later slides.]
There are 3 × 3 × 3 × 2 = 54 possible feature vectors.
Slide 15: Exact Matching Method
• Construct a table recording the observed cases.
• Use table lookup to classify new data.
• Problem: for realistic problems, exact matching can’t be used. With 8 people out of 54 possible feature vectors, a new example has only an 8/54 ≈ 15% chance of finding an exact match.
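A minimal sketch of the exact-matching method: store each observed case in a table keyed by its feature vector, then classify by direct lookup. The feature tuples are (hair, height, suit color, lotion) from the sunburn table; only the five fully recoverable rows are used.

```python
# Exact matching: a lookup table of observed cases.
table = {
    ("Blond", "Average", "Yellow", "No"):  "Sunburned",  # Sarah
    ("Blond", "Short",   "Red",    "No"):  "Sunburned",  # Annie
    ("Red",   "Average", "Blue",   "No"):  "Sunburned",  # Emily
    ("Brown", "Average", "Blue",   "No"):  "Fine",       # John
    ("Blond", "Short",   "Yellow", "Yes"): "Fine",       # Katie
}

def classify(features):
    # Fails for any of the many unseen feature vectors (the 15% problem).
    return table.get(features, "unknown")

print(classify(("Blond", "Average", "Yellow", "No")))  # Sunburned (exact match)
print(classify(("Blond", "Tall",    "Blue",   "No")))  # unknown (no match)
```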
Slide 16: How can we do the classification?
• Nearest-neighbor method (but only if we can establish a distance between feature vectors).
• Use identification trees: an identification tree is a decision tree in which each set of possible conclusions is implicitly established by a list of samples of known class.
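A minimal nearest-neighbor sketch for the sunburn data, assuming Hamming distance (the count of differing attributes) as the distance between categorical feature vectors; this distance choice is an assumption of mine, not from the slides.

```python
# Nearest-neighbor classification over categorical features.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# (features, label) pairs from the recoverable sunburn rows.
examples = [
    (("Blond", "Average", "Yellow", "No"),  "Sunburned"),  # Sarah
    (("Blond", "Short",   "Red",    "No"),  "Sunburned"),  # Annie
    (("Red",   "Average", "Blue",   "No"),  "Sunburned"),  # Emily
    (("Brown", "Average", "Blue",   "No"),  "Fine"),       # John
    (("Blond", "Short",   "Yellow", "Yes"), "Fine"),       # Katie
]

def nearest_neighbor(query):
    # Return the label of the closest stored example.
    _, label = min(((hamming(query, f), lbl) for f, lbl in examples),
                   key=lambda t: t[0])
    return label

print(nearest_neighbor(("Blond", "Tall", "Yellow", "No")))  # Sunburned (closest: Sarah)
```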
Slide 17: An ID tree consistent with the data
[Figure: ID tree. Hair color? Blond: test Lotion used (No: Sarah, Annie, sunburned; Yes: Dana, Katie, fine). Red: Emily, sunburned. Brown: Alex, Pete, John, fine.]
Slide 18: Another consistent ID tree
[Figure: another ID tree consistent with the same data; not recoverable from the extraction.]
Slide 19: [Figure: the ID tree redrawn. Hair color? Blond: test Lotion used (No: Sarah, Annie; Yes: Dana, Katie). Red: Emily. Brown: Pete, John, Alex.]
Slide 20: Then among blonds
[Figure: Lotion used? No: Sarah, Annie (sunburned). Yes: Dana, Katie (fine).]
Slide 21: Combining these two together …
[Figure: the combined tree, with Hair color at the top level and Lotion used under the blond branch.]
Slide 22: Decision Tree Learning Algorithm
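The slide’s pseudocode did not survive extraction, so here is a hedged, ID3-style sketch of the algorithm the slide names: at each node, pick the test that minimizes the information still required (i.e., maximizes information gain), split the samples, and recurse. All identifiers are mine, not from the slide.

```python
import math
from collections import Counter

def entropy(labels):
    # Information content of a set of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def remainder(examples, attr):
    # Expected information still needed after testing `attr`,
    # weighted by branch size (the formula on slide 25).
    n = len(examples)
    total = 0.0
    for v in {ex[attr] for ex, _ in examples}:
        branch = [label for ex, label in examples if ex[attr] == v]
        total += len(branch) / n * entropy(branch)
    return total

def id3(examples, attrs):
    # examples: list of (feature_dict, label); attrs: feature names to test.
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                 # homogeneous node: make a leaf
        return labels[0]
    if not attrs:                             # no tests left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = min(attrs, key=lambda a: remainder(examples, a))
    branches = {}
    for v in {ex[best] for ex, _ in examples}:
        subset = [(ex, label) for ex, label in examples if ex[best] == v]
        branches[v] = id3(subset, [a for a in attrs if a != best])
    return (best, branches)

def classify(tree, x):
    while isinstance(tree, tuple):            # internal node: (attribute, branches)
        attr, branches = tree
        tree = branches[x[attr]]
    return tree
```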
Slide 24
• Suppose a question has n possible answers v1, …, vn, and that answer vi occurs with probability P(vi). The information content (entropy), measured in bits, of knowing the answer is:
I(P(v1), …, P(vn)) = -Σi P(vi) log2 P(vi)
• One bit of information is enough to answer a yes/no question.
• E.g., consider flipping a fair coin: how much information do you have once you know which side came up?
I(1/2, 1/2) = -(1/2 log2 1/2 + 1/2 log2 1/2) = 1 bit
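A quick numeric check of the formula above; `information` is my name for it, not the slide’s.

```python
import math

def information(probs):
    # I(P(v1), ..., P(vn)) = -sum_i P(vi) * log2 P(vi)
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(information([0.5, 0.5]))  # 1.0 bit: a fair coin flip
print(information([1.0]))       # 0.0 bits: a certain outcome carries no information
print(information([0.9, 0.1]))  # ~0.469 bits: a biased coin is more predictable
```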
Slide 25: Information at a node
• In our decision tree, for a given feature (e.g., hair color), we have:
– b: number of branches (e.g., the possible values of the feature)
– Nb: number of samples in branch b
– Np: number of samples across all branches
– Nbc: number of samples of class c in branch b
• Using frequencies as estimates of the probabilities, the information required after testing the feature is:
Information = Σb (Nb / Np) Σc -(Nbc / Nb) log2 (Nbc / Nb)
• For a single branch, the information is simply:
Information = Σc -(Nbc / Nb) log2 (Nbc / Nb)
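A small sketch of these two formulas in code, assuming each branch is given as a tuple of per-class counts Nbc; the function names are mine.

```python
import math

def branch_information(class_counts):
    # One branch: sum over classes of -(Nbc/Nb) * log2(Nbc/Nb).
    n_b = sum(class_counts)
    return -sum((c / n_b) * math.log2(c / n_b) for c in class_counts if c > 0)

def test_information(branches):
    # Weighted over branches: sum over b of (Nb/Np) * branch information.
    n_p = sum(sum(b) for b in branches)
    return sum((sum(b) / n_p) * branch_information(b) for b in branches)

print(branch_information((2, 2)))  # 1.0 bit: evenly mixed branch
print(branch_information((4, 0)))  # 0.0 bits: homogeneous branch
```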
Slide 26
• Consider a single branch (b = 1) that contains members of only two classes, A and B.
– If half of the samples belong to A and half to B:
Information = -(1/2) log2 (1/2) - (1/2) log2 (1/2) = 1 bit
– If all of the samples belong to A (or all to B):
Information = -1 log2 1 - 0 log2 0 = 0 bits
(where 0 log2 0 is taken to be 0, since x log2 x → 0 as x → 0)
• We prefer the latter situation: such branches are homogeneous, so less information is needed to make a decision (we want to maximize the information gain).
Slide 27: What is the amount of information required for classification after we have used the hair test?
[Figure: Hair color? Blond: Sarah, Annie, Dana, Katie. Red: Emily. Brown: Alex, Pete, John.]
• Blond branch (2 sunburned, 2 fine):
Information = -(2/4) log2 (2/4) - (2/4) log2 (2/4) = 1 bit
• Red branch (1 sunburned) and brown branch (3 fine) are homogeneous:
Information = -1 log2 1 - 0 log2 0 = 0 bits
• Weighting by branch size: (4/8)(1) + (1/8)(0) + (3/8)(0) = 0.5 bits
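To double-check the 0.5-bit figure numerically, here is a small self-contained sketch using the (sunburned, fine) counts from the figure above: blond = (2, 2), red = (1, 0), brown = (0, 3).

```python
import math

def info(counts):
    # Single-branch information from per-class counts.
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

branches = [(2, 2), (1, 0), (0, 3)]
n_total = sum(sum(b) for b in branches)               # 8 samples
after_hair = sum(sum(b) / n_total * info(b) for b in branches)
print(after_hair)  # 0.5 bits: 4/8 * 1 + 1/8 * 0 + 3/8 * 0
```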
Slide 28: Selecting top level feature
• Using the 8 samples we have so far, we compute the information required after each candidate test; the hair test needs 0.5 bits.
• Hair wins: it leaves the least additional information needed for the rest of the classification.
• This is used to build the first level of the identification tree:
[Figure: Hair color? Blond: Sarah, Annie, Dana, Katie. Red: Emily. Brown: Alex, Pete, John.]
Slide 29: Selecting second level feature
• Now consider the remaining features for the blond branch (4 samples: Sarah, Annie, Dana, Katie).
• The lotion test wins: both of its branches are homogeneous, so no additional information is needed.
Slide 30: Thus we get to the tree we arrived at before
[Figure: Hair color? Blond: test Lotion used (No: Sarah, Annie, sunburned; Yes: Dana, Katie, fine). Red: Emily, sunburned. Brown: Alex, Pete, John, fine.]
Slide 31: Using the identification tree as a set of rules
Rules:
• If blond and uses lotion, then OK.
• If blond and does not use lotion, then gets burned.
• If red-haired, then gets burned.
• If brown-haired, then OK.
Slide 32: Performance measurement
How do we know that h ≈ f?
1. Use theorems of computational/statistical learning theory.
2. Try h on a new test set of examples (use the same distribution over the example space as for the training set).
[Figure: learning curve, % correct on the test set as a function of training set size.]
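A minimal sketch of option 2, holdout evaluation: set aside part of the data as a test set drawn from the same distribution, and measure how often h agrees with f on it. Here `learn` is a placeholder for any learner (e.g., the ID3 sketch earlier), and all names are mine.

```python
import random

def evaluate(hypothesis, test_set):
    # Fraction of held-out (x, y) examples the hypothesis classifies correctly.
    correct = sum(hypothesis(x) == y for x, y in test_set)
    return correct / len(test_set)

def holdout(examples, learn, test_fraction=0.25, seed=0):
    # Shuffle, split into train/test, fit on train, score on test.
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * (1 - test_fraction))
    train, test = examples[:cut], examples[cut:]
    return evaluate(learn(train), test)
```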