Naive Bayes Classification (NBC) is a classification algorithm based on probability calculations that applies Bayes' theorem, which we covered in the previous post. The algorithm belongs to the Supervised Learning family. By Bayes' theorem, the probability of event $y$ given $x$ is:

$$P(y|x) = \frac{P(x|y)P(y)}{P(x)} \quad (1)$$

Suppose we decompose an event $x$ into $n$ components $x_1, x_2, \dots, x_n$. Naive Bayes, true to its name, relies on the naive assumption that $x_1, x_2, \dots, x_n$ are mutually independent. We can then compute:

$$P(x|y) = P(x_1 \cap x_2 \cap \dots \cap x_n \mid y) = P(x_1|y)\,P(x_2|y)\dots P(x_n|y) \quad (2)$$

Therefore:

$$P(y|x) \propto P(y) \prod_{i=1}^{n} P(x_i|y) \quad (3)$$
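As a quick numeric illustration of formula (3), with made-up probabilities (the numbers below are invented for the example, not taken from any data set):

```python
# Made-up values: prior P(y) and conditional probabilities P(x_i | y)
p_y = 0.5
p_xi_given_y = [0.2, 0.9, 0.4]

# Formula (3): a score proportional to P(y | x)
score = p_y
for p in p_xi_given_y:
    score *= p

print(score)  # approximately 0.036
```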
Gaussian Naive Bayes Classifier: Iris Data Set
Nguyen Van Hai, Nguyen Tien Dung, Dao Anh Huy, Luu Thanh Duy
Ton Duc Thang University
Iris Data Set
The data set has 4 independent variables and 1 dependent variable with 3 different classes, across 150 instances
- The first 4 columns are the independent variables (features)
- The 5th column is the dependent variable (class)
For example:
Figure: Random 5 Row Sample
Bayes' theorem applied to this problem:

P(class|features) = P(features|class) P(class) / P(features)

- P(class|features): Posterior Probability
- P(class): Class Prior Probability
- P(features|class): Likelihood
- P(features): Predictor Prior Probability
The Normal (Gaussian) distribution:

f(x) = (1 / (σ√(2π))) · exp(−(x − µ)² / (2σ²))

- µ is the mean or expectation of the distribution,
- σ is the standard deviation, and
- σ² is the variance
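The density defined by these two parameters can be sketched in plain Python (the function name `normal_pdf` is our own):

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density of x given mean mu and std-dev sigma."""
    exponent = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return exponent / (sigma * math.sqrt(2 * math.pi))

# The density peaks at the mean, where it equals 1 / (sigma * sqrt(2*pi)).
```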
Prepare Data
Load Data
Read in the raw data and convert each class string into an integer
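A minimal loading sketch, assuming the usual iris.csv layout (four numeric feature columns followed by a class-name string); the function name `load_csv` and the first-seen integer encoding are our own choices:

```python
import csv

def load_csv(path):
    """Read the raw CSV: 4 float features + 1 class string per row.
    Class strings are converted to integers via a lookup table."""
    rows, class_to_int = [], {}
    with open(path, newline="") as f:
        for line in csv.reader(f):
            if not line:
                continue
            features = [float(v) for v in line[:4]]
            label = line[4]
            if label not in class_to_int:
                class_to_int[label] = len(class_to_int)  # first-seen order
            rows.append(features + [class_to_int[label]])
    return rows, class_to_int
```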
Split Data
Split the data into a training set and a testing set
The weight will determine how much of the data will be in the training set
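A sketch of the split, with `weight` as the fraction of rows kept for training (the shuffle, the fixed seed, and the name `split_data` are our own choices):

```python
import random

def split_data(rows, weight, seed=1):
    """Randomly split rows; `weight` is the fraction placed in the training set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * weight)
    return shuffled[:cut], shuffled[cut:]
```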
Group Data
Group the data according to class by mapping each class to individual instances
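The grouping step can be sketched as a dictionary from class label to feature rows (the name `group_by_class` is our own, assuming the class is the last column as above):

```python
def group_by_class(rows):
    """Map each class value to the list of its instances (features only)."""
    groups = {}
    for row in rows:
        features, label = row[:-1], row[-1]
        groups.setdefault(label, []).append(features)
    return groups
```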
Summarize Data
Mean
Calculate the mean
Standard Deviation
Calculate the standard deviation
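These two statistics can be sketched in plain Python; we assume the sample standard deviation (n − 1 denominator), one common choice:

```python
import math

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def stdev(values):
    """Sample standard deviation (n - 1 denominator)."""
    avg = mean(values)
    variance = sum((x - avg) ** 2 for x in values) / (len(values) - 1)
    return math.sqrt(variance)
```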
Build Model
Overview: Features and Class
Overview: Bayes Tree Diagram
Train
This is where we learn from the training set, by calculating the mean and the standard deviation
Using the grouped classes, calculate the (mean, standard deviation) pair for each feature of each class
The calculations will later use the (mean, standard deviation) of each feature to calculate class likelihoods
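The training step can be sketched as follows, assuming `groups` maps each class to its list of feature rows as in the grouping step (the helper names are our own):

```python
import math

def mean(values):
    return sum(values) / len(values)

def stdev(values):
    avg = mean(values)
    return math.sqrt(sum((x - avg) ** 2 for x in values) / (len(values) - 1))

def train(groups):
    """For each class, compute a (mean, stdev) pair per feature column."""
    summaries = {}
    for label, instances in groups.items():
        # zip(*instances) transposes the rows into feature columns
        summaries[label] = [(mean(col), stdev(col)) for col in zip(*instances)]
    return summaries
```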
Likelihood
Likelihood is calculated by taking the product of all Normal Probabilities:

P(features|class)

For each feature, given the class, we calculate the Normal Probability using the Normal Distribution:

P(Sl|S) P(Sw|S) P(Pl|S) P(Pw|S)

(Here S is the Setosa class; Sl, Sw, Pl, Pw are the sepal-length, sepal-width, petal-length, and petal-width features.)
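A sketch of the likelihood computation, assuming each class summary is a list of per-feature (mean, stdev) pairs as produced in the training step (function names are our own):

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density of x given mean mu and std-dev sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(features, class_summary):
    """P(features|class): product of each feature's normal probability."""
    prob = 1.0
    for x, (mu, sigma) in zip(features, class_summary):
        prob *= normal_pdf(x, mu, sigma)
    return prob
```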
Joint Probability
Joint Probability is calculated by taking the product of the Prior Probability and the Likelihood:

P(S) P(Sl|S) P(Sw|S) P(Pl|S) P(Pw|S)
Marginal Probability
Calculate the total sum of all joint probabilities:

Marginal = P(S)P(Sl|S)P(Sw|S)P(Pl|S)P(Pw|S)
         + P(Ve)P(Sl|Ve)P(Sw|Ve)P(Pl|Ve)P(Pw|Ve)
         + P(Vi)P(Sl|Vi)P(Sw|Vi)P(Pl|Vi)P(Pw|Vi)
Posterior Probability
The Posterior Probability is the probability of a class occurring, and is calculated for each class given the new data:

P(class|features)

This is where all of the preceding class methods tie together to calculate the Gaussian Naive Bayes formula, with the goal of selecting the MAP
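The joint, marginal, and posterior steps can be tied together in one sketch; `posterior_probabilities` below is our own name, and it assumes a dict mapping each class to its joint probability (prior × likelihood) as computed on the preceding slides:

```python
def posterior_probabilities(joints):
    """Given each class's joint probability (prior * likelihood),
    divide by the marginal (their sum) to get the posterior per class."""
    marginal = sum(joints.values())
    return {label: joint / marginal for label, joint in joints.items()}
```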
Test Model
Get Maximum A Posteriori
The get_best_posterior_probability() method will call the posterior_probability() method on a single test row
For each test row we will calculate 3 Posterior Probabilities, one for each class. The goal is to select the MAP, the Maximum A Posteriori probability
The get_best_posterior_probability() method will simply choose the Maximum A Posteriori probability and return the associated class for the given test row
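A minimal sketch of the MAP selection, assuming `posteriors` maps each class to its posterior probability:

```python
def get_best_posterior_probability(posteriors):
    """Return the class whose posterior probability is largest (the MAP class)."""
    return max(posteriors, key=posteriors.get)
```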
Predict
This method will return a prediction for each test row
Accuracy
Accuracy will test the performance of the model by taking the total number of correct predictions and dividing it by the total number of predictions. This is critical in understanding the veracity of the model
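The accuracy check can be sketched as follows, assuming the true class is the last column of each test row, as in the earlier data layout:

```python
def accuracy(test_rows, predictions):
    """Fraction of predictions matching the true class (last column of each row)."""
    correct = sum(1 for row, pred in zip(test_rows, predictions) if row[-1] == pred)
    return correct / len(test_rows)
```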
Code
Print the results
THE END