Naive Bayes Classification (NBC) is a classification algorithm based on probability calculations that applies Bayes' theorem, which we covered in the previous post. The algorithm belongs to the Supervised Learning family. By Bayes' theorem, the probability of event $y$ given $x$ is:

$$P(y|x) = \frac{P(x|y)P(y)}{P(x)} \quad (1)$$

Suppose we decompose an event $x$ into $n$ components $x_1, x_2, \dots, x_n$. Naive Bayes, true to its name, relies on the naive assumption that $x_1, x_2, \dots, x_n$ are mutually independent. We can then compute:

$$P(x|y) = P(x_1 \cap x_2 \cap \dots \cap x_n \mid y) = P(x_1|y)\,P(x_2|y)\dots P(x_n|y) \quad (2)$$

Therefore:

$$P(y|x) \propto P(y) \prod_{i=1}^{n} P(x_i|y) \quad (3)$$
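As a quick numeric illustration of formula (3), with made-up probabilities (the numbers below are invented for the example, not taken from any data set):

```python
# Made-up values: prior P(y) and conditional probabilities P(x_i | y)
p_y = 0.5
p_xi_given_y = [0.2, 0.9, 0.4]

# Formula (3): a score proportional to P(y | x)
score = p_y
for p in p_xi_given_y:
    score *= p

print(score)  # approximately 0.036
```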
Gaussian Naive Bayes Classifier: Iris Data Set
Nguyen Van Hai, Nguyen Tien Dung, Dao Anh Huy, Luu Thanh Duy
Ton Duc Thang University
Iris Data Set
The data set has 4 independent variables and 1 dependent variable with 3 different classes, across 150 instances
- The first 4 columns are the independent variables (features)
- The 5th column is the dependent variable (class)
For example:
Figure: Random 5 Row Sample
Bayes' theorem applied to this problem:

P(class|features) = P(features|class) P(class) / P(features)

- P(class|features): Posterior Probability
- P(class): Class Prior Probability
- P(features|class): Likelihood
- P(features): Predictor Prior Probability
The Normal (Gaussian) distribution:

f(x) = (1 / (σ√(2π))) · exp(−(x − µ)² / (2σ²))

- µ is the mean or expectation of the distribution,
- σ is the standard deviation, and
- σ² is the variance
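The density defined by these two parameters can be sketched in plain Python (the function name `normal_pdf` is our own):

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density of x given mean mu and std-dev sigma."""
    exponent = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return exponent / (sigma * math.sqrt(2 * math.pi))

# The density peaks at the mean, where it equals 1 / (sigma * sqrt(2*pi)).
```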
Prepare Data
Load Data
Read in the raw data and convert each class string into an integer
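A minimal loading sketch, assuming the usual iris.csv layout (four numeric feature columns followed by a class-name string); the function name `load_csv` and the first-seen integer encoding are our own choices:

```python
import csv

def load_csv(path):
    """Read the raw CSV: 4 float features + 1 class string per row.
    Class strings are converted to integers via a lookup table."""
    rows, class_to_int = [], {}
    with open(path, newline="") as f:
        for line in csv.reader(f):
            if not line:
                continue
            features = [float(v) for v in line[:4]]
            label = line[4]
            if label not in class_to_int:
                class_to_int[label] = len(class_to_int)  # first-seen order
            rows.append(features + [class_to_int[label]])
    return rows, class_to_int
```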
Split Data
Split the data into a training set and a testing set
The weight will determine how much of the data will be in the training set
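A sketch of the split, with `weight` as the fraction of rows kept for training (the shuffle, the fixed seed, and the name `split_data` are our own choices):

```python
import random

def split_data(rows, weight, seed=1):
    """Randomly split rows; `weight` is the fraction placed in the training set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * weight)
    return shuffled[:cut], shuffled[cut:]
```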
Group Data
Group the data according to class by mapping each class to individual instances
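The grouping step can be sketched as a dictionary from class label to feature rows (the name `group_by_class` is our own, assuming the class is the last column as above):

```python
def group_by_class(rows):
    """Map each class value to the list of its instances (features only)."""
    groups = {}
    for row in rows:
        features, label = row[:-1], row[-1]
        groups.setdefault(label, []).append(features)
    return groups
```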
Summarize Data
Mean
Calculate the mean
Standard Deviation
Calculate the standard deviation
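These two statistics can be sketched in plain Python; we assume the sample standard deviation (n − 1 denominator), one common choice:

```python
import math

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def stdev(values):
    """Sample standard deviation (n - 1 denominator)."""
    avg = mean(values)
    variance = sum((x - avg) ** 2 for x in values) / (len(values) - 1)
    return math.sqrt(variance)
```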
Build Model
Overview: Features and Class
Overview: Bayes Tree Diagram
Train
This is where we learn from the training set, by calculating the mean and the standard deviation
Using the grouped classes, calculate the (mean, standard deviation) pair for each feature of each class
The calculations will later use the (mean, standard deviation) of each feature to calculate class likelihoods
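The training step can be sketched as follows, assuming `groups` maps each class to its list of feature rows as in the grouping step (the helper names are our own):

```python
import math

def mean(values):
    return sum(values) / len(values)

def stdev(values):
    avg = mean(values)
    return math.sqrt(sum((x - avg) ** 2 for x in values) / (len(values) - 1))

def train(groups):
    """For each class, compute a (mean, stdev) pair per feature column."""
    summaries = {}
    for label, instances in groups.items():
        # zip(*instances) transposes the rows into feature columns
        summaries[label] = [(mean(col), stdev(col)) for col in zip(*instances)]
    return summaries
```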
Likelihood
Likelihood is calculated by taking the product of all Normal Probabilities:

P(features|class)

For each feature, given the class, we calculate the Normal Probability using the Normal Distribution:

P(Sl|S) P(Sw|S) P(Pl|S) P(Pw|S)

(Here S is the Setosa class; Sl, Sw, Pl, Pw are the sepal-length, sepal-width, petal-length, and petal-width features.)
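A sketch of the likelihood computation, assuming each class summary is a list of per-feature (mean, stdev) pairs as produced in the training step (function names are our own):

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density of x given mean mu and std-dev sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(features, class_summary):
    """P(features|class): product of each feature's normal probability."""
    prob = 1.0
    for x, (mu, sigma) in zip(features, class_summary):
        prob *= normal_pdf(x, mu, sigma)
    return prob
```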
Joint Probability
Joint Probability is calculated by taking the product of the Prior Probability and the Likelihood:

P(S) P(Sl|S) P(Sw|S) P(Pl|S) P(Pw|S)
Marginal Probability
Calculate the total sum of all joint probabilities:

Marginal = P(S)P(Sl|S)P(Sw|S)P(Pl|S)P(Pw|S)
         + P(Ve)P(Sl|Ve)P(Sw|Ve)P(Pl|Ve)P(Pw|Ve)
         + P(Vi)P(Sl|Vi)P(Sw|Vi)P(Pl|Vi)P(Pw|Vi)
Posterior Probability
The Posterior Probability is the probability of a class occurring, and is calculated for each class given the new data:

P(class|features)

This is where all of the preceding class methods tie together to calculate the Gaussian Naive Bayes formula, with the goal of selecting the MAP
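The joint, marginal, and posterior steps can be tied together in one sketch; `posterior_probabilities` below is our own name, and it assumes a dict mapping each class to its joint probability (prior × likelihood) as computed on the preceding slides:

```python
def posterior_probabilities(joints):
    """Given each class's joint probability (prior * likelihood),
    divide by the marginal (their sum) to get the posterior per class."""
    marginal = sum(joints.values())
    return {label: joint / marginal for label, joint in joints.items()}
```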
Test Model
Get Maximum A Posteriori
The get_best_posterior_probability() method will call the posterior_probability() method on a single test row
For each test row we will calculate 3 Posterior Probabilities, one for each class. The goal is to select the MAP, the Maximum A Posteriori probability
The get_best_posterior_probability() method will simply choose the Maximum A Posteriori probability and return the associated class for the given test row
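A minimal sketch of the MAP selection, assuming `posteriors` maps each class to its posterior probability:

```python
def get_best_posterior_probability(posteriors):
    """Return the class whose posterior probability is largest (the MAP class)."""
    return max(posteriors, key=posteriors.get)
```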
Predict
This method will return a prediction for each test row
Accuracy
Accuracy will test the performance of the model by taking the total number of correct predictions and dividing it by the total number of predictions. This is critical in understanding the veracity of the model
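The accuracy check can be sketched as follows, assuming the true class is the last column of each test row, as in the earlier data layout:

```python
def accuracy(test_rows, predictions):
    """Fraction of predictions matching the true class (last column of each row)."""
    correct = sum(1 for row, pred in zip(test_rows, predictions) if row[-1] == pred)
    return correct / len(test_rows)
```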
Code
Print the results
THE END