VIETNAM GENERAL CONFEDERATION OF LABOR
TON DUC THANG UNIVERSITY
FACULTY OF INFORMATION TECHNOLOGY

THE MIDDLE-TERM ESSAY
INTRODUCTION TO MACHINE LEARNING
MACHINE LEARNING’S PROBLEMS

Instructor: MR. LE ANH CUONG
Students: LE QUANG DUY – 520H0529
          TRAN QUOC HUY – 520H0647
Class: 20H50204
Course: 24

HO CHI MINH CITY, 2022
MIDDLE-TERM ESSAY COMPLETED AT TON DUC THANG UNIVERSITY
I hereby declare that this is my own report, carried out under the guidance of Mr. Le Anh Cuong. The research contents and results in this topic are honest and have not been published in any form before. The data in the tables used for analysis, comments, and evaluation were collected by the author himself from different sources, as clearly stated in the reference section.
In addition, the report also uses a number of comments, assessments, and data from other authors, agencies, and organizations, with citations and source annotations.
If any fraud is found, I take full responsibility for the content of my report. Ton Duc Thang University is not responsible for any copyright violations caused by me during the implementation process (if any).
Ho Chi Minh City, 16 October 2022
Author
Le Quang Duy
TEACHER’S CONFIRMATION AND ASSESSMENT SECTION
Confirmation section of the instructors
_
Ho Chi Minh City, day month year (sign and write full name)
Evaluation section of the lecturer who grades the report
_
Ho Chi Minh City, day month year (sign and write full name)
In this report, we will discuss basic methods for machine learning.
In chapter 2, we will practice solving a classification problem with 3 different models (Naive Bayes, k-Nearest Neighbors, and Decision Tree) and compare these models based on the metrics: accuracy, precision, recall, f1-score for each class, and the weighted average of f1-score over all the data.
In chapter 3, we will discuss, implement, and visualize the Feature Selection problem and the way it (“correlation”) works.
In chapter 4, we will show the theory, the code implementation, and the code’s illustration for 2 optimization algorithms (Stochastic Gradient Descent and the Adam Optimization Algorithm).
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF DIAGRAMS, CHARTS, AND TABLES
CHAPTER 1: INTRODUCTION
In this report, we divide the work into 3 problems, presented over 4 chapters:
_In chapter 1, we will introduce the outline of the report.
_In chapter 2, we will present 3 models: Naive Bayes Classification, k-Nearest Neighbors, and Decision Tree. For each model, we will do a common preparation before training and testing. We split the data into 2 parts, training (75%) and testing (25%), and compare the models on the metrics: accuracy, precision, recall, f1-score, and weighted average of f1-score.
_In chapter 3, we will answer 2 questions, what it is and how it works: that is, we will present the theory of “correlation” in feature selection and solve the Boston house-pricing regression problem.
_In chapter 4, we will present the theory of the Adam and Stochastic Gradient Descent algorithms and show our code for each algorithm.
CHAPTER 2: PROBLEM 1
2.1 Common preparation for the 3 models:
_In this chapter, we will solve the problem with 3 models: Naive Bayes, k-Nearest Neighbors, and Decision Tree.
_We used the “iris” data set to evaluate the 3 models.
_First of all, we collect the data by reading the file “iris.data”:
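The reading code appears only as a screenshot in the source; the following is a minimal sketch of an equivalent step, assuming the standard UCI iris.data file (no header row, four measurements plus a class label) and column names of our own choosing:

```python
import pandas as pd

# iris.data has no header row: 4 numeric measurements followed by the class label
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
iris = pd.read_csv("iris.data", header=None, names=columns)

print(iris.head(10))   # the first 10 rows
print(iris.shape)      # number of rows and columns
```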
_Result after reading the file (first 10 rows):
_Shape of the data:
_Preparing the data before training:
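A minimal sketch of the preparation described below, assuming the DataFrame from the previous step; the variable names x_train, x_test, y_train, y_test follow the report, while random_state is our own addition for reproducibility:

```python
from sklearn.model_selection import train_test_split

# Features: the first 4 columns; target: the class label column
X = iris.iloc[:, :4].values
y = iris.iloc[:, 4].values

# 75% of the rows for training, 25% for testing
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```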
Description: We take 149 rows and the first 4 columns as features and split them into 75% for training and 25% for testing, stored in the variables x_train, x_test, y_train, and y_test.
2.2 Executing the models:
2.2.1 Naive Bayes model:
_ Training time: takes less than 1 second to train the data.
_ Predicting time: takes less than 1 second to predict the test data.
_ Checking error:
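The training, prediction, and error-checking code is shown only as screenshots in the source; the following sketch shows what these steps typically look like with scikit-learn, where the timing and the mismatch count are our own assumptions about how they were measured:

```python
import time
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

start = time.time()
model.fit(x_train, y_train)                 # training: well under a second on iris
print("Training time:", time.time() - start, "s")

start = time.time()
y_pred = model.predict(x_test)              # predicting: also well under a second
print("Predicting time:", time.time() - start, "s")

# Error check: count test samples whose prediction differs from the true label
print("Number of errors:", (y_pred != y_test).sum())
```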
_ Result after checking:
Conclusion: We found only 3 errors after running this model.
2.2.2 k-Nearest Neighbors model:
_ Training time: takes less than 1 second to train the data.
_ Predicting time: takes less than 1 second to predict the test data.
_ Checking error:
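The same sketch with the classifier swapped for k-Nearest Neighbors; the number of neighbors is not stated in the report, so scikit-learn's default is assumed:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 is scikit-learn's default
knn.fit(x_train, y_train)
y_pred_knn = knn.predict(x_test)
print("Number of errors:", (y_pred_knn != y_test).sum())
```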
_ Result after checking:
Conclusion: We found only 3 errors after running this model.
2.2.3 Decision Tree model:
_ Training time: takes less than 1 second to train the data.
_ Predicting time: takes less than 1 second to predict the test data.
_ Checking error:
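And the corresponding sketch for the Decision Tree; random_state is our own addition so the tree is reproducible:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=42)   # random_state fixed for reproducibility
tree.fit(x_train, y_train)
y_pred_tree = tree.predict(x_test)
print("Number of errors:", (y_pred_tree != y_test).sum())
```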
_ Result after checking:
Conclusion: We found only 3 errors after running this model.
2.3 Comparing:
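The reports in the subsections below can be produced with scikit-learn's classification_report; a minimal sketch, assuming the predictions y_pred, y_pred_knn, and y_pred_tree from the sketches in section 2.2:

```python
from sklearn.metrics import accuracy_score, classification_report

for name, preds in [("Naive Bayes", y_pred),
                    ("k-Nearest Neighbors", y_pred_knn),
                    ("Decision Tree", y_pred_tree)]:
    print(name)
    print("Accuracy:", accuracy_score(y_test, preds))
    # Per-class precision, recall, f1-score plus the weighted average of f1-score
    print(classification_report(y_test, preds))
```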
2.3.1 Reporting from Naive Bayes Model:
Conclusion: Weighted f1-score of data: 92%
2.3.2 Reporting from k-Nearest Neighbors Model:
Conclusion: Weighted f1-score of data: 92%
2.3.3 Reporting from Decision Tree Model:
Conclusion: Weighted f1-score of data: 87%
CHAPTER 3: PROBLEM 2
3.2 How does it work to help? [1]
Highly correlated features are more linearly dependent and hence have almost the same effect on the dependent variable. We can therefore exclude one of the two features when there is a substantial correlation between them.
For example, we used the Boston house-pricing data set, which is available in the scikit-learn library, for the analysis. After loading the data, we have:
We divided data into 2 sets: training (70%) and testing (30%):
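A minimal sketch of loading and splitting the data; load_boston shipped with scikit-learn when this report was written but was removed in scikit-learn 1.2, so an older version is assumed, and the variable names are our own:

```python
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

boston = load_boston()                      # removed in scikit-learn >= 1.2
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df["MEDV"] = boston.target                  # median house price, the target column

# 70% of the rows for training, 30% for testing
train_df, test_df = train_test_split(df, test_size=0.30, random_state=42)
```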
We used a “heatmap” to visualize the data:
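A sketch of the heatmap step, assuming seaborn and the training DataFrame from the previous sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = train_df.corr()                      # pairwise correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```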
As we can see, the number in each square is how strongly the two attributes correlate, so when it is high we can reject one of the two. In this instance, the “tax” column with the “rad” row is up to 0.91, meaning a correlation of 91%, so we can remove one of them from the data set. Commonly used thresholds range from 70% to 90%. In this situation, we used a threshold of 70% to reject unnecessary attributes.
Our correlation function:
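The report's function appears only as a screenshot; the following is a sketch of a function with the described behavior (collect one column from every pair whose correlation exceeds the 0.70 threshold), with names that are our own assumptions:

```python
def correlated_columns(df, threshold=0.70):
    """Return the set of column names to reject: for every pair of columns whose
    absolute correlation exceeds the threshold, one of the two is kept and the
    other is added to the rejected set."""
    rejected = set()
    corr = df.corr().abs()
    for i in range(len(corr.columns)):
        for j in range(i):
            if corr.iloc[i, j] > threshold:
                rejected.add(corr.columns[i])
    return rejected

to_drop = correlated_columns(train_df.drop(columns=["MEDV"]), threshold=0.70)
train_df = train_df.drop(columns=list(to_drop))
test_df = test_df.drop(columns=list(to_drop))
```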
We return the set of column names to reject and prepare our data set:
After rejecting, we dropped 3 attributes, leaving only 10 columns (13 columns before):
3.3 Solving the linear regression problem:
Finally, we solve this problem with linear regression:
Predicting values:
Checking the MAE test score:
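A sketch covering the regression fit, the predictions, and the MAE check on the reduced feature set, reusing the variable names assumed above:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

x_train = train_df.drop(columns=["MEDV"])
y_train = train_df["MEDV"]
x_test = test_df.drop(columns=["MEDV"])
y_test = test_df["MEDV"]

reg = LinearRegression()
reg.fit(x_train, y_train)

y_pred = reg.predict(x_test)                # predicted house prices
print("MAE on the test set:", mean_absolute_error(y_test, y_pred))
```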
CHAPTER 4: PROBLEM 3
4.1 Stochastic Gradient Descent Algorithm
4.1.1 Theory:
Stochastic Gradient Descent is especially useful when there are redundancies in the data.
As we can see, we have to fit a straight line, and we have a formula to predict a height:
Predicted Height = Intercept + Slope x Weight (1)
In this instance, we can see 3 clusters, and we can randomly choose intercept = 0 and slope = 1.
As we know, we can use a “Loss Function” to determine how well the line fits the data:
Sum of squared residuals = (Observed Height - Predicted Height)²   (2)
Substituting (1) into (2), we have:
Sum of squared residuals = (Observed Height - (Intercept + Slope x Weight))²
We have to calculate the derivative of the sum of squared residuals with respect to the intercept and slope:
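Written out explicitly by applying the chain rule to (2) with (1) substituted in, the two derivatives are:

d(Sum of squared residuals)/d Intercept = -2 x (Observed Height - (Intercept + Slope x Weight))
d(Sum of squared residuals)/d Slope = -2 x Weight x (Observed Height - (Intercept + Slope x Weight))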
We can randomly pick 1 sample to calculate the derivative. With the sample (Weight = 3, Observed Height = 3.3):
d(Sum of squared residuals)/d Intercept = -2 x (3.3 - (0 + 1 x 3)) = -0.6
d(Sum of squared residuals)/d Slope = -2 x 3 x (3.3 - (0 + 1 x 3)) = -1.8
We can easily calculate the step size to improve the line:
Step size (intercept) = d(Sum of squared residuals)/d Intercept x learning rate
Step size (slope) = d(Sum of squared residuals)/d Slope x learning rate
We start with a relatively large learning rate and make it smaller with each step.
In this example, we chose 0.01 for the learning rate:
Step size (intercept) = d(Sum of squared residuals)/d Intercept x learning rate = -0.6 x 0.01 = -0.006
Step size (slope) = d(Sum of squared residuals)/d Slope x learning rate = -1.8 x 0.01 = -0.018
=> New intercept = Old intercept - Step size (intercept) = 0 - (-0.006) = 0.006
=> New slope = Old slope - Step size (slope) = 1 - (-0.018) = 1.018
We had a new line:
We iterate, picking another random sample each time for the calculation; once the loss function is less than 0.001, we can stop:
Having a new line:
Repeating the step above, we have a new line:
And we can stop at intercept = 0.85 and slope = 0.68 in this instance:
When a new sample is added, we use it and repeat the steps above to create a new line which fits the data:
Result after adding new data point:
4.1.2 Show code:
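The report's code appears only as a screenshot; below is a minimal, self-contained sketch of stochastic gradient descent for the straight-line model of section 4.1.1, using made-up (weight, height) data and the update rules derived above:

```python
import random

# Toy (weight, height) samples; purely illustrative values
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2), (3.0, 3.3), (4.1, 4.0), (5.2, 4.5)]

intercept, slope = 0.0, 1.0          # starting line, as in the worked example
learning_rate = 0.01

for step in range(10_000):
    weight, height = random.choice(data)            # pick one random sample
    residual = height - (intercept + slope * weight)
    d_intercept = -2 * residual                     # derivative w.r.t. the intercept
    d_slope = -2 * weight * residual                # derivative w.r.t. the slope
    intercept -= learning_rate * d_intercept        # subtract the step sizes
    slope -= learning_rate * d_slope
    if abs(d_intercept) < 0.001 and abs(d_slope) < 0.001:
        break                                       # updates are tiny: stop

print("intercept =", intercept, "slope =", slope)
```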
4.2 Adam Optimization Algorithm
4.2.1 Theory:
The Adam Optimization Algorithm, also known as Adaptive Moment Estimation, is a method for stochastic optimization. It is a kind of gradient-descent optimization for machine learning (neural networks, etc.); Adam was created to improve the learning rate for machine learning.
Adam is a stochastic objective-function optimization algorithm based on first-order gradients and adaptive estimation of low-order moments. It is a very efficient method when only first-order gradients are required, with low memory requirements. This method is also suitable for problems with unstable variability and fragmented training data.
Pseudo code for Adam Algorithm:
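The pseudocode appears only as an image in the source; it is restated here following the original paper (Kingma & Ba, 2015), where alpha is the step size, beta1 and beta2 are the exponential decay rates for the moment estimates, and f(theta) is the stochastic objective:

```
Require: alpha (step size), beta1, beta2 in [0, 1) (decay rates), epsilon, f(theta), theta_0 (initial parameters)
m_0 <- 0 (first moment), v_0 <- 0 (second moment), t <- 0
while theta_t not converged do
    t <- t + 1
    g_t <- gradient of f_t at theta_{t-1}
    m_t <- beta1 * m_{t-1} + (1 - beta1) * g_t        (update biased first moment estimate)
    v_t <- beta2 * v_{t-1} + (1 - beta2) * g_t^2      (update biased second moment estimate)
    m_hat_t <- m_t / (1 - beta1^t)                    (bias-corrected first moment)
    v_hat_t <- v_t / (1 - beta2^t)                    (bias-corrected second moment)
    theta_t <- theta_{t-1} - alpha * m_hat_t / (sqrt(v_hat_t) + epsilon)
end while
return theta_t
```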
Note that we can improve the above algorithm by changing the order of computation:
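As noted in the original Adam paper, the reordering folds the bias corrections into the step size, so the last three lines of the loop become (with the small constant epsilon rescaled accordingly):

alpha_t = alpha * sqrt(1 - beta2^t) / (1 - beta1^t)
theta_t <- theta_{t-1} - alpha_t * m_t / (sqrt(v_t) + epsilon_hat)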
4.2.2 Show code:
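A minimal sketch of the Adam update in plain Python/NumPy, applied to the same straight-line regression as section 4.1; the toy data and hyperparameter values (the paper's suggested defaults) are our own assumptions:

```python
import numpy as np

# Toy data: height is roughly intercept + slope * weight
weights = np.array([0.5, 2.3, 2.9, 3.0, 4.1, 5.2])
heights = np.array([1.4, 1.9, 3.2, 3.3, 4.0, 4.5])

theta = np.zeros(2)                          # [intercept, slope]
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
m = np.zeros(2)                              # first moment estimate
v = np.zeros(2)                              # second moment estimate

for t in range(1, 5001):
    residual = heights - (theta[0] + theta[1] * weights)
    # Gradient of the sum of squared residuals w.r.t. [intercept, slope]
    grad = np.array([-2 * residual.sum(), -2 * (weights * residual).sum()])

    m = beta1 * m + (1 - beta1) * grad       # update biased first moment
    v = beta2 * v + (1 - beta2) * grad ** 2  # update biased second moment
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)

print("intercept =", theta[0], "slope =", theta[1])
```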
REFERENCES
[1] https://www.kaggle.com/code/bbloggsbott/feature-selection-correlation-and-p-value/notebook
[2] https://www.phamduytung.com/blog/2021-01-15-adabelief-optimizer/