VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
GRADUATION PROJECT
PROJECT NAME
APPLYING RANDOM FOREST AND NEURAL NETWORK MODEL
TO PREDICT CUSTOMERS’ BEHAVIORS
Student’s name
NGUYEN HUONG LY
Hanoi - Year 2020
SUPERVISOR: Dr Tran Duc Quynh
STUDENT: Nguyen Huong Ly
STUDENT ID: 16071293
COHORT:
MAJOR: MIS2016A
Hanoi - Year 2020
LETTER OF DECLARATION
I hereby declare that the Graduation Project APPLYING RANDOM FOREST AND NEURAL NETWORK MODEL TO PREDICT CUSTOMERS' BEHAVIORS is the result of my own research and has never been published in any work of others. During the implementation of this project, I have seriously observed research ethics; all findings of this project are the results of my own research and surveys; all references in this project are clearly cited according to regulations.
I take full responsibility for the fidelity of the numbers, data, and other contents of my graduation project.
Hanoi, 4th June 2020
Student
Nguyen Huong Ly
ACKNOWLEDGEMENT
Firstly, I would like to express my sincere appreciation to Dr. Tran Duc Quynh. I am proud and honored to have been guided and helped to finish my graduation thesis under his supervision.
Secondly, I would also like to express my gratitude to the teachers and professors who have given me the knowledge and skills needed to finish my graduation thesis.
Last but not least, sincere thanks to my family and friends, who always stayed by my side and encouraged me to overcome challenges during the process of writing this graduation thesis.
Hanoi, 4th June 2020
Student
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION TO MACHINE LEARNING
1.1 Definition
1.2 Application
1.3 Classification
1.4 Advantages and disadvantages
1.5 Accuracy in machine learning
CHAPTER 2 THEORETICAL BACKGROUND
2.1 Decision tree
2.1.1 Definition
2.1.2 Decision tree graph description
2.1.3 Induction
2.1.4 Advantages and disadvantages
2.2 Random forest
2.2.1 Definition
2.2.2 Why random forest is better than decision tree
2.2.3 How random forest works
2.2.4 Advantages and disadvantages
2.2.5 Application
2.3 Multilayer perceptron
2.3.1 Definition
2.3.2 How multilayer perceptron works
2.3.3 Advantages and disadvantages
2.3.4 Application
CHAPTER 3 CASE STUDY
3.1 Problem
3.1.1 Problem statement
3.1.2 Data description
3.2 Tools introduction
3.2.1 Python
3.2.2 Packages
3.3 Problem solving
3.3.1 Data insight
3.3.2 Data preprocessing
3.3.3 Model and result
3.3.4 Conclusion
REFERENCES
APPENDIX
TABLE OF NOTATIONS AND ABBREVIATIONS
LIST OF TABLES
Table 1 Confusion matrix
Table 2 Random forest results
Table 3 MLP results
Table 4 Oversampling Random Forest
Table 5 Oversampling MLP
Table 6 Under-sampling Random Forest
Table 7 Under-sampling MLP
LIST OF CHARTS AND FIGURES
Figure 1 Decision tree graph description
Figure 2 Nominal test condition
Figure 3 Continuous test condition
Figure 4 Continuous test condition
Figure 5 Best split
Figure 6 Identify best split
Figure 7 Stop splitting
Figure 8 Stop splitting
Figure 9 Multilayer perceptron
Figure 10 General description
Figure 11 Number of responses "Yes" and "No"
Figure 12 Relationship of renew offer and response
ABSTRACT
The overall purpose of this thesis is to apply machine learning, specifically random forest and the multilayer perceptron, to solve a realistic problem. The thesis consists of three parts: "Introduction to machine learning", which briefly introduces the concept of machine learning and its applications; "Theoretical background", which presents the classifiers used in solving the problem; and lastly "Case study", which applies all the above theories to a real-life problem. From the results, some important conclusions can be drawn, such as interesting insights into the dataset and how to build the best possible prediction model.
CHAPTER 1 INTRODUCTION TO MACHINE LEARNING
1.1 Definition
Machine learning is an application of AI technology and can be defined as the study of machines that have the ability to automatically learn and improve from fed training data without the need for constant programming or human interference [1]. Basically, machine learning uses algorithms to solve classification or regression problems by developing an automated model to predict a phenomenon based on a provided set of data. Besides giving predictions, by using machine learning and its applications, users can gain better insights into data, identify and distinguish similar patterns, and determine the elements that most affect the end results. This application is especially important when experts need to find out which elements or variables drive the phenomenon. For example, when analyzing a marketing campaign, rather than identifying which campaign brings more profit, it is more useful to determine which attributes drive the success of the campaign. Using the identified attributes, managers can get a better idea of how and why customers respond to a marketing campaign and improve the company's next strategies.
1.2 Application
Machine learning implementation expands every day as industries need to process vast amounts of data. It helps enterprises gain better insight into and understanding of acquired data in order to perform more effectively. There are many fields in which machine learning has proved its usefulness, namely [17]:
Finance: Machine learning is mostly used to provide insight and detect fraudulent transactions. It can help investors identify valuable opportunities or decide when to trade; importantly, it can determine which profiles are high risk or which transactions are likely to be fraudulent.
Healthcare: As mentioned, machine learning helps experts improve diagnosis and treatment, especially in areas requiring detailed and complex data such as genetics or brain cells.
Retail: Machine learning has recently shown its importance in the retail industry. Its implementation can increase sales by analyzing purchasing history, giving relevant recommendations, and personalizing the shopping experience.
Others: Machine learning applications are common in day-to-day life, including virtual assistants (Siri, Google Assistant, Alexa, etc.), face recognition and "people you may know" on Facebook, spam filtering in email, etc.
1.3 Classification
Supervised learning: The algorithm learns from past labeled data by comparing its outcomes with the actual end results; the model is then modified accordingly to provide the most correct outcome possible when applied to a similar unknown data set [17][18].
Unsupervised learning: The algorithm is applied to a set of plain data without associated outcomes, leaving it to figure out hidden patterns or valuable information by itself. It is mostly used to identify meaning and insight in data, for instance, to determine customer segments to be treated similarly in a marketing campaign [17][18].
Semi-supervised learning: The algorithm is applied to a data set consisting of both labeled and unlabeled data, as acquiring fully labeled data is costly and consumes time and effort. Semi-supervised learning is used for the same purposes as the supervised kind [17][18].
Reinforcement learning: Using unlabeled data like unsupervised learning, the algorithm in reinforcement learning uses trial and error to determine the greatest reward. It is mostly applied in robotics, video games, etc. [17][18]
1.4 Advantages and disadvantages
1.4.1 Advantages
Recently, machine learning has gained more fame and been implemented in ever more fields due to its vast advantages [36][28].
Automation: As mentioned above, the most useful advantage of machine learning is its ability to self-learn. Using developed algorithms, machine learning can constantly analyze real-time data and provide the required outcomes effectively. However, users cannot rely on it completely; different implementations require different algorithms and preprocessing steps.
Application: Nowadays, many fields apply machine learning in their business processes, namely medicine, finance, meteorology, science, technology, etc. Its wide variety of applications makes it worth investing in.
Data handling: Along with the development of the world comes an increase in the scope of data. The amount of data that needs to be analyzed, together with the requirements for time and accuracy, makes the task difficult, not to say impossible, for human cognitive ability. On the other hand, machine learning can analyze enormous, real-time data sets and provide accurate outcome variables.
1.4.2 Disadvantages
Algorithm: Despite its self-learning ability, machine learning cannot be automated from the beginning. The selection of an algorithm remains a manual task that requires an investment of time and effort to achieve the best possible outcome.
Data: The application of machine learning is based on data; it needs data to be able to function. However, when applying machine learning to an enormous set of data, especially real-time data that is updated constantly, users may face the problem of data inconsistency, which is not a good sign for the developed algorithms.
1.5 Accuracy in machine learning
As mentioned above, machine learning was born to predict phenomena using a set of collected data. The predictions are applied in many realistic fields such as business, finance, medicine, etc. that require relatively high accuracy in order to function properly. One way of measuring prediction model performance in machine learning is using a metric called the confusion matrix.
The confusion matrix is defined as "a performance measurement for machine learning classification problem where output can be two or more classes" [22]. After building a prediction model, the user can evaluate the accuracy level once again and get an overview of the model's performance using the confusion matrix.
                          Predicted No (Negative)    Predicted Yes (Positive)
Actual No (Negative)      True negative              False positive
Actual Yes (Positive)     False negative             True positive
Table 1 Confusion matrix
From this matrix, there are four features that need to be considered [3]:
True positive (TP): True positive is the count of correctly predicted positive responses, which means that the model predicts "Yes" and the actual outcome is also "Yes".
True negative (TN): True negative is the contrast to true positive; it counts the correctly predicted negative responses, which means the model predicts "No" and the actual outcome is also "No".
False positive (FP): False positive counts the wrong predictions where the model predicts "Yes" but the actual value is "No".
False negative (FN): False negative also counts wrong predictions, but where the model predicts "No" and the actual value is "Yes".
These four concepts are extremely important to remember and understand, as they will be used to measure the effectiveness and accuracy of a model.
There are also four important calculated values used to evaluate the accuracy level of a prediction model, namely [34]:
Accuracy: As the name suggests, the accuracy score measures the accuracy of a prediction model, i.e. whether it gives correct predictions or not. It is easy to understand and is often the first method a user thinks of when reviewing a model. It is simply the ratio of correctly predicted responses to total responses. Although it is easy to use, the accuracy score is more effective on a balanced dataset ("Yes" and "No" responses are almost equal) and may not mean much on an imbalanced dataset.
Precision: The precision score is the ratio of correctly predicted positive responses to total predicted positive responses. More specifically, it asks: out of those the model predicts as positive (in this case, "Yes"), how many actually responded positive ("Yes")? The precision score falls into the range [0;1]; the higher the score, the better the model.
Recall: The recall score is the ratio of correctly predicted positive responses to total actual positive responses. In this example, the question is: among those who actually responded "Yes", how many were captured by the model? Similar to the precision score, the recall score also ranges over [0;1], and the higher the score, the better.
F1 score: The F1 score takes both recall and precision into consideration. Beginners may find it difficult to understand the F1 score deeply; however, while the accuracy score works best with a balanced dataset, the F1 score proves its effectiveness on an imbalanced dataset.
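The four scores above can be computed directly from the confusion matrix counts. The following sketch is not taken from the thesis; the label lists are made up for illustration, with 1 standing for "Yes" and 0 for "No".

```python
# Toy labels: 1 = "Yes" (positive), 0 = "No" (negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion matrix counts, as defined in the table above
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy  = (tp + tn) / len(y_true)   # correct predictions / total
precision = tp / (tp + fp)            # of predicted "Yes", how many were right
recall    = tp / (tp + fn)            # of actual "Yes", how many were captured
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R

print(tp, tn, fp, fn)                  # 3 3 1 1
print(accuracy, precision, recall, f1) # 0.75 0.75 0.75 0.75
```

In practice the same values come from `sklearn.metrics` (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`), but the hand calculation shows exactly which cells of the confusion matrix each score uses.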
CHAPTER 2 THEORETICAL BACKGROUND
2.1 Decision tree
2.1.1 Definition
Unlike other algorithms in the same supervised learning category, the decision tree is outstanding in that it can be applied to both classification and regression problems.
Classification problems are those demanding categorical output variables. For example, when users try to predict whether a product price is below or above customers' expected level, it can be considered a classification problem.
Regression problems work with numerical or continuous data. Their output variables are often predictions about prices, growth percentages, etc. A specific example of a regression problem is when users are asked to predict house prices, given a related house price dataset.
Basically, a decision tree is used to train a machine so that it can predict output variables, which can be either categorical or numerical values, based on rules learned from the provided data.
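As a small illustration of this idea, the sketch below (assuming scikit-learn is installed; the numeric encoding of the features is a hypothetical choice, not from the thesis) trains a decision tree on the go-out-or-stay-home example discussed in the next section:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoding: weather 0 = sunny, 1 = rainy; wind 0 = weak, 1 = strong
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [1, 1, 1, 0]   # label 1 = "go out and play", 0 = "stay home": go out unless rainy AND windy

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[1, 1]]))   # rainy + strong wind -> [0], stay at home
print(tree.predict([[1, 0]]))   # rainy + weak wind   -> [1], go out
```

The fitted tree reproduces the human decision rule: it first tests the weather, and only under rain does it go on to test the wind.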
2.1.2 Decision tree graph description
As mentioned above, the decision tree uses a tree model to solve classification or regression problems. Figure 1 is a simple example of a decision tree in which a person decides to go out and play or stay at home and study, depending on some criteria. The first criterion is the weather: if the weather is sunny, he will surely go out and play; however, if the weather is rainy, another criterion is taken into consideration, which is the wind. If the wind is weak, he still wants to go out; otherwise, he will stay at home and study.
There are some features that need to be explained in order to fully understand a tree model:
A tree is a graph in which any two vertices are connected by exactly one path.
A node represents a test or a question (e.g. whether the weather is sunny or rainy); each branch splitting from that node stands for a different outcome of the question.
A rooted tree is a tree model in which one specific node is designated as the root and all paths lead away from that node.
If there is a path leading from node t1 to node t2, then node t1 is the parent of node t2 and node t2 is a child of node t1.
Figure 1 Decision tree graph description
An internal node is a node that has one or more children; on the other hand, a terminal node has no children. A terminal node is also called a leaf node, which basically holds the decision made once the conditions above it are satisfied.
A binary tree is a rooted tree in which each internal node has exactly two children.
The depth of a tree model is measured by the longest path from the root to a leaf. The size of a tree model is the total number of nodes in the tree.
In general, a decision tree is a model that uses a classifier to determine output from input, represented by a rooted tree in which each node is a subspace of the input and the root node is the whole input. Nodes are then split into children nodes representing the corresponding subspaces, divided using a split question st.
2.1.3 Induction
The most obvious purpose of building a decision tree model is to learn how to partition a learning set so as to provide the most reliable outcome variables possible. A decision tree solves a classification or regression problem with a tree-based model by partitioning the dataset into subsets depending on the similarity of attributes.
In the process of building a model, users may find more than one way of explaining the learning set effectively; in that case, the explanation containing the fewest assumptions is preferred, which means choosing the simplest method applied to the data, or the smallest tree. This is generally easy to understand: the smaller the tree, the more understandable and easier to read it is. However, choosing the best model for the data does not depend only on the size of the tree and remains a difficult task, since too big a tree can present the overfitting problem, while too small a tree presents the under-fitting situation (these two problems are explained more clearly below).
In order to identify the best way to partition the dataset into subspaces and optimize certain criteria, the following features need to be taken into consideration:
Splitting process
Selecting the attribute test condition
Identifying the best way to split
When to stop splitting
2.1.3.1 Splitting process
a) Selecting the attribute test condition [13]
There are two criteria to pay attention to when splitting a dataset: the type of attribute and the number of subsets to split into.
Nominal: The nominal attribute type contains only words, with no quantitative values, and cannot be sorted or ranked in any meaningful order by itself.
In the simple example above, the weather dataset can be partitioned into subsets in different ways; all subsets are of nominal type (Sunny, Overcast, Rainy) and cannot be sorted in a meaningful way without corresponding sub-data. Depending on the data at hand, users can divide the weather dataset into three separate subspaces or group two of them into one.
- Ordinal: Ordinal attributes are also non-numerical values, but they contain a meaningful order.
Figure 2 Nominal test condition
In this example, the wind dataset has three ordinal attribute values: Strong, Medium and Weak. These three values are categorized based on how strong the wind was that day, and the results can be measured and sorted in a meaningful way. Similar to the previous example, users can also divide the wind dataset in different ways: into three separate subsets, or by combining two of them into one.
Continuous: Unlike the two attribute types above, the continuous type contains quantitative values and can take infinitely many values within a selected range of numbers. While with nominal and ordinal types the user can simply divide subsets using their labels, with the continuous type the user needs to separate the attribute into ranges or into a binary decision.
The way of dividing continuous attributes into different ranges of numbers is called discretization. Ranges of numbers can be identified by equal division, equal percentiles or clustering.
The second way to divide continuous attributes is a binary decision. Instead of partitioning the numbers into many different ranges, the user can identify a single number that splits the dataset into two meaningful subsets.
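Both approaches can be sketched with pandas (assuming it is installed; the temperature values, bin counts and labels below are made up for illustration):

```python
import pandas as pd

temps = pd.Series([3, 8, 15, 21, 27, 33])   # hypothetical temperature readings

# 1) Discretization: equal-width ranges (pd.cut) or equal percentiles (pd.qcut)
equal_width = pd.cut(temps, bins=3, labels=["cold", "mild", "hot"])
equal_freq  = pd.qcut(temps, q=3, labels=["low", "mid", "high"])
print(list(equal_width))   # ['cold', 'cold', 'mild', 'mild', 'hot', 'hot']

# 2) Binary decision: a single threshold splits the data into two subsets
print(list(temps > 20))    # [False, False, False, True, True, True]
```

Equal division gives bins of equal width, while equal percentiles put (roughly) the same number of samples into each bin; which one is appropriate depends on how skewed the attribute is.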
b) Identifying the best split
Now that the selection of the test condition has been introduced, another task that remains difficult for the user is determining the way of splitting that provides optimized, meaningful results.
For example, suppose a user has a dataset containing the height of each student in a school, and the purpose is to draw out valuable information about students' heights.
The first way to split the data a user may think of is to calculate the average height of each class in the school. However, as in the above graph, the differences between the classes are small, and the user cannot draw any meaningful conclusion besides class A having the lowest average height and class C the highest; with a difference of only 1 cm, that conclusion does not carry any outstanding meaning.
Changing the way of splitting the dataset, the user partitions the attribute based on gender, and now the difference between the two heights can be seen clearly, yielding a more meaningful conclusion.
In general, to obtain the best possible way of partitioning a dataset, the user needs to ensure that attributes in the same subset are homogeneous, meaning they are closely related or similar. Conversely, the subsets themselves are required to be heterogeneous: they need to be easily distinguishable from each other and yield noticeable differences.
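Homogeneity of a subset is commonly scored with an impurity measure such as Gini impurity; a good split is one whose child subsets have much lower impurity than their parent. A minimal sketch follows (not from the thesis; `gini` is a hypothetical helper and the label lists are made up):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions.
    0.0 means a perfectly homogeneous subset; 0.5 is the worst case
    for two classes (a maximally mixed subset)."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["Yes", "Yes", "Yes", "Yes"]))   # 0.0   -> perfectly homogeneous
print(gini(["Yes", "No", "Yes", "No"]))     # 0.5   -> maximally mixed
print(gini(["Yes", "Yes", "Yes", "No"]))    # 0.375 -> somewhere in between
```

A decision tree induction algorithm evaluates candidate splits by the weighted impurity of the resulting children and picks the split that lowers it the most; entropy (information gain) is used the same way.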
2.1.3.2 Stop splitting
After knowing how to split data, the next question would be: "How does the user know when to stop partitioning?" This question is actually quite easy to answer; it is based on two criteria [38]:
- Stop expanding the tree model when all attributes fall into the same category.
- Stop expanding the tree model when the features are all used up.
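The two stopping criteria above can be sketched as a small helper function (`should_stop` is a hypothetical name, not from the thesis):

```python
def should_stop(labels, features):
    """Decide whether a node should become a leaf instead of splitting further."""
    all_one_category = len(set(labels)) == 1   # every sample already agrees
    features_used_up = len(features) == 0      # nothing left to split on
    return all_one_category or features_used_up

print(should_stop(["Yes", "Yes", "Yes"], ["wind"]))        # True: pure node
print(should_stop(["Yes", "No"], []))                      # True: no features left
print(should_stop(["Yes", "No"], ["wind", "weather"]))     # False: keep splitting
```

Real implementations usually add further criteria, such as a maximum depth or a minimum number of samples per leaf, which is exactly how over-growing the tree is prevented.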
2.1.4 Advantages and disadvantages
2.1.4.1 Advantages: [10][11]
The decision tree is a non-parametric model, which means it involves no parameters in the process and makes no prior assumptions about its model; this enables the decision tree to recognize complex relations within the dataset.
With basics that work similarly to the human decision-making process, the decision tree is the algorithm that makes the most sense to beginners, and is thereby easy to interpret and understand.
The decision tree can work well with noise and missing values, as these do not affect the process of building a model.
As a result of the above advantage, working with a decision tree requires less data pre-processing, normalizing or scaling.
Tree models are usually small and compact compared with other classifiers. The decision tree can also minimize the size of the dataset, as the depth of a tree is much smaller than the number of attributes in a dataset.
As one of the earliest developed algorithms, the decision tree is the foundation of many modern ones, including random forest.
2.1.4.2 Disadvantages
There are two typical disadvantages users can easily encounter when using a decision tree, which are under-fitting and overfitting. To prevent these situations, users should apply splitting conditions that are neither too strict nor too loose when partitioning the dataset.
o Overfitting: Overcomplicating the dataset makes the tree unnecessarily big and complex, and users will find it difficult to understand and process a huge decision tree. An overfitting tree is usually complex and has a high chance of errors.
o Under-fitting: Contrary to overfitting, an under-fitting tree oversimplifies the dataset, resulting in a small tree that can miss important attributes. Although users can easily understand it, as it is small, an under-fitting tree can make many errors.
The decision tree is a greedy model: at each split it finds the attribute that is optimal for that split, which can result in a tree that is not optimal as a whole. Also, as a greedy model, the decision tree can produce unnecessarily complex results in order to optimize each split as much as possible.
Although the decision tree can work with both categorical and numerical data, it is inadequate when applied to numerical data. It can be inflexible, as the way of splitting depends on the Y-axis and X-axis, so results when working with numbers cannot be as effective as with other classifiers.
2.2 Random forest
2.2.1 Definition
Random forest is an ensemble of decision trees built on different subsamples taken out of the training dataset. Ensembling and averaging many decision trees with different structures helps to produce predictions with higher accuracy [8].
There are many decision trees within a random forest, each built on a slightly different training set subsampled from the original dataset. At each splitting node, the best split is considered to optimize the model; this process is repeated until all attributes at each leaf node fall under one category or the maximum depth of the tree is reached. Finally, the predictions are made by averaging the results of all trees (for regression) or taking the outcome with the highest vote among the trees (for classification).
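The process described above can be sketched with scikit-learn (assuming it is installed; the generated dataset merely stands in for a real customer dataset, and all parameter values are illustrative, not the thesis' settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for a real customer dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 trees (n_estimators), each grown on a different bootstrap subsample;
# predictions are combined by majority vote across the trees
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_tr, y_tr)
print(round(forest.score(X_te, y_te), 2))   # accuracy on held-out data
```

Each tree also considers only a random subset of features at every split, which is what de-correlates the trees and makes their averaged vote more stable than any single tree.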
2.2.2 Why random forest is better than decision tree
As mentioned above, the decision tree is a greedy classifier: it chooses the best possible split at each split node to optimize the model, even though this can lead to problems with the model, such as overfitting or under-fitting. Random forest was thereby developed to overcome this defect.
A decision tree functions well and partitions the training dataset to optimize the final results, especially when the user does not set a max depth for the model; however, the purpose of a decision tree is not only to fit the training data well, but also to function well and predict accurately on new data.
There are two possible failure cases when building a decision tree model, overfitting and under-fitting:
Overfitting: the model becomes too big and complex, also known as a flexible model. It over-partitions the dataset and memorizes both actual relations and noise, so the overfitting model is no longer accurate on new data. A flexible model is said to have high variance, since a small change in the training data leads to a considerable change in the model.
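The overfitting case above can be made concrete with a small experiment (a sketch assuming scikit-learn; the dataset and parameters are made up): an unconstrained tree memorizes noisy training data perfectly, while a depth-limited tree does not.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 20% of the labels are flipped, i.e. the data contains deliberate noise
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)           # no depth limit
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print(deep.score(X_tr, y_tr))      # 1.0: the flexible tree memorizes the noise
print(shallow.score(X_tr, y_tr))   # below 1.0: the constrained tree cannot
print(deep.score(X_te, y_te), shallow.score(X_te, y_te))   # held-out accuracy
```

The unconstrained tree's perfect training score is exactly the memorization described above; limiting the depth trades some training accuracy for a simpler, lower-variance model, and random forest pushes the same idea further by averaging many such trees.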