Hanoi - Year 2020
TO PREDICT CUSTOMERS’ BEHAVIORS
SUPERVISOR: Dr. Tran Duc Quynh
STUDENT: Nguyen Huong Ly
STUDENT ID: 16071293 COHORT:
MAJOR: MIS20164
LETTER OF DECLARATION
I hereby declare that the Graduation Project APPLYING RANDOM FOREST TO PREDICT CUSTOMERS' BEHAVIORS is the result of my own research and has never been published in any work of others. During the implementation process of this project, I have seriously followed research ethics; all findings of this project are results of my own research and surveys; all references in this project are clearly cited according to regulations.
I take full responsibility for the fidelity of the numbers, data and other contents of my graduation project.
Hanoi, June 2020
Student
Nguyen Huong Ly
ACKNOWLEDGEMENT
Firstly, I would like to express my sincere appreciation toward Dr. Tran Duc Quynh. I am proud and honored to have been guided and helped to finish my graduation thesis under his supervision.
Secondly, I would also like to express my gratitude to the teachers and professors who have taught me, giving me enough knowledge and skills to be able to finish my graduation thesis.
Last but not least, sincere thanks to my family and friends who always stay by my side and encourage me to overcome challenges during the process of writing this graduation thesis.
TABLE OF CONTENTS
CHAPTER 1. INTRODUCTION TO MACHINE LEARNING
1.1 Definition
1.2 Application
1.3 Classification
1.4 Advantages and disadvantages
1.5 Accuracy in machine learning
CHAPTER 2. THEORETICAL BACKGROUND
2.2.2 Why random forest is better than decision tree
2.2.3 How random forest works
2.2.4 Advantages and disadvantages
2.2.5 Application
2.3 Multilayer perceptron
2.3.1 Definition
2.3.2 How multilayer perceptron works
2.3.3 Advantages and disadvantages
3.3.3 Model and result
TABLE OF NOTATIONS AND ABBREVIATIONS
LIST OF TABLES
Table 1. Confusion matrix
Table 2. Random forest results
LIST OF CHARTS AND FIGURES
Figure 1. Decision tree graph description
Figure 2. Nominal test condition
Figure 3. Ordinal test condition
Figure 4. Continuous test condition
Figure 5. Best split
Figure 6. Identify best split
Figure 7. Step splitting
Figure 8. Stop splitting
Figure 9. Multilayer perceptron
Figure 10. General description
Figure 11. Number of responses "Yes" and "No"
Figure 12. Relationship of renew offer and response
ABSTRACT
The overall purpose of this thesis is to apply machine learning, specifically random forest and multilayer perceptron, to solve a realistic problem. The thesis consists of three parts: "Introduction to machine learning", which briefly introduces the concept of machine learning and its applications; "Theoretical background", which presents the concepts of the classifiers used in solving the problem; and lastly "Case study", which applies all the above theories to a real-life problem. After solving it, some important conclusions can be drawn, such as interesting insights into the dataset and how to build the best possible prediction model.
CHAPTER 1. INTRODUCTION TO MACHINE LEARNING
1.1 Definition
Machine learning is an application of AI technology and can be defined as the study of machines that have the ability to automatically learn and improve using fed data for training, without the need for constant programming or human interference [1]. Basically, machine learning uses algorithms to solve classification or regression problems by developing automated models that predict phenomena based on a provided set of data.
Besides giving predictions, by using machine learning and its applications, users can gain better insights into data, identify and distinguish similar patterns, and determine the elements that most affect the end results. This is especially important when experts need to find out which elements or variables drive a phenomenon. For example, when analyzing a marketing campaign, rather than identifying which campaign brings more profit, it is more useful to determine which attributes drive the success of the campaign. Using the identified attributes, managers can have a better idea of how and why customers respond to a marketing campaign and improve the company's next strategies.
1.2 Application
Machine learning implementation expands each day as industries need to process vast amounts of data. It helps enterprises gain better insights into and understanding of acquired data so as to perform more effectively. There are many fields in which machine learning has proved its usefulness, namely [17]:
— Finance: Machine learning is mostly used to provide insights and detect fraudulent transactions. It can help investors identify valuable opportunities or when to trade; importantly, it can determine which profile is high-risk or which transaction is likely to be fraudulent.
— Healthcare: As mentioned, machine learning helps experts improve diagnosis and treatment, especially in areas requiring detailed and complex data such as genetics or brain cells.
— Retail: Machine learning has shown its importance in the retail industry recently. Its implementation can lead to an increase in sales quantity by analyzing purchasing history, giving relevant recommendations and personalizing the shopping experience.
— Others: Machine learning applications are common in day-to-day life, including virtual assistants (Siri, Google Assistant, Alexa, etc.), face recognition and "people you may know" on Facebook, spam filtering in email, etc.
1.3 Classification
— Supervised learning: The algorithm learns from past labeled data and the comparison between its outcomes and the actual end results; the model is then modified accordingly to provide the most correct possible outcome when applied to a similar unknown data set [17][18].
— Unsupervised learning: The algorithm is applied to a set of plain data without associated outcomes, leaving it to figure out hidden patterns or valuable information by itself. It is mostly used for identifying meaning and insight in data, for instance, to determine customer segments to be treated similarly in a marketing campaign [17][18].
— Semi-supervised learning: The algorithm is applied to a set of data consisting of both labeled and unlabeled data, as acquiring fully labeled data is costly, time- and effort-consuming. Semi-supervised learning is used for the same purposes as the supervised kind [17][18].
— Reinforcement learning: Using unlabeled data like unsupervised learning, the algorithm in reinforcement learning uses trial and error to determine the greatest reward. It is mostly applied in robotics, video games, etc. [17][18].
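The contrast between the first two paradigms can be sketched in a few lines. The following is an illustrative, standard-library-only example (the toy height values, the "nearest class mean" classifier and the one-threshold clustering are assumptions for demonstration, not methods from this thesis): the supervised part learns from labeled examples, while the unsupervised part must group the same values with no labels at all.

```python
# Supervised: labeled heights -> learn one mean per class, classify by nearest mean.
labeled = [(150, "short"), (155, "short"), (180, "tall"), (185, "tall")]

def train_nearest_mean(data):
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}   # class -> mean value

means = train_nearest_mean(labeled)

def predict(x, means):
    # Assign the class whose mean is closest to x.
    return min(means, key=lambda y: abs(x - means[y]))

# Unsupervised: the same values without labels -> split into two groups
# (a crude two-cluster split at the midpoint of the value range).
unlabeled = [150, 155, 180, 185]
threshold = (min(unlabeled) + max(unlabeled)) / 2
clusters = [[x for x in unlabeled if x <= threshold],
            [x for x in unlabeled if x > threshold]]
```

The supervised model can now answer questions about new points ("is 152 short or tall?"), whereas the unsupervised split only says which values belong together.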
1.4 Advantages and disadvantages
1.4.1 Advantages
Recently, machine learning has gained more fame and been implemented in ever more fields due to its vast advantages [36][28]:
— Automation: As mentioned above, the most useful advantage of machine learning is its ability to self-learn. Using developed algorithms, machine learning can constantly analyze real-time data and provide the required outcomes effectively. However, users cannot rely on it completely; different implementations require different algorithms and preprocessing steps.
— Application: Nowadays there are many fields applying machine learning in their business processes, namely medicine, finance, meteorology, science, technology, etc. Its wide variety of applications makes it worth investing in.
— Data handling: Along with the development of the world comes an increase in the scope of data. The amount of data that needs to be analyzed and the requirements in time and accuracy make it difficult, not to say impossible, for human cognitive ability. On the other hand, machine learning can analyze enormous, real-time sets of data and provide accurate outcome variables.
1.4.2 Disadvantages
— Algorithm: Despite its self-learning ability, machine learning cannot be automated from the beginning. The selection of an algorithm remains a manual task which requires an investment of time and effort to achieve the best possible outcome.
— Data: The application of machine learning is based on data; it needs data to be able to function. However, applying machine learning to an enormous set of data, especially real-time data updated consistently, users may face the problem of data inconsistency, which is not a good sign for the developed algorithms.
1.5 Accuracy in machine learning
As mentioned above, machine learning was born to predict phenomena using a set of collected data. The prediction is applied in many realistic fields such as business, finance, medicine, etc. that require relatively high accuracy to be able to function properly. One way of measuring prediction model performance in machine learning is a metric called the confusion matrix.
The confusion matrix is defined as "a performance measurement for machine learning classification problems where output can be two or more classes" [22]. After finishing building a prediction model, the user can evaluate the accuracy level once again and have an overview of the performance of the model using the confusion matrix.
Table 1. Confusion matrix (cells: true positive, false negative, false positive, true negative)
— True positive (TP): True positive measures the number of correctly predicted positive responses, which means the model predicts Yes and the actual outcome is Yes.
— True negative (TN): True negative is the contrast of true positive; it measures the number of correctly predicted negative responses, which means the model predicts No and the actual outcome is No.
— False positive (FP): False positive measures false values where the model predicts Yes but the actual value is No.
— False negative (FN): False negative also measures false values, but where the model predicts No and the actual value is Yes.
These four concepts are extremely important to remember and understand, as they will be used to measure the effectiveness and accuracy of a model.
There are also four important calculated values that are used to evaluate the accuracy level of a prediction model, namely [34]:
— Accuracy: As the name suggests, the accuracy score is used to measure the accuracy of a prediction model, i.e. whether it gives correct predictions or not. It is really easy to understand and can be the first method a user thinks of when reviewing a model. It basically is the ratio of correctly predicted responses to total responses. Although it is easy to use, the accuracy score is more effective when applied to a balanced dataset (Yes and No responses are almost equal) and may not show much meaning when used on an imbalanced dataset.
— Precision: The precision score is the ratio of correctly predicted positive responses to total predicted positive responses. More specifically, it asks the question: out of those the model predicts as positive (in this case, Yes), how many actually respond positive (Yes)? The precision score falls into the range [0;1]; the higher the score, the better the model.
— Recall: The recall score is the ratio of correctly predicted positive responses to total actual positive responses. In this example, the question is: among those who actually respond Yes, how many were captured by the model? Similar to the precision score, the recall score also ranges over [0;1], and the higher the score, the better.
— F1 score: The F1 score takes into consideration both the recall and precision scores. Beginner users may find it difficult to understand the F1 score deeply; however, while the accuracy score works best with a balanced dataset, the F1 score proves its effectiveness when applied to an imbalanced dataset.
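The four scores above follow directly from the TP/TN/FP/FN counts. A minimal pure-Python sketch (the toy Yes/No response lists are illustrative assumptions, not data from the case study):

```python
# Count the four confusion-matrix cells from actual vs. predicted labels.
def confusion_counts(actual, predicted, positive="Yes"):
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

def scores(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)      # of predicted Yes, how many really are Yes
    recall = tp / (tp + fn)         # of actual Yes, how many were captured
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

actual    = ["Yes", "Yes", "No", "No",  "Yes", "No"]
predicted = ["Yes", "No",  "No", "Yes", "Yes", "No"]
tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy, precision, recall, f1 = scores(tp, tn, fp, fn)
```

On this toy data the counts are TP=2, TN=2, FP=1, FN=1, so accuracy is 4/6 while precision, recall and F1 are each 2/3, showing how the four values answer different questions about the same predictions.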
CHAPTER 2. THEORETICAL BACKGROUND
2.1 Decision tree
2.1.1 Definition
Tree-based models are among the most effective methods for producing reliable and understandable results and predictions, applicable to almost any kind of data.
A decision tree is an algorithm that falls into the category of supervised learning. It solves problems by using a tree model in which each class label is represented by a node and attributes correspond to internal nodes [7].
Unlike other algorithms in the same supervised learning category, decision trees are outstanding in the fact that they can be applied to both classification and regression problems.
— Classification problems can be understood as those demanding categorical output variables. For example, when users try to predict whether a product price is below or above customers' expected level, it can be considered a classification problem.
— Regression problems work with numerical or continuous data. Their output variables are often predictions about prices, growth percentages, etc. A specific example of a regression problem is when users are asked to give a prediction about a house price, provided a related house price dataset.
Basically, a decision tree is used to train a machine from which it can give predictions about output variables, which can be either categorical or numerical values, based on rules learned from the provided data.
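Fitting such a tree takes only a few lines. The sketch below assumes scikit-learn is installed; the integer encodings and toy weather/wind rows are illustrative assumptions that mirror the play-or-study example discussed in the next section, not data from the thesis.

```python
from sklearn.tree import DecisionTreeClassifier

# Features: [weather (0 = sunny, 1 = rainy), wind (0 = weak, 1 = strong)]
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = ["play", "play", "play", "study"]   # target label for each row

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)                          # learn split rules from the data

print(model.predict([[1, 1]]))           # rainy day with strong wind
```

The fitted tree encodes the same kind of if/else rules a person would write by hand, which is why the next section can describe it with a simple graph.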
2.1.2 Decision tree graph description
Figure 1. Decision tree graph description
As mentioned above, a decision tree uses a tree model to solve classification or regression problems. This is a simple example of a decision tree in which a person decides to go out and play or stay at home and study depending on some criteria. The first criterion is the weather: if the weather is sunny, he will surely go out and play; however, if the weather is rainy, another criterion is taken into consideration, which is the wind. If the wind is weak, he still wants to go out; on the other hand, if the wind is strong, he will stay at home and study.
— A node represents a test or a question (e.g. whether the weather is sunny or rainy); each branch splitting from that node stands for a different outcome of the question.
— A rooted tree is a tree model in which one specific node is designated as the root and all paths lead away from that node.
— If there is a path leading from node t1 to node t2, then node t1 is the parent of node t2 and node t2 is the child of node t1.
— An internal node is a node that has one or more children; on the other hand, a terminal node is a node that has no child. A terminal node is also called a leaf node, which basically is the decision made once the above conditions are satisfied.
— A binary tree is a rooted tree in which each internal node has exactly two children.
— The depth of a tree model is measured by the longest path from the root to a leaf. The size of a tree model is the total number of nodes in that tree.
In general, a decision tree is a model using a classifier to determine output from input, presented by a rooted tree in which each node is a subspace of the input and the root node is the whole input. Nodes are then split into different children nodes representing the corresponding subspaces, divided using a split question s.
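The terminology above (root, parent/child, leaf, depth, size) maps directly onto a small data structure. This is a hypothetical sketch in plain Python; the `Node` class and the example questions are illustrative, not code from the thesis.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    question: Optional[str] = None   # split question at an internal node
    decision: Optional[str] = None   # decision stored at a leaf (terminal) node
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children     # terminal node: no children

# Root covers the whole input; each child covers a subspace of it.
root = Node(question="weather == sunny?",
            children=[Node(decision="play"),
                      Node(question="wind == weak?",
                           children=[Node(decision="play"),
                                     Node(decision="study")])])

def depth(node):
    # Longest path from this node down to a leaf.
    return 0 if node.is_leaf() else 1 + max(depth(c) for c in node.children)

def size(node):
    # Total number of nodes in the (sub)tree.
    return 1 + sum(size(c) for c in node.children)
```

Here `depth(root)` is 2 and `size(root)` is 5, matching the definitions of depth and size given above.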
2.1.3 Induction
The most obvious purpose of building a decision tree model is to learn how to partition a learning set so as to provide the most reliable possible outcome variables. A decision tree solves a classification or regression problem using a tree-based model by partitioning the dataset into subsets depending on the similarities of attributes.
In the process of building a model, users can find more than one way of explaining the learning set effectively; the explanation containing the fewest assumptions will be preferred, which means choosing the simplest method applied to the data, or the smallest tree. This is generally easy to understand, as the smaller the tree, the more understandable and easier to read it is. However, choosing the best model for the data does not depend only on the size of the tree and still remains a difficult task, since a small tree can present an under-fitting problem, while too big a model presents an overfitting situation (these two problems will be explained more clearly below).
In order to identify the best way to partition a dataset into subspaces and optimize certain criteria, there are features that need to be taken into consideration:
— Splitting process
• Select attribute test condition
• Identify the best way to split
— When to stop splitting
2.1.3.1 Splitting process
a) Select attribute test condition [13]
There are two criteria to pay attention to when splitting a dataset: the type of attribute and how many subsets should be split.
— Nominal: The nominal attribute type contains only words, no quantitative values, and cannot be sorted or ranked in any meaningful order by itself.
Figure 2. Nominal test condition
In the simple example above, the weather dataset can be partitioned into subsets in different ways; all subsets are of nominal type — Sunny, Overcast, Rainy — and cannot be sorted in a meaningful way without corresponding sub-data. Depending on the data at hand, users can divide the weather dataset into three separated subspaces or group two of them into one group.
— Ordinal: Ordinal attributes are also non-numerical values, but they contain a meaningful order.
Figure 3. Ordinal test condition
In this example, the wind dataset has three ordinal attributes — Strong, Medium and Weak. These three attributes are categorized based on how strong the wind was that day, and the results can be measured and sorted in a meaningful way. Similar to the previous example, users can also divide the wind dataset in different ways: three separated subsets, or combining two of them into one.
— Continuous: Unlike the two above types of attribute, the continuous type contains quantitative values and can have infinitely many values within a selected range of numbers. While with nominal and ordinal types the user can simply divide subsets using their labels, with the continuous type the user needs to separate the attribute into ranges or into a binary decision.
The way of dividing continuous attributes into different ranges of numbers is called discretization. Ranges of numbers can be identified by equal division, equal percentiles or clustering.
The second way to divide continuous attributes is a binary decision. Instead of partitioning numbers into many different ranges, the user can identify a single number that splits the dataset into two meaningful subsets.
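Both treatments of a continuous attribute can be sketched in a few lines of plain Python (the toy ages, the choice of three equal-width ranges and the threshold of 40 are illustrative assumptions):

```python
ages = [15, 22, 34, 41, 58, 63]

# Discretization: equal-width division of the value range into 3 buckets.
lo, hi = min(ages), max(ages)
width = (hi - lo) / 3
def bucket(x):
    # Clamp the top value into the last bucket (0, 1 or 2).
    return min(int((x - lo) // width), 2)

buckets = [bucket(x) for x in ages]

# Binary decision: one threshold splits the attribute into two subsets.
threshold = 40
below = [x for x in ages if x <= threshold]
above = [x for x in ages if x > threshold]
```

Equal-percentile division or clustering would replace the fixed `width` with quantile boundaries or learned cluster centers, but the resulting test conditions take the same form.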
b) Identify the best split
Now that the selection of the test condition has been introduced, another task that remains difficult for the user is how to determine the way of splitting that provides optimized, meaningful results.
For example, a user has a dataset containing the height of each student in a school; the purpose is to draw out valuable information about students' height.
Figure 5. Best split
The first way to split the data a user can think of is to calculate the average height of each class in that school. However, as in the above graph, the differences between the classes are small and the user cannot draw any meaningful conclusion beyond class A having the lowest average height while class C has the highest; with a difference of 1 cm, that conclusion does not give any valuable information.
Changing the way of splitting the dataset, the user partitions the attribute based on gender, and now the difference between the two heights can be seen clearly and a more meaningful result can be concluded.
In general, to have the best possible way of partitioning a dataset, the user needs to take into consideration that attributes in the same subset should be homogeneous, which means they are closely related or similar. Contrarily, subsets are required to be heterogeneous; they need to be easily distinguished from each other and yield noticeable differences.
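Homogeneity within subsets can be quantified; Gini impurity is one standard criterion that decision tree implementations commonly use for this, although the thesis does not name it explicitly. In the toy sketch below (the label lists are illustrative assumptions echoing the height example), a split whose subsets each contain one dominant class scores a lower impurity than a split that mixes the classes.

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 0 for a pure subset, higher for mixed subsets.
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# Splitting students by class barely separates the labels...
by_class = [["short", "tall"], ["short", "tall"], ["tall", "short"]]
# ...while splitting by gender (in this toy data) gives pure subsets.
by_gender = [["short", "short", "short"], ["tall", "tall", "tall"]]

impurity_class = sum(gini(g) for g in by_class) / len(by_class)
impurity_gender = sum(gini(g) for g in by_gender) / len(by_gender)
```

The gender split has impurity 0 (perfectly homogeneous subsets) while the class split stays at 0.5, which is exactly the intuition behind preferring the second split in Figure 5.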
2.1.3.2 Stop splitting
After knowing how to split the data, the next question would be: "How does the user know when to stop partitioning?" This question is actually quite easy to understand and answer; it is based on two criteria [38]:
— Stop expanding the tree model when all attributes fall into the same category.
— Stop expanding the tree model when the features of the dataset are all used up.
2.1.4.1 Advantages
— With the basics working similarly to the human decision-making process, the decision tree is the algorithm that makes the most sense to beginner users; thereby it is easy to interpret and understand.
— Decision trees can work well with noisy or missing values, as these values do not affect the process of building a model.
— As a result of the above advantage, working with decision trees requires less data pre-processing, normalizing or scaling of data.
— Tree models are usually small and compact compared with other classifiers. A decision tree can also minimize the size of a dataset, as the depth of a tree is a lot smaller than the number of attributes in a dataset.
— As one of the earliest developed algorithms, the decision tree is the foundation of many other modern ones, including random forest.
2.1.4.2 Disadvantages [10][11]
— There are two typical disadvantages users can easily see when using decision trees, which are under-fitting and overfitting. To prevent these situations, when partitioning the dataset, users should not apply conditions that are too strict or too loose.
• Overfitting: Overcomplicating the dataset makes the tree too big and unnecessarily complex, and users will find it difficult to understand and process a huge decision tree. An overfitting tree is usually complex and has a high chance of errors.
• Under-fitting: Contrary to overfitting, an under-fitting tree oversimplifies the dataset, resulting in a small tree that can miss important attributes. Although users can easily understand it, as it is small, an under-fitting tree can have many errors.
— The decision tree is a greedy model; with each split it will find the attribute that is optimal for the corresponding split, but this can result in a model that is not optimal as a whole.
— Also as a greedy model, a decision tree can lead to unnecessarily complex results in order to optimize each split as much as possible.
— Although decision trees can work with both categorical and numerical data, they are inadequate when applied to numerical data. They can be inflexible, as the way of splitting depends on the Y-axis and X-axis; thereby the results of working with numbers may not be as effective as with other classifiers.
2.2 Random forest
2.2.1 Definition
A random forest is an ensemble of decision trees built on different subsamples taken out of the training dataset. The ensembling and averaging of many decision trees with different structures helps to produce predictions with higher accuracy [8].
There are many decision trees existing within a random forest, each built on a slightly different training set subsampled from the original dataset. At each splitting node the best split is chosen to optimize the model, and this process is repeated until all attributes at each leaf node fall under one category or the tree reaches its maximum depth. Finally, the predictions are made by averaging the results of each tree (for regression) or taking the one with the highest vote among the trees (for classification).
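The procedure just described — many trees, each fit on a bootstrap subsample, combined by majority vote — can be sketched as follows, assuming scikit-learn is installed. The repeated toy rows and the Yes/No target are illustrative assumptions, not the case-study data.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data: the response is "Yes" only when both features are 1.
# Rows are repeated so bootstrap subsampling has something to draw from.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
y = ["No", "No", "No", "Yes"] * 5

# n_estimators trees, each grown on a bootstrap sample of the rows;
# the forest predicts by majority vote across its trees.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)

print(forest.predict([[1, 1]]))
```

Each individual tree sees slightly different data and may differ in structure, which is precisely why the averaged vote tends to be more stable than any single decision tree, as section 2.2.2 argues.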
2.2.2 Why random forest is better than decision tree
As mentioned above, the decision tree is a greedy classifier; it chooses the best possible split at each split node to optimize the model, although this can lead to different problems with the model, such as overfitting or under-fitting. Thereby the random forest was developed to overcome this defect.
A decision tree functions well and partitions the training dataset to optimize the final results, especially when the user does not set a maximum depth for the model; however, the purpose of a decision tree is not only to fit the training data well but also to function well and predict accurately on new data.
There are two possible cases when building a decision tree model, overfitting and under-fitting:
• Overfitting: when the model becomes too big and complex, also known as a flexible model. It over-partitions the dataset, memorizing both actual relations and noise, so the overfitting model is no longer accurate. A flexible model is said to have high variance when a small change in the training data leads to a considerable change in the model.