VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
GRADUATION PROJECT
PROJECT NAME
APPLYING RANDOM FOREST AND NEURAL NETWORK MODEL
TO PREDICT CUSTOMERS’ BEHAVIORS
Student’s name
NGUYEN HUONG LY
Hanoi - Year 2020
SUPERVISOR: Dr Tran Duc Quynh
STUDENT: Nguyen Huong Ly
STUDENT ID: 16071293
COHORT:
MAJOR: MIS2016A
Hanoi - Year 2020
LETTER OF DECLARATION
I hereby declare that the Graduation Project APPLYING RANDOM FOREST AND NEURAL NETWORK MODEL TO PREDICT CUSTOMERS' BEHAVIORS is the result of my own research and has never been published in any work of others. During the implementation of this project, I have seriously observed research ethics; all findings of this project are the results of my own research and surveys; all references in this project are clearly cited according to regulations.
I take full responsibility for the fidelity of the numbers, data, and other contents of my graduation project.
Hanoi, 4th June 2020
Student
Nguyen Huong Ly
ACKNOWLEDGEMENT
Firstly, I would like to express my sincere appreciation to Dr. Tran Duc Quynh. I am proud and honored to have been guided and helped to finish my graduation thesis under his supervision.
Secondly, I would also like to express my gratitude to the teachers and professors who have given me the knowledge and skills needed to finish my graduation thesis.
Last but not least, sincere thanks to my family and friends, who always stayed by my side and encouraged me to overcome challenges during the process of writing this graduation thesis.
Hanoi, 4th June 2020
Student
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION TO MACHINE LEARNING
1.1 Definition
1.2 Application
1.3 Classification
1.4 Advantages and disadvantages
1.5 Accuracy in machine learning
CHAPTER 2 THEORETICAL BACKGROUND
2.1 Decision tree
2.1.1 Definition
2.1.2 Decision tree graph description
2.1.3 Induction
2.1.4 Advantages and disadvantages
2.2 Random forest
2.2.1 Definition
2.2.2 Why random forest is better than decision tree
2.2.3 How random forest works
2.2.4 Advantages and disadvantages
2.2.5 Application
2.3 Multilayer perceptron
2.3.1 Definition
2.3.2 How multilayer perceptron works
2.3.3 Advantages and disadvantages
2.3.4 Application
CHAPTER 3 CASE STUDY
3.1 Problem
3.1.1 Problem statement
3.1.2 Data description
3.2 Tools introduction
3.2.1 Python
3.2.2 Packages
3.3 Problem solving
3.3.1 Data insight
3.3.2 Data preprocessing
3.3.3 Model and result
3.3.4 Conclusion
REFERENCES
APPENDIX
TABLE OF NOTATIONS AND ABBREVIATIONS
LIST OF TABLES
Table 1 Confusion matrix
Table 2 Random forest results
Table 3 MLP results
Table 4 Oversampling Random Forest
Table 5 Oversampling MLP
Table 6 Under-sampling Random Forest
Table 7 Under-sampling MLP
LIST OF CHARTS AND FIGURES
Figure 1 Decision tree graph description
Figure 2 Nominal test condition
Figure 3 Continuous test condition
Figure 4 Continuous test condition
Figure 5 Best split
Figure 6 Identify best split
Figure 7 Stop splitting
Figure 8 Stop splitting
Figure 9 Multilayer perceptron
Figure 10 General description
Figure 11 Number of responses "Yes" and "No"
Figure 12 Relationship of renew offer and response
ABSTRACT
The overall purpose of this thesis is to apply machine learning, specifically random forest and the multilayer perceptron, to solve a realistic problem. The thesis consists of three parts: "Introduction to machine learning", which briefly introduces the concept of machine learning and its applications; "Theoretical background", which presents the classifiers used in solving the problem; and lastly "Case study", which applies all the above theories to a real-life problem. From the results, some important conclusions can be drawn, such as interesting insights into the dataset and how to build the best possible prediction model.
CHAPTER 1 INTRODUCTION TO MACHINE LEARNING
1.1 Definition
Machine learning is an application of AI technology and can be defined as the study of machines that have the ability to automatically learn and improve from fed training data without the need for constant programming or human interference [1]. Basically, machine learning uses algorithms to solve classification or regression problems by developing an automated model to predict a phenomenon based on a provided set of data. Besides giving predictions, by using machine learning and its applications, users can gain better insights into data, identify and distinguish similar patterns, and determine the elements that most affect the end results. This application is especially important when experts need to find out which elements or variables drive the phenomenon. For example, when analyzing a marketing campaign, rather than identifying which campaign brings more profit, it is more useful to determine which attributes drive the success of the campaign. Using the identified attributes, managers can get a better idea of how and why customers respond to a marketing campaign and improve the company's next strategies.
1.2 Application
Machine learning implementation expands every day as industries need to process vast amounts of data. It helps enterprises gain better insight into and understanding of acquired data in order to perform more effectively. There are many fields in which machine learning has proved its usefulness, namely [17]:
Finance: Machine learning is mostly used to provide insight and detect fraudulent transactions. It can help investors identify valuable opportunities or decide when to trade; importantly, it can determine which profiles are high risk or which transactions are likely to be fraudulent.
Healthcare: As mentioned, machine learning helps experts improve diagnosis and treatment, especially in areas requiring detailed and complex data such as genetics or brain cells.
Retail: Machine learning has recently shown its importance in the retail industry. Its implementation can increase sales by analyzing purchasing history, giving relevant recommendations, and personalizing the shopping experience.
Others: Machine learning applications are common in day-to-day life, including virtual assistants (Siri, Google Assistant, Alexa, etc.), face recognition and "people you may know" on Facebook, spam filtering in email, etc.
1.3 Classification
Supervised learning: The algorithm learns from past labeled data by comparing its outcomes with the actual end results; the model is then modified accordingly to provide the most correct outcome possible when applied to a similar unknown data set [17][18].
Unsupervised learning: The algorithm is applied to a set of plain data without associated outcomes, leaving it to figure out hidden patterns or valuable information by itself. It is mostly used to identify meaning and insight in data, for instance, to determine customer segments to be treated similarly in a marketing campaign [17][18].
Semi-supervised learning: The algorithm is applied to a data set consisting of both labeled and unlabeled data, as acquiring fully labeled data is costly and consumes time and effort. Semi-supervised learning is used for the same purposes as the supervised kind [17][18].
Reinforcement learning: Using unlabeled data like unsupervised learning, the algorithm in reinforcement learning uses trial and error to determine the greatest reward. It is mostly applied in robotics, video games, etc. [17][18]
1.4 Advantages and disadvantages
1.4.1 Advantages
Recently, machine learning has gained more fame and been implemented in ever more fields due to its vast advantages [36][28].
Automation: As mentioned above, the most useful advantage of machine learning is its ability to self-learn. Using developed algorithms, machine learning can constantly analyze real-time data and provide the required outcomes effectively. However, users cannot rely on it completely; different implementations require different algorithms and preprocessing steps.
Application: Nowadays, many fields apply machine learning in their business processes, namely medicine, finance, meteorology, science, technology, etc. Its wide variety of applications makes it worth investing in.
Data handling: Along with the development of the world comes an increase in the scope of data. The amount of data that needs to be analyzed, together with the requirements for time and accuracy, makes the task difficult, not to say impossible, for human cognitive ability. On the other hand, machine learning can analyze enormous, real-time data sets and provide accurate outcome variables.
1.4.2 Disadvantages
Algorithm: Despite its self-learning ability, machine learning cannot be automated from the beginning. The selection of an algorithm remains a manual task that requires an investment of time and effort to achieve the best possible outcome.
Data: The application of machine learning is based on data; it needs data to be able to function. However, when applying machine learning to an enormous set of data, especially real-time data that is updated constantly, users may face the problem of data inconsistency, which is not a good sign for the developed algorithms.
1.5 Accuracy in machine learning
As mentioned above, machine learning was born to predict phenomena using a set of collected data. The predictions are applied in many realistic fields such as business, finance, medicine, etc. that require relatively high accuracy in order to function properly. One way of measuring prediction model performance in machine learning is using a metric called the confusion matrix.
The confusion matrix is defined as "a performance measurement for machine learning classification problem where output can be two or more classes" [22]. After building a prediction model, the user can evaluate the accuracy level once again and get an overview of the model's performance using the confusion matrix.
                          Predicted No (Negative)    Predicted Yes (Positive)
Actual No (Negative)      True negative              False positive
Actual Yes (Positive)     False negative             True positive
Table 1 Confusion matrix
From this matrix, there are four features that need to be considered [3]:
True positive (TP): True positive is the count of correctly predicted positive responses, which means that the model predicts "Yes" and the actual outcome is also "Yes".
True negative (TN): True negative is the contrast to true positive; it counts the correctly predicted negative responses, which means the model predicts "No" and the actual outcome is also "No".
False positive (FP): False positive counts the wrong predictions where the model predicts "Yes" but the actual value is "No".
False negative (FN): False negative also counts wrong predictions, but where the model predicts "No" and the actual value is "Yes".
These four concepts are extremely important to remember and understand, as they will be used to measure the effectiveness and accuracy of a model.
There are also four important calculated values used to evaluate the accuracy level of a prediction model, namely [34]:
Accuracy: As the name suggests, the accuracy score measures the accuracy of a prediction model, i.e. whether it gives correct predictions or not. It is easy to understand and is often the first method a user thinks of when reviewing a model. It is simply the ratio of correctly predicted responses to total responses. Although it is easy to use, the accuracy score is more effective on a balanced dataset ("Yes" and "No" responses are almost equal) and may not mean much on an imbalanced dataset.
Precision: The precision score is the ratio of correctly predicted positive responses to total predicted positive responses. More specifically, it asks: out of those the model predicts as positive (in this case, "Yes"), how many actually responded positive ("Yes")? The precision score falls into the range [0;1]; the higher the score, the better the model.
Recall: The recall score is the ratio of correctly predicted positive responses to total actual positive responses. In this example, the question is: among those who actually responded "Yes", how many were captured by the model? Similar to the precision score, the recall score also ranges over [0;1], and the higher the score, the better.
F1 score: The F1 score takes both recall and precision into consideration. Beginners may find it difficult to understand the F1 score deeply; however, while the accuracy score works best with a balanced dataset, the F1 score proves its effectiveness on an imbalanced dataset.
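The four scores above can be computed directly from the confusion matrix counts. The following sketch is not taken from the thesis; the label lists are made up for illustration, with 1 standing for "Yes" and 0 for "No".

```python
# Toy labels: 1 = "Yes" (positive), 0 = "No" (negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion matrix counts, as defined in the table above
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy  = (tp + tn) / len(y_true)   # correct predictions / total
precision = tp / (tp + fp)            # of predicted "Yes", how many were right
recall    = tp / (tp + fn)            # of actual "Yes", how many were captured
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R

print(tp, tn, fp, fn)                  # 3 3 1 1
print(accuracy, precision, recall, f1) # 0.75 0.75 0.75 0.75
```

In practice the same values come from `sklearn.metrics` (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`), but the hand calculation shows exactly which cells of the confusion matrix each score uses.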
CHAPTER 2 THEORETICAL BACKGROUND
2.1 Decision tree
2.1.1 Definition
Unlike other algorithms in the same supervised learning category, the decision tree is outstanding in that it can be applied to both classification and regression problems.
Classification problems are those demanding categorical output variables. For example, when users try to predict whether a product price is below or above customers' expected level, it can be considered a classification problem.
Regression problems work with numerical or continuous data. Their output variables are often predictions about prices, growth percentages, etc. A specific example of a regression problem is when users are asked to predict house prices, given a related house price dataset.
Basically, a decision tree is used to train a machine so that it can predict output variables, which can be either categorical or numerical values, based on rules learned from the provided data.
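As a small illustration of this idea, the sketch below (assuming scikit-learn is installed; the numeric encoding of the features is a hypothetical choice, not from the thesis) trains a decision tree on the go-out-or-stay-home example discussed in the next section:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoding: weather 0 = sunny, 1 = rainy; wind 0 = weak, 1 = strong
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [1, 1, 1, 0]   # label 1 = "go out and play", 0 = "stay home": go out unless rainy AND windy

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[1, 1]]))   # rainy + strong wind -> [0], stay at home
print(tree.predict([[1, 0]]))   # rainy + weak wind   -> [1], go out
```

The fitted tree reproduces the human decision rule: it first tests the weather, and only under rain does it go on to test the wind.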
2.1.2 Decision tree graph description
As mentioned above, the decision tree uses a tree model to solve classification or regression problems. Figure 1 is a simple example of a decision tree in which a person decides to go out and play or stay at home and study, depending on some criteria. The first criterion is the weather: if the weather is sunny, he will surely go out and play; however, if the weather is rainy, another criterion is taken into consideration, which is the wind. If the wind is weak, he still wants to go out; otherwise, he will stay at home and study.
There are some features that need to be explained in order to fully understand a tree model:
A tree is a graph in which any two vertices are connected by exactly one path.
A node represents a test or a question (e.g. whether the weather is sunny or rainy); each branch splitting from that node stands for a different outcome of the question.
A rooted tree is a tree model in which one specific node is designated as the root and all paths lead away from that node.
If there is a path leading from node t1 to node t2, then node t1 is the parent of node t2 and node t2 is a child of node t1.
Figure 1 Decision tree graph description
An internal node is a node that has one or more children; on the other hand, a terminal node has no children. A terminal node is also called a leaf node, which basically holds the decision made once the conditions above it are satisfied.
A binary tree is a rooted tree in which each internal node has exactly two children.
The depth of a tree model is measured by the longest path from the root to a leaf. The size of a tree model is the total number of nodes in the tree.
In general, a decision tree is a model that uses a classifier to determine output from input, represented by a rooted tree in which each node is a subspace of the input and the root node is the whole input. Nodes are then split into children nodes representing the corresponding subspaces, divided using a split question st.
2.1.3 Induction
The most obvious purpose of building a decision tree model is to learn how to partition a learning set so as to provide the most reliable outcome variables possible. A decision tree solves a classification or regression problem with a tree-based model by partitioning the dataset into subsets depending on the similarity of attributes.
In the process of building a model, users may find more than one way of explaining the learning set effectively; in that case, the explanation containing the fewest assumptions is preferred, which means choosing the simplest method applied to the data, or the smallest tree. This is generally easy to understand: the smaller the tree, the more understandable and easier to read it is. However, choosing the best model for the data does not depend only on the size of the tree and remains a difficult task, since too big a tree can present the overfitting problem, while too small a tree presents the under-fitting situation (these two problems are explained more clearly below).
In order to identify the best way to partition the dataset into subspaces and optimize certain criteria, the following features need to be taken into consideration:
Splitting process
Selecting the attribute test condition
Identifying the best way to split
When to stop splitting
2.1.3.1 Splitting process
a) Selecting the attribute test condition [13]
There are two criteria to pay attention to when splitting a dataset: the type of attribute and the number of subsets to split into.
Nominal: The nominal attribute type contains only words, with no quantitative values, and cannot be sorted or ranked in any meaningful order by itself.
In the simple example above, the weather dataset can be partitioned into subsets in different ways; all subsets are of nominal type (Sunny, Overcast, Rainy) and cannot be sorted in a meaningful way without corresponding sub-data. Depending on the data at hand, users can divide the weather dataset into three separate subspaces or group two of them into one.
- Ordinal: Ordinal attributes are also non-numerical values, but they contain a meaningful order.
Figure 2 Nominal test condition
In this example, the wind dataset has three ordinal attribute values: Strong, Medium and Weak. These three values are categorized based on how strong the wind was that day, and the results can be measured and sorted in a meaningful way. Similar to the previous example, users can also divide the wind dataset in different ways: into three separate subsets, or by combining two of them into one.
Continuous: Unlike the two attribute types above, the continuous type contains quantitative values and can take infinitely many values within a selected range of numbers. While with nominal and ordinal types the user can simply divide subsets using their labels, with the continuous type the user needs to separate the attribute into ranges or into a binary decision.
The way of dividing continuous attributes into different ranges of numbers is called discretization. Ranges of numbers can be identified by equal division, equal percentiles or clustering.
The second way to divide continuous attributes is a binary decision. Instead of partitioning the numbers into many different ranges, the user can identify a single number that splits the dataset into two meaningful subsets.
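Both approaches can be sketched with pandas (assuming it is installed; the temperature values, bin counts and labels below are made up for illustration):

```python
import pandas as pd

temps = pd.Series([3, 8, 15, 21, 27, 33])   # hypothetical temperature readings

# 1) Discretization: equal-width ranges (pd.cut) or equal percentiles (pd.qcut)
equal_width = pd.cut(temps, bins=3, labels=["cold", "mild", "hot"])
equal_freq  = pd.qcut(temps, q=3, labels=["low", "mid", "high"])
print(list(equal_width))   # ['cold', 'cold', 'mild', 'mild', 'hot', 'hot']

# 2) Binary decision: a single threshold splits the data into two subsets
print(list(temps > 20))    # [False, False, False, True, True, True]
```

Equal division gives bins of equal width, while equal percentiles put (roughly) the same number of samples into each bin; which one is appropriate depends on how skewed the attribute is.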
b) Identifying the best split
Now that the selection of the test condition has been introduced, another task that remains difficult for the user is determining the way of splitting that provides optimized, meaningful results.
For example, suppose a user has a dataset containing the height of each student in a school, and the purpose is to draw out valuable information about students' heights.
The first way to split the data a user may think of is to calculate the average height of each class in the school. However, as in the above graph, the differences between the classes are small, and the user cannot draw any meaningful conclusion besides class A having the lowest average height and class C the highest; with a difference of only 1 cm, that conclusion does not carry any outstanding meaning.
Changing the way of splitting the dataset, the user partitions the attribute based on gender, and now the difference between the two heights can be seen clearly, yielding a more meaningful conclusion.
In general, to obtain the best possible way of partitioning a dataset, the user needs to ensure that attributes in the same subset are homogeneous, meaning they are closely related or similar. Conversely, the subsets themselves are required to be heterogeneous: they need to be easily distinguishable from each other and yield noticeable differences.
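Homogeneity of a subset is commonly scored with an impurity measure such as Gini impurity; a good split is one whose child subsets have much lower impurity than their parent. A minimal sketch follows (not from the thesis; `gini` is a hypothetical helper and the label lists are made up):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions.
    0.0 means a perfectly homogeneous subset; 0.5 is the worst case
    for two classes (a maximally mixed subset)."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["Yes", "Yes", "Yes", "Yes"]))   # 0.0   -> perfectly homogeneous
print(gini(["Yes", "No", "Yes", "No"]))     # 0.5   -> maximally mixed
print(gini(["Yes", "Yes", "Yes", "No"]))    # 0.375 -> somewhere in between
```

A decision tree induction algorithm evaluates candidate splits by the weighted impurity of the resulting children and picks the split that lowers it the most; entropy (information gain) is used the same way.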
2.1.3.2 Stop splitting
After knowing how to split data, the next question would be: "How does the user know when to stop partitioning?" This question is actually quite easy to answer; it is based on two criteria [38]:
- Stop expanding the tree model when all attributes fall into the same category.
- Stop expanding the tree model when the features are all used up.
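The two stopping criteria above can be sketched as a small helper function (`should_stop` is a hypothetical name, not from the thesis):

```python
def should_stop(labels, features):
    """Decide whether a node should become a leaf instead of splitting further."""
    all_one_category = len(set(labels)) == 1   # every sample already agrees
    features_used_up = len(features) == 0      # nothing left to split on
    return all_one_category or features_used_up

print(should_stop(["Yes", "Yes", "Yes"], ["wind"]))        # True: pure node
print(should_stop(["Yes", "No"], []))                      # True: no features left
print(should_stop(["Yes", "No"], ["wind", "weather"]))     # False: keep splitting
```

Real implementations usually add further criteria, such as a maximum depth or a minimum number of samples per leaf, which is exactly how over-growing the tree is prevented.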
2.1.4 Advantages and disadvantages
2.1.4.1 Advantages: [10][11]
The decision tree is a non-parametric model, which means it involves no parameters in the process and makes no prior assumptions about its model; this enables the decision tree to recognize complex relations within the dataset.
With basics that work similarly to the human decision-making process, the decision tree is the algorithm that makes the most sense to beginners, and is thereby easy to interpret and understand.
The decision tree can work well with noise and missing values, as these do not affect the process of building a model.
As a result of the above advantage, working with a decision tree requires less data pre-processing, normalizing or scaling.
Tree models are usually small and compact compared with other classifiers. The decision tree can also minimize the size of the dataset, as the depth of a tree is much smaller than the number of attributes in a dataset.
As one of the earliest developed algorithms, the decision tree is the foundation of many modern ones, including random forest.
2.1.4.2 Disadvantages
There are two typical disadvantages users can easily encounter when using a decision tree, which are under-fitting and overfitting. To prevent these situations, users should apply splitting conditions that are neither too strict nor too loose when partitioning the dataset.
o Overfitting: Overcomplicating the dataset makes the tree unnecessarily big and complex, and users will find it difficult to understand and process a huge decision tree. An overfitting tree is usually complex and has a high chance of errors.
o Under-fitting: Contrary to overfitting, an under-fitting tree oversimplifies the dataset, resulting in a small tree that can miss important attributes. Although users can easily understand it, as it is small, an under-fitting tree can make many errors.
The decision tree is a greedy model: at each split it finds the attribute that is optimal for that split, which can result in a tree that is not optimal as a whole. Also, as a greedy model, the decision tree can produce unnecessarily complex results in order to optimize each split as much as possible.
Although the decision tree can work with both categorical and numerical data, it is inadequate when applied to numerical data. It can be inflexible, as the way of splitting depends on the Y-axis and X-axis, so results when working with numbers cannot be as effective as with other classifiers.
2.2 Random forest
2.2.1 Definition
Random forest is an ensemble of decision trees built on different subsamples taken out of the training dataset. Ensembling and averaging many decision trees with different structures helps to produce predictions with higher accuracy [8].
There are many decision trees within a random forest, each built on a slightly different training set subsampled from the original dataset. At each splitting node, the best split is considered to optimize the model; this process is repeated until all attributes at each leaf node fall under one category or the maximum depth of the tree is reached. Finally, the predictions are made by averaging the results of all trees (for regression) or taking the outcome with the highest vote among the trees (for classification).
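The process described above can be sketched with scikit-learn (assuming it is installed; the generated dataset merely stands in for a real customer dataset, and all parameter values are illustrative, not the thesis' settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for a real customer dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 trees (n_estimators), each grown on a different bootstrap subsample;
# predictions are combined by majority vote across the trees
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_tr, y_tr)
print(round(forest.score(X_te, y_te), 2))   # accuracy on held-out data
```

Each tree also considers only a random subset of features at every split, which is what de-correlates the trees and makes their averaged vote more stable than any single tree.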
2.2.2 Why random forest is better than decision tree
As mentioned above, the decision tree is a greedy classifier: it chooses the best possible split at each split node to optimize the model, even though this can lead to problems with the model, such as overfitting or under-fitting. Random forest was thereby developed to overcome this defect.
A decision tree functions well and partitions the training dataset to optimize the final results, especially when the user does not set a max depth for the model; however, the purpose of a decision tree is not only to fit the training data well, but also to function well and predict accurately on new data.
There are two possible failure cases when building a decision tree model, overfitting and under-fitting:
Overfitting: the model becomes too big and complex, also known as a flexible model. It over-partitions the dataset and memorizes both actual relations and noise, so the overfitting model is no longer accurate on new data. A flexible model is said to have high variance, since a small change in the training data leads to a considerable change in the model.
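The overfitting case above can be made concrete with a small experiment (a sketch assuming scikit-learn; the dataset and parameters are made up): an unconstrained tree memorizes noisy training data perfectly, while a depth-limited tree does not.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 20% of the labels are flipped, i.e. the data contains deliberate noise
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)           # no depth limit
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print(deep.score(X_tr, y_tr))      # 1.0: the flexible tree memorizes the noise
print(shallow.score(X_tr, y_tr))   # below 1.0: the constrained tree cannot
print(deep.score(X_te, y_te), shallow.score(X_te, y_te))   # held-out accuracy
```

The unconstrained tree's perfect training score is exactly the memorization described above; limiting the depth trades some training accuracy for a simpler, lower-variance model, and random forest pushes the same idea further by averaging many such trees.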