Hanoi - Year 2020
TO PREDICT CUSTOMERS’ BEHAVIORS
SUPERVISOR: Dr. Tran Duc Quynh
STUDENT: Nguyen Huong Ly
STUDENT ID: 16071293 COHORT:
MAJOR: MIS20164
LETTER OF DECLARATION
I hereby declare that the Graduation Project APPLYING RANDOM FOREST TO PREDICT CUSTOMERS' BEHAVIORS is the result of my own research and has never been published in any work of others. During the implementation process of this project, I have seriously followed research ethics; all findings of this project are results of my own research and surveys; all references in this project are clearly cited according to regulations.
I take full responsibility for the fidelity of the numbers, data and other contents of my graduation project.
Hanoi, June 2020
Student
Nguyen Huong Ly
ACKNOWLEDGEMENT
Firstly, I would like to express my sincere appreciation toward Dr. Tran Duc Quynh. I am proud and honored to have been guided and helped to finish my graduation thesis under his supervision.
Secondly, I would also like to express my gratitude to the teachers and professors who have taught me, giving me enough knowledge and skills to be able to finish my graduation thesis.
Last but not least, sincere thanks to my family and friends who always stay by my side and encourage me to overcome challenges during the process of writing this graduation thesis.
TABLE OF CONTENTS
CHAPTER 1. INTRODUCTION TO MACHINE LEARNING
1.1 Definition
1.2 Application
1.3 Classification
1.4 Advantages and disadvantages
1.5 Accuracy in machine learning
CHAPTER 2. THEORETICAL BACKGROUND
2.2.2 Why random forest is better than decision tree
2.2.3 How random forest works
2.2.4 Advantages and disadvantages
2.2.5 Application
2.3 Multilayer perceptron
2.3.1 Definition
2.3.2 How multilayer perceptron works
2.3.3 Advantages and disadvantages
3.3.3 Model and result
TABLE OF NOTATIONS AND ABBREVIATIONS
LIST OF TABLES
Table 1. Confusion matrix
Table 2. Random forest results
LIST OF CHARTS AND FIGURES
Figure 1. Decision tree graph description
Figure 2. Nominal test condition
Figure 3. Ordinal test condition
Figure 4. Continuous test condition
Figure 5. Best split
Figure 6. Identify best split
Figure 7. Step splitting
Figure 8. Stop splitting
Figure 9. Multilayer perceptron
Figure 10. General description
Figure 11. Number of responses "Yes" and "No"
Figure 12. Relationship of renew offer and response
ABSTRACT
The overall purpose of this thesis is to apply machine learning, specifically random forest and multilayer perceptron, to solve a realistic problem. The thesis consists of three parts: "Introduction to machine learning", which briefly introduces the concept of machine learning and its applications; "Theoretical background", which presents the concepts of the classifiers used in solving the problem; and lastly "Case study", which applies all the above theories to a real-life problem. After solving it, some important conclusions can be drawn, such as interesting insights into the dataset and how to build the best possible prediction model.
CHAPTER 1. INTRODUCTION TO MACHINE LEARNING
1.1 Definition
Machine learning is an application of AI technology and can be defined as the study of machines that have the ability to automatically learn and improve using fed data for training, without the need for constant programming or human interference [1]. Basically, machine learning uses algorithms to solve classification or regression problems by developing automated models that predict phenomena based on a provided set of data.
Besides giving predictions, by using machine learning and its applications, users can gain better insights into data, identify and distinguish similar patterns, and determine the elements that most affect the end results. This is especially important when experts need to find out which elements or variables drive a phenomenon. For example, when analyzing a marketing campaign, rather than identifying which campaign brings more profit, it is more useful to determine which attributes drive the success of the campaign. Using the identified attributes, managers can have a better idea of how and why customers respond to a marketing campaign and improve the company's next strategies.
1.2 Application
Machine learning implementation expands each day as industries need to process vast amounts of data. It helps enterprises gain better insights into and understanding of acquired data so as to perform more effectively. There are many fields in which machine learning has proved its usefulness, namely [17]:
— Finance: Machine learning is mostly used to provide insights and detect fraudulent transactions. It can help investors identify valuable opportunities or when to trade; importantly, it can determine which profile is high-risk or which transaction is likely to be fraudulent.
— Healthcare: As mentioned, machine learning helps experts improve diagnosis and treatment, especially in areas requiring detailed and complex data such as genetics or brain cells.
— Retail: Machine learning has shown its importance in the retail industry recently. Its implementation can lead to an increase in sales quantity by analyzing purchasing history, giving relevant recommendations and personalizing the shopping experience.
— Others: Machine learning applications are common in day-to-day life, including virtual assistants (Siri, Google Assistant, Alexa, etc.), face recognition and "people you may know" on Facebook, spam filtering in email, etc.
1.3 Classification
— Supervised learning: The algorithm learns from past labeled data and the comparison between its outcomes and the actual end results; the model is then modified accordingly to provide the most correct possible outcome when applied to a similar unknown data set [17][18].
— Unsupervised learning: The algorithm is applied to a set of plain data without associated outcomes, leaving it to figure out hidden patterns or valuable information by itself. It is mostly used for identifying meaning and insight in data, for instance, to determine customer segments to be treated similarly in a marketing campaign [17][18].
— Semi-supervised learning: The algorithm is applied to a set of data consisting of both labeled and unlabeled data, as acquiring fully labeled data is costly, time- and effort-consuming. Semi-supervised learning is used for the same purposes as the supervised kind [17][18].
— Reinforcement learning: Using unlabeled data like unsupervised learning, the algorithm in reinforcement learning uses trial and error to determine the greatest reward. It is mostly applied in robotics, video games, etc. [17][18].
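The contrast between the first two paradigms can be sketched in a few lines. The following is an illustrative, standard-library-only example (the toy height values, the "nearest class mean" classifier and the one-threshold clustering are assumptions for demonstration, not methods from this thesis): the supervised part learns from labeled examples, while the unsupervised part must group the same values with no labels at all.

```python
# Supervised: labeled heights -> learn one mean per class, classify by nearest mean.
labeled = [(150, "short"), (155, "short"), (180, "tall"), (185, "tall")]

def train_nearest_mean(data):
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}   # class -> mean value

means = train_nearest_mean(labeled)

def predict(x, means):
    # Assign the class whose mean is closest to x.
    return min(means, key=lambda y: abs(x - means[y]))

# Unsupervised: the same values without labels -> split into two groups
# (a crude two-cluster split at the midpoint of the value range).
unlabeled = [150, 155, 180, 185]
threshold = (min(unlabeled) + max(unlabeled)) / 2
clusters = [[x for x in unlabeled if x <= threshold],
            [x for x in unlabeled if x > threshold]]
```

The supervised model can now answer questions about new points ("is 152 short or tall?"), whereas the unsupervised split only says which values belong together.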
1.4 Advantages and disadvantages
1.4.1 Advantages
Recently, machine learning has gained more fame and been implemented in ever more fields due to its vast advantages [36][28]:
— Automation: As mentioned above, the most useful advantage of machine learning is its ability to self-learn. Using developed algorithms, machine learning can constantly analyze real-time data and provide the required outcomes effectively. However, users cannot rely on it completely; different implementations require different algorithms and preprocessing steps.
— Application: Nowadays there are many fields applying machine learning in their business processes, namely medicine, finance, meteorology, science, technology, etc. Its wide variety of applications makes it worth investing in.
— Data handling: Along with the development of the world comes an increase in the scope of data. The amount of data that needs to be analyzed and the requirements in time and accuracy make it difficult, not to say impossible, for human cognitive ability. On the other hand, machine learning can analyze enormous, real-time sets of data and provide accurate outcome variables.
1.4.2 Disadvantages
— Algorithm: Despite its self-learning ability, machine learning cannot be automated from the beginning. The selection of an algorithm remains a manual task which requires an investment of time and effort to achieve the best possible outcome.
— Data: The application of machine learning is based on data; it needs data to be able to function. However, applying machine learning to an enormous set of data, especially real-time data updated consistently, users may face the problem of data inconsistency, which is not a good sign for the developed algorithms.
1.5 Accuracy in machine learning
As mentioned above, machine learning was born to predict phenomena using a set of collected data. The prediction is applied in many realistic fields such as business, finance, medicine, etc. that require relatively high accuracy to be able to function properly. One way of measuring prediction model performance in machine learning is a metric called the confusion matrix.
The confusion matrix is defined as "a performance measurement for machine learning classification problems where output can be two or more classes" [22]. After finishing building a prediction model, the user can evaluate the accuracy level once again and have an overview of the performance of the model using the confusion matrix.
Table 1. Confusion matrix (cells: true positive, false negative, false positive, true negative)
— True positive (TP): True positive measures the number of correctly predicted positive responses, which means the model predicts Yes and the actual outcome is Yes.
— True negative (TN): True negative is the contrast of true positive; it measures the number of correctly predicted negative responses, which means the model predicts No and the actual outcome is No.
— False positive (FP): False positive measures false values where the model predicts Yes but the actual value is No.
— False negative (FN): False negative also measures false values, but where the model predicts No and the actual value is Yes.
These four concepts are extremely important to remember and understand, as they will be used to measure the effectiveness and accuracy of a model.
There are also four important calculated values that are used to evaluate the accuracy level of a prediction model, namely [34]:
— Accuracy: As the name suggests, the accuracy score is used to measure the accuracy of a prediction model, i.e. whether it gives correct predictions or not. It is really easy to understand and can be the first method a user thinks of when reviewing a model. It basically is the ratio of correctly predicted responses to total responses. Although it is easy to use, the accuracy score is more effective when applied to a balanced dataset (Yes and No responses are almost equal) and may not show much meaning when used on an imbalanced dataset.
— Precision: The precision score is the ratio of correctly predicted positive responses to total predicted positive responses. More specifically, it asks the question: out of those the model predicts as positive (in this case, Yes), how many actually respond positive (Yes)? The precision score falls into the range [0;1]; the higher the score, the better the model.
— Recall: The recall score is the ratio of correctly predicted positive responses to total actual positive responses. In this example, the question is: among those who actually respond Yes, how many were captured by the model? Similar to the precision score, the recall score also ranges over [0;1], and the higher the score, the better.
— F1 score: The F1 score takes into consideration both the recall and precision scores. Beginner users may find it difficult to understand the F1 score deeply; however, while the accuracy score works best with a balanced dataset, the F1 score proves its effectiveness when applied to an imbalanced dataset.
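The four scores above follow directly from the TP/TN/FP/FN counts. A minimal pure-Python sketch (the toy Yes/No response lists are illustrative assumptions, not data from the case study):

```python
# Count the four confusion-matrix cells from actual vs. predicted labels.
def confusion_counts(actual, predicted, positive="Yes"):
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

def scores(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)      # of predicted Yes, how many really are Yes
    recall = tp / (tp + fn)         # of actual Yes, how many were captured
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

actual    = ["Yes", "Yes", "No", "No",  "Yes", "No"]
predicted = ["Yes", "No",  "No", "Yes", "Yes", "No"]
tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy, precision, recall, f1 = scores(tp, tn, fp, fn)
```

On this toy data the counts are TP=2, TN=2, FP=1, FN=1, so accuracy is 4/6 while precision, recall and F1 are each 2/3, showing how the four values answer different questions about the same predictions.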
CHAPTER 2. THEORETICAL BACKGROUND
2.1 Decision tree
2.1.1 Definition
Tree-based models are among the most effective methods for producing reliable and understandable results and predictions, applicable to almost any kind of data.
A decision tree is an algorithm that falls into the category of supervised learning. It solves problems by using a tree model in which each class label is represented by a node and attributes correspond to internal nodes [7].
Unlike other algorithms in the same supervised learning category, decision trees are outstanding in the fact that they can be applied to both classification and regression problems.
— Classification problems can be understood as those demanding categorical output variables. For example, when users try to predict whether a product price is below or above customers' expected level, it can be considered a classification problem.
— Regression problems work with numerical or continuous data. Their output variables are often predictions about prices, growth percentages, etc. A specific example of a regression problem is when users are asked to give a prediction about a house price, provided a related house price dataset.
Basically, a decision tree is used to train a machine from which it can give predictions about output variables, which can be either categorical or numerical values, based on rules learned from the provided data.
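Fitting such a tree takes only a few lines. The sketch below assumes scikit-learn is installed; the integer encodings and toy weather/wind rows are illustrative assumptions that mirror the play-or-study example discussed in the next section, not data from the thesis.

```python
from sklearn.tree import DecisionTreeClassifier

# Features: [weather (0 = sunny, 1 = rainy), wind (0 = weak, 1 = strong)]
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = ["play", "play", "play", "study"]   # target label for each row

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)                          # learn split rules from the data

print(model.predict([[1, 1]]))           # rainy day with strong wind
```

The fitted tree encodes the same kind of if/else rules a person would write by hand, which is why the next section can describe it with a simple graph.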
2.1.2 Decision tree graph description
Figure 1. Decision tree graph description
As mentioned above, a decision tree uses a tree model to solve classification or regression problems. This is a simple example of a decision tree in which a person decides to go out and play or stay at home and study depending on some criteria. The first criterion is the weather: if the weather is sunny, he will surely go out and play; however, if the weather is rainy, another criterion is taken into consideration, which is the wind. If the wind is weak, he still wants to go out; on the other hand, if the wind is strong, he will stay at home and study.
— A node represents a test or a question (e.g. whether the weather is sunny or rainy); each branch splitting from that node stands for a different outcome of the question.
— A rooted tree is a tree model in which one specific node is designated as the root and all paths lead away from that node.
— If there is a path leading from node t1 to node t2, then node t1 is the parent of node t2 and node t2 is the child of node t1.
— An internal node is a node that has one or more children; on the other hand, a terminal node is a node that has no child. A terminal node is also called a leaf node, which basically is the decision made once the above conditions are satisfied.
— A binary tree is a rooted tree in which each internal node has exactly two children.
— The depth of a tree model is measured by the longest path from the root to a leaf. The size of a tree model is the total number of nodes in that tree.
In general, a decision tree is a model using a classifier to determine output from input, presented by a rooted tree in which each node is a subspace of the input and the root node is the whole input. Nodes are then split into different children nodes representing the corresponding subspaces, divided using a split question s.
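The terminology above (root, parent/child, leaf, depth, size) maps directly onto a small data structure. This is a hypothetical sketch in plain Python; the `Node` class and the example questions are illustrative, not code from the thesis.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    question: Optional[str] = None   # split question at an internal node
    decision: Optional[str] = None   # decision stored at a leaf (terminal) node
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children     # terminal node: no children

# Root covers the whole input; each child covers a subspace of it.
root = Node(question="weather == sunny?",
            children=[Node(decision="play"),
                      Node(question="wind == weak?",
                           children=[Node(decision="play"),
                                     Node(decision="study")])])

def depth(node):
    # Longest path from this node down to a leaf.
    return 0 if node.is_leaf() else 1 + max(depth(c) for c in node.children)

def size(node):
    # Total number of nodes in the (sub)tree.
    return 1 + sum(size(c) for c in node.children)
```

Here `depth(root)` is 2 and `size(root)` is 5, matching the definitions of depth and size given above.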
2.1.3 Induction
The most obvious purpose of building a decision tree model is to learn how to partition a learning set so as to provide the most reliable possible outcome variables. A decision tree solves a classification or regression problem using a tree-based model by partitioning the dataset into subsets depending on the similarities of attributes.
In the process of building a model, users can find more than one way of explaining the learning set effectively; the explanation containing the fewest assumptions will be preferred, which means choosing the simplest method applied to the data, or the smallest tree. This is generally easy to understand, as the smaller the tree, the more understandable and easier to read it is. However, choosing the best model for the data does not depend only on the size of the tree and still remains a difficult task, since a small tree can present an under-fitting problem, while too big a model presents an overfitting situation (these two problems will be explained more clearly below).
In order to identify the best way to partition a dataset into subspaces and optimize certain criteria, there are features that need to be taken into consideration:
— Splitting process
• Select attribute test condition
• Identify the best way to split
— When to stop splitting
2.1.3.1 Splitting process
a) Select attribute test condition [13]
There are two criteria to pay attention to when splitting a dataset: the type of attribute and how many subsets should be split.
— Nominal: The nominal attribute type contains only words, no quantitative values, and cannot be sorted or ranked in any meaningful order by itself.
Figure 2. Nominal test condition
In the simple example above, the weather dataset can be partitioned into subsets in different ways; all subsets are of nominal type — Sunny, Overcast, Rainy — and cannot be sorted in a meaningful way without corresponding sub-data. Depending on the data at hand, users can divide the weather dataset into three separated subspaces or group two of them into one group.
— Ordinal: Ordinal attributes are also non-numerical values, but they contain a meaningful order.
Figure 3. Ordinal test condition
In this example, the wind dataset has three ordinal attributes — Strong, Medium and Weak. These three attributes are categorized based on how strong the wind was that day, and the results can be measured and sorted in a meaningful way. Similar to the previous example, users can also divide the wind dataset in different ways: three separated subsets, or combining two of them into one.
— Continuous: Unlike the two above types of attribute, the continuous type contains quantitative values and can have infinitely many values within a selected range of numbers. While with nominal and ordinal types the user can simply divide subsets using their labels, with the continuous type the user needs to separate the attribute into ranges or into a binary decision.
The way of dividing continuous attributes into different ranges of numbers is called discretization. Ranges of numbers can be identified by equal division, equal percentiles or clustering.
The second way to divide continuous attributes is a binary decision. Instead of partitioning numbers into many different ranges, the user can identify a single number that splits the dataset into two meaningful subsets.
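Both treatments of a continuous attribute can be sketched in a few lines of plain Python (the toy ages, the choice of three equal-width ranges and the threshold of 40 are illustrative assumptions):

```python
ages = [15, 22, 34, 41, 58, 63]

# Discretization: equal-width division of the value range into 3 buckets.
lo, hi = min(ages), max(ages)
width = (hi - lo) / 3
def bucket(x):
    # Clamp the top value into the last bucket (0, 1 or 2).
    return min(int((x - lo) // width), 2)

buckets = [bucket(x) for x in ages]

# Binary decision: one threshold splits the attribute into two subsets.
threshold = 40
below = [x for x in ages if x <= threshold]
above = [x for x in ages if x > threshold]
```

Equal-percentile division or clustering would replace the fixed `width` with quantile boundaries or learned cluster centers, but the resulting test conditions take the same form.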
b) Identify the best split
Now that the selection of the test condition has been introduced, another task that remains difficult for the user is how to determine the way of splitting that provides optimized, meaningful results.
For example, a user has a dataset containing the height of each student in a school; the purpose is to draw out valuable information about students' height.
Figure 5. Best split
The first way to split the data a user can think of is to calculate the average height of each class in that school. However, as in the above graph, the differences between the classes are small and the user cannot draw any meaningful conclusion beyond class A having the lowest average height while class C has the highest; with a difference of 1 cm, that conclusion does not give any valuable information.
Changing the way of splitting the dataset, the user partitions the attribute based on gender, and now the difference between the two heights can be seen clearly and a more meaningful result can be concluded.
In general, to have the best possible way of partitioning a dataset, the user needs to take into consideration that attributes in the same subset should be homogeneous, which means they are closely related or similar. Contrarily, subsets are required to be heterogeneous; they need to be easily distinguished from each other and yield noticeable differences.
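Homogeneity within subsets can be quantified; Gini impurity is one standard criterion that decision tree implementations commonly use for this, although the thesis does not name it explicitly. In the toy sketch below (the label lists are illustrative assumptions echoing the height example), a split whose subsets each contain one dominant class scores a lower impurity than a split that mixes the classes.

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 0 for a pure subset, higher for mixed subsets.
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# Splitting students by class barely separates the labels...
by_class = [["short", "tall"], ["short", "tall"], ["tall", "short"]]
# ...while splitting by gender (in this toy data) gives pure subsets.
by_gender = [["short", "short", "short"], ["tall", "tall", "tall"]]

impurity_class = sum(gini(g) for g in by_class) / len(by_class)
impurity_gender = sum(gini(g) for g in by_gender) / len(by_gender)
```

The gender split has impurity 0 (perfectly homogeneous subsets) while the class split stays at 0.5, which is exactly the intuition behind preferring the second split in Figure 5.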
2.1.3.2 Stop splitting
After knowing how to split the data, the next question would be: "How does the user know when to stop partitioning?" This question is actually quite easy to understand and answer; it is based on two criteria [38]:
— Stop expanding the tree model when all attributes fall into the same category.
— Stop expanding the tree model when the features of the dataset are all used up.
2.1.4.1 Advantages
— With the basics working similarly to the human decision-making process, the decision tree is the algorithm that makes the most sense to beginner users; thereby it is easy to interpret and understand.
— Decision trees can work well with noisy or missing values, as these values do not affect the process of building a model.
— As a result of the above advantage, working with decision trees requires less data pre-processing, normalizing or scaling of data.
— Tree models are usually small and compact compared with other classifiers. A decision tree can also minimize the size of a dataset, as the depth of a tree is a lot smaller than the number of attributes in a dataset.
— As one of the earliest developed algorithms, the decision tree is the foundation of many other modern ones, including random forest.
2.1.4.2 Disadvantages [10][11]
— There are two typical disadvantages users can easily see when using decision trees, which are under-fitting and overfitting. To prevent these situations, when partitioning the dataset, users should not apply conditions that are too strict or too loose.
• Overfitting: Overcomplicating the dataset makes the tree too big and unnecessarily complex, and users will find it difficult to understand and process a huge decision tree. An overfitting tree is usually complex and has a high chance of errors.
• Under-fitting: Contrary to overfitting, an under-fitting tree oversimplifies the dataset, resulting in a small tree that can miss important attributes. Although users can easily understand it, as it is small, an under-fitting tree can have many errors.
— The decision tree is a greedy model; with each split it will find the attribute that is optimal for the corresponding split, but this can result in a model that is not optimal as a whole.
— Also as a greedy model, a decision tree can lead to unnecessarily complex results in order to optimize each split as much as possible.
— Although decision trees can work with both categorical and numerical data, they are inadequate when applied to numerical data. They can be inflexible, as the way of splitting depends on the Y-axis and X-axis; thereby the results of working with numbers may not be as effective as with other classifiers.
2.2 Random forest
2.2.1 Definition
A random forest is an ensemble of decision trees built on different subsamples taken out of the training dataset. The ensembling and averaging of many decision trees with different structures helps to produce predictions with higher accuracy [8].
There are many decision trees existing within a random forest, each built on a slightly different training set subsampled from the original dataset. At each splitting node the best split is chosen to optimize the model, and this process is repeated until all attributes at each leaf node fall under one category or the tree reaches its maximum depth. Finally, the predictions are made by averaging the results of each tree (for regression) or taking the one with the highest vote among the trees (for classification).
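The procedure just described — many trees, each fit on a bootstrap subsample, combined by majority vote — can be sketched as follows, assuming scikit-learn is installed. The repeated toy rows and the Yes/No target are illustrative assumptions, not the case-study data.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data: the response is "Yes" only when both features are 1.
# Rows are repeated so bootstrap subsampling has something to draw from.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
y = ["No", "No", "No", "Yes"] * 5

# n_estimators trees, each grown on a bootstrap sample of the rows;
# the forest predicts by majority vote across its trees.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)

print(forest.predict([[1, 1]]))
```

Each individual tree sees slightly different data and may differ in structure, which is precisely why the averaged vote tends to be more stable than any single decision tree, as section 2.2.2 argues.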
2.2.2 Why random forest is better than decision tree
As mentioned above, the decision tree is a greedy classifier; it chooses the best possible split at each split node to optimize the model, although this can lead to different problems with the model, such as overfitting or under-fitting. Thereby the random forest was developed to overcome this defect.
A decision tree functions well and partitions the training dataset to optimize the final results, especially when the user does not set a maximum depth for the model; however, the purpose of a decision tree is not only to fit the training data well but also to function well and predict accurately on new data.
There are two possible cases when building a decision tree model, overfitting and under-fitting:
• Overfitting: when the model becomes too big and complex, also known as a flexible model. It over-partitions the dataset, memorizing both actual relations and noise, so the overfitting model is no longer accurate. A flexible model is said to have high variance when a small change in the training data leads to a considerable change in the model.