Data Analytics MSc Dissertation MTH775P, 2019/20
Disquisitiones Arithmeticæ
Predicting the prices for breakfasts and beds
Hai Nam Nguyen, ID 161136118
Supervisor: Dr Martin Benning
A thesis presented for the degree of
Master of Science in Data Analytics
School of Mathematical Sciences Queen Mary University of London
Declaration of original work
This declaration is made on August 17, 2020.

Student's Declaration: I Student Name hereby declare that the work in this thesis is my original work. I have not copied from any other students' work, work of mine submitted elsewhere, or from any other sources except where due reference or acknowledgement is made explicitly in the text, nor has any part been written for me by another person.
Referenced text has been flagged by:
1. Using italic fonts, and
2. using quotation marks “ ”, and
3. explicitly mentioning the source in the text.
This work is dedicated to my niece Nguyen Le Tue An (Mochi), who has brought a great source of joy to me and my family recently.
Abstract

Pricing and guessing the right prices are vital for both hosts and renters on home-sharing platforms run by internet-based companies. To contribute to the growing interest and immense literature on applying Artificial Intelligence to predicting rental prices, this paper attempts to build machine learning models for that purpose using the Luxstay listings in Hanoi. The R² score is used as the main criterion for model performance, and the results show that Extreme Gradient Boosting (XGB) is the model with the best performance with R² = 0.62, beating the most sophisticated machine learning model: Neural Networks.
3.1 Dataset 5
3.2 K-Fold Cross Validation 6
3.3 Measuring Model Accuracy 7
4 Methods 9
4.1 LASSO 9
4.1.1 FISTA 10
4.2 Random Forest 12
4.3 Gradient Boosting 14
4.4 Extreme Gradient Boosting 16
4.5 LightGBM 19
4.5.1 Gradient-based One-sided Sampling 20
4.5.2 Exclusive Feature Bundling 20
4.6 Neural Networks 23
4.6.1 Adam Algorithm 25
4.6.2 Backpropagation 26
A Some special mathematical notations 32
A.1 Vector Norm 32
A.2 The Hadamard product 33
Chapter 1
Introduction
Since its establishment in 2016, Luxstay has become one of the most popular platforms for home-sharing in Vietnam along with Airbnb, with a network of more than 15,000 listings. The platform connects guests' demand to rent villas, houses, and apartments to hosts and vice versa. Hence, providing a reasonable price will help hosts gain a high and stable income while guests get great experiences in new places. Therefore, working on a sensible predictor and suggestion of Luxstay prices can generate real-life value and practical application.
Hanoi is the capital of Vietnam and has the second most listings on Luxstay. The city has also been ranked in the top 10 destinations to visit by TripAdvisor. As a dynamic city with active bookings and listings, Hanoi can be a great example for the study of Luxstay pricing.
In this paper, we build a price prediction model and compare the performance of different methods using R² as the main measure. The input of our models is the data scraped from the Hanoi page of the website, which includes continuous and categorical records about listings. We then apply a number of methods, including traditional machine learning models (LASSO, random forest, gradient boosting), Extreme Gradient Boosting, LightGBM and a neural network, to predict the prices of listings.
Chapter 2
Literature Review
The sharing economy is a socio-economic system that arranges “the peer-to-peer-based activity of obtaining, giving, or sharing the access to goods and services” through “community-based online services” (J. Hamari 2015). Home-sharing is one of the sharing activities and it has experienced significant growth due to a high demand from tourism (Guttentag 2015). Given that Luxstay is a startup from an emerging economy, the platform has not received as much attention from the academic community as Airbnb, the leading company for this service (Wang & Nicolau 2017). Nevertheless, the Vietnamese home-sharing platform has some similar characteristics to Airbnb, as it is also an internet-based company that coordinates the demand of short-term renters and hosts. Therefore, it is worth conducting a review of some findings on Airbnb from recent papers.
Gibbs et al. (2017) stated that one of the biggest challenges for Airbnb was setting the right prices, and identified two key reasons for this issue. Firstly, unlike the hotel business, where prices are set by trained experts and industry benchmarks, rental prices on Airbnb are normally determined by regular hosts with limited support. Secondly, instead of letting an algorithm control prices as Uber and Lyft do, Airbnb leaves the prices for hosts to decide, given that they might not be well informed. Consequently, these two factors may lead to a potential financial loss, and empirical evidence shows that incompetent pricing causes a loss of 46% of additional revenue on Airbnb. Hence, there has been an interest in the
study of rental price prediction on the leading platform. The two trends for this topic are hedonic-based regression and artificial intelligence techniques.
The term hedonic is defined to describe “the weighting of the relative importance of various components among others in constructing an index of usefulness and desirability” (Goodman 1998). In other words, hedonic pricing identifies the factors and characteristics affecting an item's price (Investopedia.com). Wang & Nicolau (2017) aimed to design a system to understand which features are important inputs for an automated price suggestion on Airbnb using a hedonic-based regression approach. The functional forms used were Ordinary Least Squares and Quantile Regression to analyse 25 variables of 180,533 listings in 33 cities. The result shows that features related to host attributes, such as the number of their listings and the profile pictures, are the most important features. Among those, super host status, which reveals experienced hosts on the platform, is the best one. However, the authors also discussed the limitations of this analysis. The approach rests on some economic assumptions that need to be examined; the assumption of hosts' rationality requires a qualitative check which is skipped in the study. Generally, the effectiveness of hedonic-based regression for price prediction is restricted by the model assumptions and estimation (Selim 2009).
Another approach for price prediction is to apply artificial intelligence techniques, which mainly include machine learning and neural network models. Tang & Sangani (2015) produced a model for price prediction for San Francisco listings. To reduce the complexity of the task, they turned the regression problem into a classification task that predicts both the neighbourhood and the price range of a listing, with Support Vector Machine as the main model to be tuned. Uniquely, they included images as inputs for the model by creating a visual dictionary to categorise the image of a listing. The result shows that while the price prediction achieves a high accuracy on the test set at 81.2%, the neighbourhood prediction suffers from overfitting, with a big gap between the train and test sets. Alternatively, Cai & Han (2019) attempted to work on the regression problem using the listings in Melbourne. The study implemented l1 regularisation as feature selection for all traditional machine learning methods and then compared them to models without it.
The result shows that the latter perform better overall and the gradient boosting algorithm produces the best precision with R² = 0.6914 on the test set. Recently, another study of the listings in New York holds an interesting result with a highest R² of 0.7768 (Kalehbasti et al. 2019). To gain that score, they performed a logarithmic transformation on the prices and then trained their models. Additionally, they also attempted to compare three feature selection methods: manual selection, p-value and LASSO. The analysis shows that p-value and LASSO outperformed manual selection, and the best method applied in the paper is LASSO.
In this paper, we applied the knowledge of the last three studies to build our price predictor for the listings on Luxstay. Apart from widely used traditional machine learning methods and neural networks, we also attempted to code an algorithm to compute LASSO regression ourselves and used the two recent gradient boosting techniques, Extreme Gradient Boosting and LightGBM. The project worked on the original rental prices to produce a price prediction without any logarithmic transformation.
Chapter 3
Experimental Design
Figure 3.1: Example of Luxstay Listings
Our dataset of Luxstay listings was scraped using the BeautifulSoup package in Python (Richardson 2007). It includes 2,675 listings posted in Hanoi on 27 December 2019. Each listing contains fields describing the offered price (in dollars),
district, type of home, name of its building, and the numbers of guests allowed, bedrooms and bathrooms.
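The scraping code itself is not reproduced in this thesis; the snippet below is only a minimal sketch of how such a scraper could look with BeautifulSoup, where the URL and the CSS class names (`listing-card`, `price`) are hypothetical placeholders rather than the selectors actually used for Luxstay.

```python
# Illustrative scraping sketch: URL and class names are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    """Download one results page and extract basic fields from each listing card."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for card in soup.find_all("div", class_="listing-card"):   # hypothetical class name
        name_tag = card.find("h3")
        price_tag = card.find("span", class_="price")           # hypothetical class name
        listings.append({
            "name": name_tag.get_text(strip=True) if name_tag else None,
            "price": price_tag.get_text(strip=True) if price_tag else None,
        })
    return listings

# Example call with a placeholder URL.
rows = scrape_page("https://www.luxstay.com/vi/hanoi?page=1")
```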
In order to make the dataset into usable inputs for machine learning models, we went through a few pre-processing steps. Firstly, we dropped features that are not related to the prices, such as listing id, listing name and listing link. Secondly, we used dummy variable encoding to solve the issue that some machine learning algorithms cannot work with categorical features directly. A categorical variable is a variable that assigns an observation to a specific group or nominal category on the basis of some qualitative property (Yates et al. 2003). A dummy variable is a binary variable that stores values of 0 and 1, where the former represents the absence of a category and the latter shows its presence (James H. Stock 2020, p. 186). The number of dummy variables depends on the number of different categories: a feature with K categories requires K−1 dummy variables so that the data matrix remains invertible, avoiding the dummy variable trap (James H. Stock 2020, p. 230).
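As an illustration of the K−1 encoding described above, the following sketch uses pandas `get_dummies` with `drop_first=True` on a toy frame; the column names (district, home_type, etc.) are stand-ins for the real listing fields, not the actual schema of the dataset.

```python
import pandas as pd

# Toy frame with hypothetical column names; the real dataset has more fields.
df = pd.DataFrame({
    "district":  ["Hoan Kiem", "Ba Dinh", "Tay Ho", "Ba Dinh"],
    "home_type": ["Apartment", "Villa", "Apartment", "House"],
    "guests":    [2, 6, 4, 5],
    "price":     [35.0, 120.0, 55.0, 80.0],
})

# drop_first=True keeps K-1 dummies per categorical feature,
# avoiding the dummy variable trap described above.
encoded = pd.get_dummies(df, columns=["district", "home_type"], drop_first=True)
print(encoded.head())
```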
We ended up with 78 explanatory features. As we have a limited number of listings in our dataset, we attempted to mitigate this problem by using K-Fold Cross Validation for model selection, since this method is considered to be useful when the number of records is low (Bishop 2006, p. 32).
Figure 3.2: The technique of K-Fold Cross Validation with K=4 (Bishop 2006, p. 33)
The method involves splitting the dataset into K different groups. Then K−1 groups are used to train a specific model, which is evaluated on the remaining group. This step is repeated K times until each of the K groups has been used as the test set. Finally, the performance score of a model, which is discussed in the section below, is the average of the scores from the K runs.
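A minimal sketch of this procedure with scikit-learn's `KFold` is shown below; the LASSO model and the random data are placeholders, not the actual pipeline of the study.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

# Placeholder data: X (encoded features) and y (prices) stand in for the real dataset.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)

kf = KFold(n_splits=4, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model = Lasso(alpha=0.1)
    model.fit(X[train_idx], y[train_idx])                              # train on K-1 folds
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))   # evaluate on the held-out fold

print("mean R^2 over folds:", np.mean(scores))   # the model's cross-validated score
```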
A major drawback of this technique is that it is computationally expensive, as a model is required to be trained and tested K times. This issue is critical in our case, as some machine learning algorithms have a large number of hyper-parameters with many combinations to be tested. For instance, there are more than 10 hyper-parameters to be tuned in Extreme Gradient Boosting, making it infeasible to use K-Fold Cross Validation for all of the combinations. Therefore, we only tuned some parameters supposed to have vital impacts on the model performance while leaving the others at the default values set by the package.
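The exact grid explored in this study is not listed here; the following sketch shows one possible way to tune a handful of influential XGBoost hyper-parameters with K-fold cross-validation while leaving the rest at their package defaults. The data and the grid values are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Placeholder data; X and y stand in for the encoded listings and prices.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)

# Illustrative grid over a few influential parameters; all other
# hyper-parameters stay at the package defaults.
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(XGBRegressor(objective="reg:squarederror"),
                      param_grid, cv=4, scoring="r2")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```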
3.3 Measuring Model Accuracy

In this project, we used several machine learning algorithms with different sets of parameters to build a price predictor. In order to choose the best candidate for this task, there need to be some metrics that assess how those models perform. The performance is quantified by showing how close the predicted value of a given observation is to the true value of that observation. For a regression problem, the most commonly used metric is the mean squared error (MSE) (Hastie 2017, p. 29), which is given by

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{f}(x_i)\bigr)^2 \tag{3.1}$$

where $\hat{f}(x_i)$ is the prediction for the $i$-th observation. The MSE according to the above formula is not bounded to any range. The smallest MSE is 0, the result of a model with perfect predictions, and we know that it is nearly impossible
to have that in reality. Therefore, by choosing the smallest MSE among our models, we do not know whether that model can become a practical tool for price suggestion. This is where the R² statistic comes in as an alternative measure.
The R² statistic shows the fraction of variance in the target that can be predicted using the features (James H. Stock 2020, p. 153). This metric always takes on a value between 0 and 1, where an R² near 0 indicates a model with poor accuracy, while an R² close to 1 indicates a model that is good at predicting the target. The formula of this metric is given by
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\bigl(y_i - \hat{f}(x_i)\bigr)^2}{\sum_{i=1}^{n}\bigl(y_i - \bar{y}\bigr)^2} \tag{3.2}$$

where $\bar{y}$ is the mean of the target that we try to predict. Additionally, the formula can also be rewritten as

$$R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{f}(x_i)\bigr)^2}{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \bar{y}\bigr)^2}$$

which, with regard to (3.1), we can write as

$$R^2 = 1 - \frac{\text{MSE of a model}}{\text{MSE of the mean of the data}} \tag{3.3}$$
As the MSE gets smaller towards 0, the R² gets bigger towards 1. Therefore, we can interpret the R² as a rescaling of the MSE. This is the reason we chose the R² as the main metric for model selection, as its intuitive scale is descriptively better.
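The agreement between (3.2) and (3.3) can be checked numerically; the toy values in the sketch below are purely illustrative.

```python
import numpy as np

# Toy targets and predictions purely for illustration.
y     = np.array([30.0, 55.0, 80.0, 120.0])
y_hat = np.array([35.0, 50.0, 90.0, 110.0])

mse_model = np.mean((y - y_hat) ** 2)       # (3.1) for the model
mse_mean  = np.mean((y - y.mean()) ** 2)    # (3.1) for the trivial "predict the mean" model

r2_direct = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)   # (3.2)
r2_ratio  = 1 - mse_model / mse_mean                                     # (3.3)

print(r2_direct, r2_ratio)   # identical values: R^2 is a rescaling of the MSE
```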
Chapter 4

Methods

4.1 LASSO

LASSO seeks the solution of the following problem:

$$\hat{w} = \arg\min_{w \in \mathbb{R}^{d}} \; \frac{1}{2}\lVert Xw - Y\rVert_2^2 + \lambda \lVert w\rVert_1 \tag{4.1}$$

where $\lVert\cdot\rVert_1$ and $\lVert\cdot\rVert_2$ denote the l1 and l2 norms respectively (Appendix A). The problem above is non-differentiable, so we cannot apply Gradient Descent, the common algorithm for regression models, to compute LASSO. However, there are various mathematical methods to compute the solution of the LASSO, including Coordinate Descent (Tibshirani et al. 2010) and Least Angle Regression (Efron et al. 2004), which are implemented in popular machine learning packages such as Scikit-Learn 1. Instead of using those two from a pre-written package, we attempted to write an alternative algorithm, the FISTA algorithm (Beck & Teboulle 2009), to solve the LASSO problem ourselves.
1 LASSO User Guide on Scikit-learn Document
FISTA addresses problems of the form

$$\hat{X} = \arg\min_{X \in \mathbb{R}^{d}} \{ f(X) + g(X) \} \tag{4.2}$$

where:

• $g : \mathbb{R}^{n} \mapsto \mathbb{R}$ is a continuous convex function, which is possibly non-smooth, i.e. non-differentiable;

• $f : \mathbb{R}^{n} \mapsto \mathbb{R}$ is a smooth convex function with Lipschitz continuous gradient $L(f)$:
  – Lipschitz constant: if $\lVert\nabla f(x) - \nabla f(y)\rVert \leq L(f)\lVert x - y\rVert$ for all $x, y \in \mathbb{R}^{n}$, then $L(f)$ is a Lipschitz constant of $\nabla f$.
FISTA can be used for many problems of the form (4.2), and LASSO is among the best known. Hence, we can apply this algorithm to the LASSO problem, taking $f(w) = \frac{1}{2}\lVert Xw - Y\rVert_2^2$ as the smooth part and $g(w) = \lambda\lVert w\rVert_1$ as the non-smooth part.
To find a Lipschitz constant of $\nabla f(w) = X^{T}(Xw - Y)$, we consider, for any $a, b \in \mathbb{R}^{n}$,

$$\lVert\nabla f(a) - \nabla f(b)\rVert = \lVert X^{T}(Xa - Y) - X^{T}(Xb - Y)\rVert.$$

Thus, we factorise the common term $X^{T}X$: $\lVert X^{T}X(a - b)\rVert$. Applying the norm inequality $\lVert A(a - b)\rVert \leq \lVert A\rVert\,\lVert a - b\rVert$ (Benning 2019), we find the Lipschitz constant $L(f) = \lVert X^{T}X\rVert$.
FISTA is a refined version of the Iterative Shrinkage-Thresholding Algorithm (ISTA); both methods seek the solution of the following proximal problem (Beck & Teboulle 2009):
$$p_{L}(z) = \arg\min_{w} \Bigl\{ g(w) + \frac{L}{2}\lVert w - z\rVert^{2} \Bigr\}$$
With some steps using calculus, we get this result:

$$S_{\tau}(z) = \mathrm{sign}(z)\,(|z| - \tau)_{+}$$
Combining these components, the FISTA algorithm with a constant step size is stated as below.
Algorithm 1 FISTA with constant step size
Input: L = L(f), a Lipschitz constant of ∇f; λ: regularisation parameter
Initialise: w_0 ∈ R^n, v_1 = w_0, t_1 = 1
for k = 1, ..., K − 1 do
    compute z_k = v_k − (1/L) ∇f(v_k)
    compute w_k = S_{λ/L}(z_k)
    compute t_{k+1} = (1 + √(1 + 4 t_k²)) / 2
    compute v_{k+1} = w_k + ((t_k − 1)/t_{k+1}) (w_k − w_{k−1})
end for
return w_K
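A minimal NumPy sketch of Algorithm 1 is given below, using the soft-thresholding operator S_τ and the Lipschitz constant L = ‖XᵀX‖ derived above; it is an illustration of the method rather than the exact code written for this thesis.

```python
import numpy as np

def soft_threshold(z, tau):
    """S_tau(z) = sign(z) * (|z| - tau)_+, applied componentwise."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista_lasso(X, Y, lam, n_iter=500):
    """FISTA with constant step size for 0.5*||Xw - Y||_2^2 + lam*||w||_1."""
    L = np.linalg.norm(X.T @ X, 2)          # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    v, t = w.copy(), 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ v - Y)            # gradient of the smooth part at v
        z = v - grad / L
        w_next = soft_threshold(z, lam / L)
        t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2
        v = w_next + ((t - 1) / t_next) * (w_next - w)   # momentum step
        w, t = w_next, t_next
    return w

# Tiny usage example with random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Y = X[:, :3] @ np.array([2.0, -1.5, 0.5]) + 0.1 * rng.normal(size=100)
print(fista_lasso(X, Y, lam=1.0)[:5])
```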
4.2 Random Forest

Random Forest (Breiman 2001) is an ensemble method that uses the bagging technique. The design of ensemble learning is to construct a prediction model by applying multiple machine learning algorithms in order to achieve a better predictive power than using those algorithms alone (Trevor Hastie 2009, p. 605). Bagging (or Bootstrap Aggregating) is to average the results of the models in the ensemble equally. This technique trains each model in the ensemble on a bootstrap sample, a subset that is randomly drawn from the training dataset (Trevor Hastie 2009, p. 282), thereby reducing the variance. The Random Forest algorithm contains a collection of decision trees. Its output is the class that has the most votes for a classification problem, or the mean prediction of the individual trees for regression.
Figure 4.1: Random Forest structure (Source)

The Random Forest Regression in our study operates through the following algorithm (Trevor Hastie 2009, p. 588):
Algorithm 2 Random Forest for Regression
for b = 1, ..., B do
    1. Draw a bootstrap sample Z of size N from the training data.
    2. Grow a decision tree T_b on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached:
        i. Select m variables at random from the p variables.
        ii. Pick the best variable/split-point among the m using Mean Squared Error.
        iii. Split the node into two daughter nodes.
end for
Output the ensemble of trees {T_b}_1^B
return the prediction at a new point x: f̂(x) = (1/B) Σ_{b=1}^{B} T_b(x)
The construction of the Decision Tree is referred to the original paper for more details (Breiman et al. 1984). Nonetheless, the Random Forest algorithm only uses a limited number of features, less than the total amount, selected randomly to decide the candidate split at a node. This removes the problem of the ensemble over-relying on an individual feature and makes fair use of all features, making the model more robust.
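In practice this algorithm is available as scikit-learn's RandomForestRegressor, where the random feature subsetting corresponds to the max_features parameter; the sketch below is illustrative, and the data and parameter values are assumptions rather than those used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the encoded listings and their prices.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 20)), rng.normal(size=300)

# max_features limits the number m of candidate features per split (m < p),
# which is the random feature subsetting described above.
forest = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                               min_samples_leaf=2, random_state=0)
print(cross_val_score(forest, X, y, cv=4, scoring="r2").mean())
```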
4.3 Gradient Boosting

Gradient Boosting is another form of ensemble method, one that applies the boosting technique. The boosting method combines the predictions of many weak learners to generate a powerful “committee”. Different from Random Forest, which builds a forest of decision trees simultaneously, Gradient Boosting generates trees sequentially, each of which is to improve on the errors made by the previous trees in the series. The Gradient Boosting Regression algorithm is as follows (Friedman 2002):
Algorithm 3 Gradient Boosting Regression
Initialise: f_0(x) = arg min_γ Σ_{i=1}^{N} L(y_i, γ); learning rate ν
for m = 1, ..., M do
    1. For i = 1, ..., N, compute the pseudo-residuals r_im = −[∂L(y_i, f(x_i)) / ∂f(x_i)] evaluated at f = f_{m−1}.
    2. Fit a regression tree to the targets r_im, giving terminal regions R_jm, j = 1, ..., J_m.
    3. For j = 1, ..., J_m, compute γ_jm = arg min_γ Σ_{x_i ∈ R_jm} L(y_i, f_{m−1}(x_i) + γ).
    4. Update f_m(x) = f_{m−1}(x) + ν Σ_{j=1}^{J_m} γ_jm I(x ∈ R_jm).
end for
return f̂(x) = f_M(x)
Algorithm 3 is determined by the choice of the loss function L(y, f(x)). In this study, our choice of loss criterion is the least-squares loss L(y_i, f(x_i)) = ½(y_i − f(x_i))². By optimising this function, we find the first model, which is a single terminal-node tree predicting the mean of the target y in the training set. Moreover, the negative gradient of the loss function computed in each iteration is called the pseudo residual r. The succeeding trees are built based on this paper (Breiman et al. 1984). Thus, each tree corrects the mistakes of the previous trees. The corrections are scaled by the learning rate ν to avoid the problem of high variance, increasing the robustness.
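The following NumPy sketch illustrates the boosting loop with the least-squares loss: the first model predicts the mean of the target, each subsequent tree is fitted to the pseudo-residuals y − f, and the update is scaled by ν. It uses scikit-learn's DecisionTreeRegressor as the base learner on toy data, and is not the implementation used in this study.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data; X and y stand in for the real features and prices.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=300)

nu, trees = 0.1, []
f = np.full_like(y, y.mean())                 # f_0: a single-node model predicting the mean
for m in range(100):
    r = y - f                                 # pseudo-residuals for the least-squares loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, r)
    f += nu * tree.predict(X)                 # update scaled by the learning rate
    trees.append(tree)

print("training MSE:", np.mean((y - f) ** 2))
```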
4.4 Extreme Gradient Boosting

Extreme Gradient Boosting (XGBoost) (Chen & Guestrin 2016) is another boosting method applied in this study. XGBoost is a variant of Gradient Boosting, and this technique is architected with a different algorithm for how the embedded trees are structured, such as a changed splitting method.
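For reference, a minimal way to fit such a model is through the xgboost package's XGBRegressor; the sketch below uses placeholder data and default settings apart from a few illustrative parameters, and is not the configuration used in this study.

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the encoded listings and prices.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(400, 20)), rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The squared-error objective corresponds to the Mean Squared Error loss chosen below.
model = XGBRegressor(objective="reg:squarederror", n_estimators=300, learning_rate=0.1)
model.fit(X_tr, y_tr)
print("test R^2:", r2_score(y_te, model.predict(X_te)))
```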
boost-For a given dataset with n examples and m features D = {(xi, yi}(|D| = n, xi∈
Rm, yi ∈ R), the method objective is to solve the given function:
w is a vector of scores on a leaf Here ˆyi(t) is the prediction of the i-th instance atthe t iteration and l represents the loss function to be determined We chose thecommon Mean Squared Errors as the loss function in this case
In XGBoost, a second-order Taylor approximation of the loss function is applied for computational efficiency:

$$l\bigl(y_i, \hat{y}_i^{(t-1)} + w_j\bigr) \approx l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i w_j + \frac{1}{2} h_i w_j^2$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)$ are the first- and second-order gradients of the loss function. The symbols g and h are inspired by the facts that the first-order derivative of a function is often called the Gradient and the second-order derivative is often called the Hessian, respectively.

Removing the constant $l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)$ from the approximation, the objective function becomes: