Data Analytics MSc Dissertation MTH775P, 2019/20
Disquisitiones Arithmeticæ
Predicting the prices for breakfasts and beds
Hai Nam Nguyen, ID 161136118
Supervisor: Dr Martin Benning
A thesis presented for the degree of
Master of Science in Data Analytics
School of Mathematical Sciences Queen Mary University of London
Declaration of original work
This declaration is made on August 17, 2020.

Student's Declaration: I Student Name hereby declare that the work in this thesis is my original work. I have not copied from any other students' work, work of mine submitted elsewhere, or from any other sources except where due reference or acknowledgement is made explicitly in the text, nor has any part been written for me by another person.
Referenced text has been flagged by:
1. Using italic fonts, and
2. using quotation marks “ ”, and
3. explicitly mentioning the source in the text.
This work is dedicated to my niece Nguyen Le Tue An (Mochi), who has brought a great source of joy to me and my family recently.
Abstract

Pricing and guessing the right prices are vital for both hosts and renters on home-sharing platforms run by internet-based companies. To contribute to the growing interest and immense literature on applying Artificial Intelligence to predicting rental prices, this paper attempts to build machine learning models for that purpose using the Luxstay listings in Hanoi. The R² score is used as the main criterion for model performance, and the results show that Extreme Gradient Boosting (XGB) is the model with the best performance with R² = 0.62, beating the most sophisticated machine learning model: Neural Networks.
3.1 Dataset 5
3.2 K-Fold Cross Validation 6
3.3 Measuring Model Accuracy 7
4 Methods 9
4.1 LASSO 9
4.1.1 FISTA 10
4.2 Random Forest 12
4.3 Gradient Boosting 14
4.4 Extreme Gradient Boosting 16
4.5 LightGBM 19
4.5.1 Gradient-based One-sided Sampling 20
4.5.2 Exclusive Feature Bundling 20
4.6 Neural Networks 23
4.6.1 Adam Algorithm 25
4.6.2 Backpropagation 26
A Some special mathematical notations 32
A.1 Vector Norm 32
A.2 The Hadamard product 33
Chapter 1
Introduction
Since its establishment in 2016, Luxstay has become one of the most popular platforms for home-sharing in Vietnam along with Airbnb, with a network of more than 15,000 listings. The platform connects guests' demand to rent villas, houses, and apartments to hosts and vice versa. Hence, providing a reasonable price will help hosts gain a high and stable income while guests get great experiences in new places. Therefore, working on a sensible predictor and suggestion of Luxstay prices can generate real-life value and practical application.
Hanoi is the capital of Vietnam and has the second most listings on Luxstay. The city has also been ranked in the top 10 destinations to visit by TripAdvisor. As a dynamic city with active bookings and listings, Hanoi can be a great example for the study of Luxstay pricing.
In this paper, we build a price prediction model and compare the performance of different methods using R² as the main measure. The input of our models is the data scraped from the Hanoi page of the website, which includes continuous and categorical records about listings. We then apply a number of methods, including traditional machine learning models (LASSO, random forest, gradient boosting), Extreme Gradient Boosting, LightGBM and a neural network, to predict the prices of listings.
Chapter 2
Literature Review
The sharing economy is a socio-economic system that arranges “the peer-to-peer-based activity of obtaining, giving, or sharing the access to goods and services” through “community-based online services” (J. Hamari 2015). Home-sharing is one of the sharing activities and it has experienced significant growth due to a high demand from tourism (Guttentag 2015). Given that Luxstay is a startup from an emerging economy, the platform has not received as much attention from the academic community as Airbnb, the leading company for this service (Wang & Nicolau 2017). Nevertheless, the Vietnamese home-sharing platform has some similar characteristics to Airbnb, as it is also an internet-based company that coordinates the demand of short-term renters and hosts. Therefore, it is worth conducting a review of some findings on Airbnb from recent papers.
Gibbs et al. (2017) stated that one of the biggest challenges for Airbnb was setting the right prices, and identified two key reasons for this issue. Firstly, unlike the hotel business, where prices are set by trained experts and industry benchmarks, rental prices on Airbnb are normally determined by regular hosts with limited support. Secondly, instead of letting an algorithm control prices as Uber and Lyft do, Airbnb leaves the prices for hosts to decide, given that they might not be well informed. Consequently, these two factors may lead to a potential financial loss, and empirical evidence shows that incompetent pricing causes a loss of 46% of additional revenue on Airbnb. Hence, there has been an interest in the
study of rental price prediction on the leading platform. The two trends for this topic are hedonic-based regression and artificial intelligence techniques.
The term hedonic is defined to describe “the weighting of the relative importance of various components among others in constructing an index of usefulness and desirability” (Goodman 1998). In other words, hedonic pricing identifies the factors and characteristics affecting an item's price (Investopedia.com). Wang & Nicolau (2017) aimed to design a system to understand which features are important inputs for an automated price suggestion on Airbnb using a hedonic-based regression approach. The functional forms used were Ordinary Least Squares and Quantile Regression to analyse 25 variables of 180,533 listings in 33 cities. The result shows that features related to host attributes, such as the number of their listings and the profile pictures, are the most important features. Among those, super host status, which reveals experienced hosts on the platform, is the best one. However, the authors also discussed the limitations of this analysis. The approach rests on some economic assumptions that need to be examined; the assumption of hosts' rationality requires a qualitative check which is skipped in the study. Generally, the effectiveness of hedonic-based regression for price prediction is restricted by the model assumptions and estimation (Selim 2009).
Another approach for price prediction is to apply artificial intelligence techniques, which mainly include machine learning and neural network models. Tang & Sangani (2015) produced a model for price prediction for San Francisco listings. To reduce the complexity of the task, they turned the regression problem into a classification task that predicts both the neighbourhood and the price range of a listing, with Support Vector Machine as the main model to be tuned. Uniquely, they included images as inputs for the model by creating a visual dictionary to categorise the image of a listing. The result shows that while the price prediction achieves a high accuracy on the test set at 81.2%, the neighbourhood prediction suffers from overfitting, with a big gap between the train and test sets. Alternatively, Cai & Han (2019) attempted to work on the regression problem using the listings in Melbourne. The study implemented l1 regularisation as feature selection for all traditional machine learning methods and then compared them to models without it.
The result shows that the latter perform better overall and the gradient boosting algorithm produces the best precision with R² = 0.6914 on the test set. Recently, another study of the listings in New York holds an interesting result with a highest R² of 0.7768 (Kalehbasti et al. 2019). To gain that score, they performed a logarithmic transformation on the prices and then trained their models. Additionally, they also attempted to compare three feature selection methods: manual selection, p-value and LASSO. The analysis shows that p-value and LASSO outperformed manual selection, and the best method applied in the paper is LASSO.
In this paper, we applied the knowledge of the last three studies to build our price predictor for the listings on Luxstay. Apart from widely used traditional machine learning methods and neural networks, we also attempted to code an algorithm to compute LASSO regression ourselves and used the two recent gradient boosting techniques, Extreme Gradient Boosting and LightGBM. The project worked on the original rental prices to produce a price prediction without any logarithmic transformation.
Chapter 3
Experimental Design
Figure 3.1: Example of Luxstay Listings
Our dataset of Luxstay listings was scraped using the BeautifulSoup package in Python (Richardson 2007). It includes 2,675 listings posted in Hanoi on 27 December 2019. Each listing contains fields describing the offered price (in dollars),
district, type of home, name of its building, and the numbers of guests allowed, bedrooms and bathrooms.
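The scraping code itself is not reproduced in this thesis; the snippet below is only a minimal sketch of how such a scraper could look with BeautifulSoup, where the URL and the CSS class names (`listing-card`, `price`) are hypothetical placeholders rather than the selectors actually used for Luxstay.

```python
# Illustrative scraping sketch: URL and class names are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    """Download one results page and extract basic fields from each listing card."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for card in soup.find_all("div", class_="listing-card"):   # hypothetical class name
        name_tag = card.find("h3")
        price_tag = card.find("span", class_="price")           # hypothetical class name
        listings.append({
            "name": name_tag.get_text(strip=True) if name_tag else None,
            "price": price_tag.get_text(strip=True) if price_tag else None,
        })
    return listings

# Example call with a placeholder URL.
rows = scrape_page("https://www.luxstay.com/vi/hanoi?page=1")
```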
In order to make the dataset into usable inputs for machine learning models, we went through a few pre-processing steps. Firstly, we dropped features that are not related to the prices, such as listing id, listing name and listing link. Secondly, we used dummy variable encoding to solve the issue that some machine learning algorithms cannot work with categorical features directly. A categorical variable is a variable that assigns an observation to a specific group or nominal category on the basis of some qualitative property (Yates et al. 2003). A dummy variable is a binary variable that stores values of 0 and 1, where the former represents the absence of a category and the latter shows its presence (James H. Stock 2020, p. 186). The number of dummy variables depends on the number of different categories: a feature with K categories requires K−1 dummy variables so that the data matrix remains invertible, avoiding the dummy variable trap (James H. Stock 2020, p. 230).
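As an illustration of the K−1 encoding described above, the following sketch uses pandas `get_dummies` with `drop_first=True` on a toy frame; the column names (district, home_type, etc.) are stand-ins for the real listing fields, not the actual schema of the dataset.

```python
import pandas as pd

# Toy frame with hypothetical column names; the real dataset has more fields.
df = pd.DataFrame({
    "district":  ["Hoan Kiem", "Ba Dinh", "Tay Ho", "Ba Dinh"],
    "home_type": ["Apartment", "Villa", "Apartment", "House"],
    "guests":    [2, 6, 4, 5],
    "price":     [35.0, 120.0, 55.0, 80.0],
})

# drop_first=True keeps K-1 dummies per categorical feature,
# avoiding the dummy variable trap described above.
encoded = pd.get_dummies(df, columns=["district", "home_type"], drop_first=True)
print(encoded.head())
```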
We ended up with 78 explanatory features. As we have a limited number of listings in our dataset, we attempted to mitigate this problem by using K-Fold Cross Validation for model selection, since this method is considered to be useful when the number of records is low (Bishop 2006, p. 32).
Figure 3.2: The technique of K-Fold Cross Validation with K=4 (Bishop 2006, p. 33)
The method involves splitting the dataset into K different groups. Then K−1 groups are used to train a specific model, which is evaluated on the remaining group. This step is repeated K times until each of the K groups has been used as the test set. Finally, the performance score of a model, which is discussed in the section below, is the average of the scores from the K runs.
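A minimal sketch of this procedure with scikit-learn's `KFold` is shown below; the LASSO model and the random data are placeholders, not the actual pipeline of the study.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

# Placeholder data: X (encoded features) and y (prices) stand in for the real dataset.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)

kf = KFold(n_splits=4, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model = Lasso(alpha=0.1)
    model.fit(X[train_idx], y[train_idx])                              # train on K-1 folds
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))   # evaluate on the held-out fold

print("mean R^2 over folds:", np.mean(scores))   # the model's cross-validated score
```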
A major drawback of this technique is that it is computationally expensive, as a model is required to be trained and tested K times. This issue is critical in our case, as some machine learning algorithms have a large number of hyper-parameters with many combinations to be tested. For instance, there are more than 10 hyper-parameters to be tuned in Extreme Gradient Boosting, making it infeasible to use K-Fold Cross Validation for all of the combinations. Therefore, we only tuned some parameters supposed to have vital impacts on the model performance while leaving the others at the default values set by the package.
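The exact grid explored in this study is not listed here; the following sketch shows one possible way to tune a handful of influential XGBoost hyper-parameters with K-fold cross-validation while leaving the rest at their package defaults. The data and the grid values are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Placeholder data; X and y stand in for the encoded listings and prices.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)

# Illustrative grid over a few influential parameters; all other
# hyper-parameters stay at the package defaults.
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(XGBRegressor(objective="reg:squarederror"),
                      param_grid, cv=4, scoring="r2")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```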
3.3 Measuring Model Accuracy

In this project, we used several machine learning algorithms with different sets of parameters to build a price predictor. In order to choose the best candidate for this task, there need to be some metrics that assess how those models perform. The performance is quantified by showing how close the predicted value of a given observation is to the true value of that observation. For a regression problem, the most commonly used metric is the mean squared error (MSE) (Hastie 2017, p. 29), which is given by

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{f}(x_i)\bigr)^2 \tag{3.1}$$

where $\hat{f}(x_i)$ is the prediction for the $i$-th observation. The MSE according to the above formula is not bounded to any range. The smallest MSE is 0, the result of a model with perfect predictions, and we know that it is nearly impossible
to have that in reality. Therefore, by choosing the smallest MSE among our models, we do not know whether that model can become a practical tool for price suggestion. This is where the R² statistic comes in as an alternative measure.
The R² statistic shows the fraction of variance in the target that can be predicted using the features (James H. Stock 2020, p. 153). This metric always takes on a value between 0 and 1, where an R² near 0 indicates a model with poor accuracy, while an R² close to 1 indicates a model that is good at predicting the target. The formula of this metric is given by
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\bigl(y_i - \hat{f}(x_i)\bigr)^2}{\sum_{i=1}^{n}\bigl(y_i - \bar{y}\bigr)^2} \tag{3.2}$$

where $\bar{y}$ is the mean of the target that we try to predict. Additionally, the formula can also be rewritten as

$$R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{f}(x_i)\bigr)^2}{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \bar{y}\bigr)^2}$$

which, with regard to (3.1), we can write as

$$R^2 = 1 - \frac{\text{MSE of a model}}{\text{MSE of the mean of the data}} \tag{3.3}$$
As the MSE gets smaller towards 0, the R² gets bigger towards 1. Therefore, we can interpret the R² as a rescaling of the MSE. This is the reason we chose the R² as the main metric for model selection, as its intuitive scale is descriptively better.
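The agreement between (3.2) and (3.3) can be checked numerically; the toy values in the sketch below are purely illustrative.

```python
import numpy as np

# Toy targets and predictions purely for illustration.
y     = np.array([30.0, 55.0, 80.0, 120.0])
y_hat = np.array([35.0, 50.0, 90.0, 110.0])

mse_model = np.mean((y - y_hat) ** 2)       # (3.1) for the model
mse_mean  = np.mean((y - y.mean()) ** 2)    # (3.1) for the trivial "predict the mean" model

r2_direct = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)   # (3.2)
r2_ratio  = 1 - mse_model / mse_mean                                     # (3.3)

print(r2_direct, r2_ratio)   # identical values: R^2 is a rescaling of the MSE
```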
Chapter 4

Methods

4.1 LASSO

LASSO seeks the solution of the following problem:

$$\hat{w} = \arg\min_{w \in \mathbb{R}^{d}} \; \frac{1}{2}\lVert Xw - Y\rVert_2^2 + \lambda \lVert w\rVert_1 \tag{4.1}$$

where $\lVert\cdot\rVert_1$ and $\lVert\cdot\rVert_2$ denote the l1 and l2 norms respectively (Appendix A). The problem above is non-differentiable, so we cannot apply Gradient Descent, the common algorithm for regression models, to compute LASSO. However, there are various mathematical methods to compute the solution of the LASSO, including Coordinate Descent (Tibshirani et al. 2010) and Least Angle Regression (Efron et al. 2004), which are implemented in popular machine learning packages such as Scikit-Learn 1. Instead of using those two from a pre-written package, we attempted to write an alternative algorithm, the FISTA algorithm (Beck & Teboulle 2009), to solve the LASSO problem ourselves.
1 LASSO User Guide on Scikit-learn Document
FISTA addresses problems of the form

$$\hat{X} = \arg\min_{X \in \mathbb{R}^{d}} \{ f(X) + g(X) \} \tag{4.2}$$

where:

• $g : \mathbb{R}^{n} \mapsto \mathbb{R}$ is a continuous convex function, which is possibly non-smooth, i.e. non-differentiable;

• $f : \mathbb{R}^{n} \mapsto \mathbb{R}$ is a smooth convex function with Lipschitz continuous gradient $L(f)$:
  – Lipschitz constant: if $\lVert\nabla f(x) - \nabla f(y)\rVert \leq L(f)\lVert x - y\rVert$ for all $x, y \in \mathbb{R}^{n}$, then $L(f)$ is a Lipschitz constant of $\nabla f$.
FISTA can be used for many problems of the form (4.2), and LASSO is among the best known. Hence, we can apply this algorithm to the LASSO problem, taking $f(w) = \frac{1}{2}\lVert Xw - Y\rVert_2^2$ as the smooth part and $g(w) = \lambda\lVert w\rVert_1$ as the non-smooth part.
To find a Lipschitz constant of $\nabla f(w) = X^{T}(Xw - Y)$, we consider, for any $a, b \in \mathbb{R}^{n}$,

$$\lVert\nabla f(a) - \nabla f(b)\rVert = \lVert X^{T}(Xa - Y) - X^{T}(Xb - Y)\rVert.$$

Thus, we factorise the common term $X^{T}X$: $\lVert X^{T}X(a - b)\rVert$. Applying the norm inequality $\lVert A(a - b)\rVert \leq \lVert A\rVert\,\lVert a - b\rVert$ (Benning 2019), we find the Lipschitz constant $L(f) = \lVert X^{T}X\rVert$.
FISTA is a refined version of the Iterative Shrinkage-Thresholding Algorithm (ISTA); both methods seek the solution of the following proximal problem (Beck & Teboulle 2009):
$$p_{L}(z) = \arg\min_{w} \Bigl\{ g(w) + \frac{L}{2}\lVert w - z\rVert^{2} \Bigr\}$$
With some steps using calculus, we get this result:

$$S_{\tau}(z) = \mathrm{sign}(z)\,(|z| - \tau)_{+}$$
Combining these components, the FISTA algorithm with a constant step size is stated as below.
Algorithm 1 FISTA with constant step size
Input: L = L(f), a Lipschitz constant of ∇f; λ: regularisation parameter
Initialise: w_0 ∈ R^n, v_1 = w_0, t_1 = 1
for k = 1, ..., K − 1 do
    compute z_k = v_k − (1/L) ∇f(v_k)
    compute w_k = S_{λ/L}(z_k)
    compute t_{k+1} = (1 + √(1 + 4 t_k²)) / 2
    compute v_{k+1} = w_k + ((t_k − 1)/t_{k+1}) (w_k − w_{k−1})
end for
return w_K
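A minimal NumPy sketch of Algorithm 1 is given below, using the soft-thresholding operator S_τ and the Lipschitz constant L = ‖XᵀX‖ derived above; it is an illustration of the method rather than the exact code written for this thesis.

```python
import numpy as np

def soft_threshold(z, tau):
    """S_tau(z) = sign(z) * (|z| - tau)_+, applied componentwise."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista_lasso(X, Y, lam, n_iter=500):
    """FISTA with constant step size for 0.5*||Xw - Y||_2^2 + lam*||w||_1."""
    L = np.linalg.norm(X.T @ X, 2)          # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    v, t = w.copy(), 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ v - Y)            # gradient of the smooth part at v
        z = v - grad / L
        w_next = soft_threshold(z, lam / L)
        t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2
        v = w_next + ((t - 1) / t_next) * (w_next - w)   # momentum step
        w, t = w_next, t_next
    return w

# Tiny usage example with random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Y = X[:, :3] @ np.array([2.0, -1.5, 0.5]) + 0.1 * rng.normal(size=100)
print(fista_lasso(X, Y, lam=1.0)[:5])
```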
4.2 Random Forest

Random Forest (Breiman 2001) is an ensemble method that uses the bagging technique. The design of ensemble learning is to construct a prediction model by applying multiple machine learning algorithms in order to achieve a better predictive power than using those algorithms alone (Trevor Hastie 2009, p. 605). Bagging (or Bootstrap Aggregating) is to average the results of the models in the ensemble equally. This technique trains each model in the ensemble on a bootstrap sample, a subset that is randomly drawn from the training dataset (Trevor Hastie 2009, p. 282), thereby reducing the variance. The Random Forest algorithm contains a collection of decision trees. Its output is the class that has the most votes for a classification problem, or the mean prediction of the individual trees for regression.
Figure 4.1: Random Forest structure (Source)

The Random Forest Regression in our study operates through the following algorithm (Trevor Hastie 2009, p. 588):
Algorithm 2 Random Forest for Regression
for b = 1, ..., B do
    1. Draw a bootstrap sample Z of size N from the training data.
    2. Grow a decision tree T_b on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached:
        i. Select m variables at random from the p variables.
        ii. Pick the best variable/split-point among the m using Mean Squared Error.
        iii. Split the node into two daughter nodes.
end for
Output the ensemble of trees {T_b}_1^B
return the prediction at a new point x: f̂(x) = (1/B) Σ_{b=1}^{B} T_b(x)
The construction of the Decision Tree is referred to the original paper for more details (Breiman et al. 1984). Nonetheless, the Random Forest algorithm only uses a limited number of features, less than the total amount, selected randomly to decide the candidate split at a node. This removes the problem of the ensemble over-relying on an individual feature and makes fair use of all features, making the model more robust.
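In practice this algorithm is available as scikit-learn's RandomForestRegressor, where the random feature subsetting corresponds to the max_features parameter; the sketch below is illustrative, and the data and parameter values are assumptions rather than those used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the encoded listings and their prices.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 20)), rng.normal(size=300)

# max_features limits the number m of candidate features per split (m < p),
# which is the random feature subsetting described above.
forest = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                               min_samples_leaf=2, random_state=0)
print(cross_val_score(forest, X, y, cv=4, scoring="r2").mean())
```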
4.3 Gradient Boosting

Gradient Boosting is another form of ensemble method, one that applies the boosting technique. The boosting method combines the predictions of many weak learners to generate a powerful “committee”. Different from Random Forest, which builds a forest of decision trees simultaneously, Gradient Boosting generates trees sequentially, each of which is to improve on the errors made by the previous trees in the series. The Gradient Boosting Regression algorithm is as follows (Friedman 2002):
Algorithm 3 Gradient Boosting Regression
Initialise: f_0(x) = arg min_γ Σ_{i=1}^{N} L(y_i, γ); learning rate ν
for m = 1, ..., M do
    1. For i = 1, ..., N, compute the pseudo-residuals r_im = −[∂L(y_i, f(x_i)) / ∂f(x_i)] evaluated at f = f_{m−1}.
    2. Fit a regression tree to the targets r_im, giving terminal regions R_jm, j = 1, ..., J_m.
    3. For j = 1, ..., J_m, compute γ_jm = arg min_γ Σ_{x_i ∈ R_jm} L(y_i, f_{m−1}(x_i) + γ).
    4. Update f_m(x) = f_{m−1}(x) + ν Σ_{j=1}^{J_m} γ_jm I(x ∈ R_jm).
end for
return f̂(x) = f_M(x)
Algorithm 3 is determined by the choice of the loss function L(y, f(x)). In this study, our choice of loss criterion is the least-squares loss L(y_i, f(x_i)) = ½(y_i − f(x_i))². By optimising this function, we find the first model, which is a single terminal-node tree predicting the mean of the target y in the training set. Moreover, the negative gradient of the loss function computed in each iteration is called the pseudo residual r. The succeeding trees are built based on this paper (Breiman et al. 1984). Thus, each tree corrects the mistakes of the previous trees. The corrections are scaled by the learning rate ν to avoid the problem of high variance, increasing the robustness.
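The following NumPy sketch illustrates the boosting loop with the least-squares loss: the first model predicts the mean of the target, each subsequent tree is fitted to the pseudo-residuals y − f, and the update is scaled by ν. It uses scikit-learn's DecisionTreeRegressor as the base learner on toy data, and is not the implementation used in this study.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data; X and y stand in for the real features and prices.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=300)

nu, trees = 0.1, []
f = np.full_like(y, y.mean())                 # f_0: a single-node model predicting the mean
for m in range(100):
    r = y - f                                 # pseudo-residuals for the least-squares loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, r)
    f += nu * tree.predict(X)                 # update scaled by the learning rate
    trees.append(tree)

print("training MSE:", np.mean((y - f) ** 2))
```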
4.4 Extreme Gradient Boosting

Extreme Gradient Boosting (XGBoost) (Chen & Guestrin 2016) is another boosting method applied in this study. XGBoost is a variant of Gradient Boosting, and this technique is architected with a different algorithm for how the embedded trees are structured, such as a changed splitting method.
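For reference, a minimal way to fit such a model is through the xgboost package's XGBRegressor; the sketch below uses placeholder data and default settings apart from a few illustrative parameters, and is not the configuration used in this study.

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the encoded listings and prices.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(400, 20)), rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The squared-error objective corresponds to the Mean Squared Error loss chosen below.
model = XGBRegressor(objective="reg:squarederror", n_estimators=300, learning_rate=0.1)
model.fit(X_tr, y_tr)
print("test R^2:", r2_score(y_te, model.predict(X_te)))
```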
boost-For a given dataset with n examples and m features D = {(xi, yi}(|D| = n, xi∈
Rm, yi ∈ R), the method objective is to solve the given function:
w is a vector of scores on a leaf Here ˆyi(t) is the prediction of the i-th instance atthe t iteration and l represents the loss function to be determined We chose thecommon Mean Squared Errors as the loss function in this case
In XGBoost, a second-order Taylor approximation of the loss function is applied for computational efficiency:

$$l\bigl(y_i, \hat{y}_i^{(t-1)} + w_j\bigr) \approx l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i w_j + \frac{1}{2} h_i w_j^2$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)$ are the first- and second-order gradients of the loss function. The symbols g and h are inspired by the facts that the first-order derivative of a function is often called the Gradient and the second-order derivative is often called the Hessian, respectively.

Removing the constant $l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)$ from the approximation, the objective function becomes: