PROJECT REPORT_Hotel customer reviews analysis ( Full)



HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY


PROJECT REPORT Hotel customer reviews analysis

Instructor: Tran Viet Trung

Group: 16

Member: Ngo Quang Viet 20194881

Yannic Elias Hanel 2023T039

Ayoub Ala Mostafa 2023T011

Hà Nội, 2023


Table of contents

Introduction

1 Data preparation

1.1 Business analysis

1.2 Data collection

1.3 Data understanding

2 Data cleaning and preprocessing

2.1 Handling missing values and duplications

2.2 Datatypes

2.3 One-hot encoding

2.4 Cleaning text data

3 Exploratory Data Analysis

3.1 Univariate analysis

3.2 Bivariate analysis

3.3 Multivariate analysis

3.4 Text analysis

3.4.1 Text length

3.4.2 Common words

3.4.3 Sentiment analysis

4 Sentiment Classification with Machine Learning model

4.1 Overview

4.2 Prepare data

4.3 Feature extraction

4.3.1 BoW features

4.3.2 TF-IDF features

4.4 Model training

4.5 Model evaluation

Conclusion


Introduction

In an era where digital travel planning is becoming increasingly important, the analysis of hotel reviews is becoming more and more relevant. This project addresses the challenge of gaining valuable insights from an abundance of hotel reviews. By automatically collecting around 20,000 individual reviews via the Google Travel website, we provide a comprehensive insight into the needs and expectations of travelers.

The immense amount of information available, especially in the form of unstructured text data from hotel reviews, holds a potential that has so far remained largely untapped. This data contains valuable information about the customer experience, the quality of services and the strengths and weaknesses of hotels. In order to fully exploit this potential, we have supplemented the automatic data collection with advanced text understanding models. The main goal of this project is not only to collect data, but to understand it in depth and breadth. By applying advanced text comprehension models and comprehensive data analysis, we aim to paint a clear picture of customer reviews. This precise interpretation will allow companies not only to capture the tone of their customers, but also to derive concrete actions to optimize their services and maintain a positive customer experience.

For companies in the tourism sector, the value of this project lies in the possibility of gaining in-depth insights into customer opinions. By precisely analyzing the collected data, companies can emphasize their strengths, address weaknesses and make targeted improvements. This serves not only as a response to past reviews, but also as a proactive approach to future customer expectations. At a time when customer loyalty is heavily influenced by online reviews, this project offers businesses the opportunity to strengthen their online reputation and gain a clear competitive advantage. By understanding the data collected, companies can deploy their resources more effectively and continuously adapt their services to the needs of their customers.

In the following sections, we first describe our data preparation and collection process, including a detailed analysis of the user rating scheme. We then give a comprehensive overview of our data cleansing procedures, including the more detailed cleansing of review texts. Finally, we provide an in-depth analysis of the cleansed data, supported by informative visuals. As an added feature, we conclude the report with a sentiment analysis for the ratings, which provides additional insight into user feedback.


1 Data preparation

1.1 Business analysis

This data science project centers on mining hotel reviews to discern the factors that resonate with customers. The goal extends beyond identifying surface-level preferences; it encompasses the finer details that shape a guest's perception.

Rather than just cataloging customer likes, the focus is on unraveling the underlying reasons behind their preferences. It goes beyond recognizing, for instance, a fondness for comfortable beds, and seeks to comprehend how such seemingly small details contribute to an overall positive guest experience.

The significance of this initiative transcends individual hotels; it aspires to provide valuable insights to the broader hospitality industry. The findings serve as a strategic tool, offering foresight into emerging trends and ensuring that hotels are not just keeping pace but staying ahead in meeting evolving guest expectations. They act as a strategic guide, empowering hotels to proactively enhance guest satisfaction and continually elevate their standards.

1.2 Data collection

For this analysis, we collect hotel information from 200 hotels across 5 locations (Birmingham, Edinburgh, Liverpool, London and Manchester), as well as their users’ reviews: 100 for each hotel. All hotel information and reviews are collected from Google Travel.

The hotel information schema is as follows:

| Field | Type | Description | Nullable |
|---|---|---|---|
| source | str | Website from which the hotel information was collected (in this case, only “Google”) | No |
| images_count | int | Number of photos submitted by the hotel’s owner | No |
| popular_amenities | list[str] | List of popular amenities | No |

The users’ review schema is as follows:

| Field | Type | Description | Nullable |
|---|---|---|---|
| rating | float | Rating score (scale of 5) | No |
| review_timestamp | datetime | Timestamp when the review was … | |
| trip_companions | str | With whom the reviewer traveled (possible values: “Family”, “Friends”, “Couple”, “Solo”) | Yes |

The crawling workflow is as follows:

- First, crawl a list of URLs to each hotel’s details page by executing the `get_hotel_list.py` script:

    python3 get_hotel_list.py [-h] --limit LIMIT [--output OUTPUT] [--headless | --no-headless] locations [locations ...]

List of parameters:

| Parameter | Required? | Description |
|---|---|---|
| `--limit LIMIT` | Yes | Maximum number of hotels per location |
| `--output OUTPUT` | No | Path to write the URL list (defaults to stdout, i.e. the console, if not specified) |
| `--headless`, `--no-headless` | No | Whether to run the script with a headless browser; disabling headless mode is suitable for debugging. Defaults to headless mode |
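The usage line above has the shape that Python's `argparse` module generates. As a minimal sketch, and not the project's actual source, a parser with the same options could be declared as follows. The `type=int` on `--limit`, the `None` default for `--output`, and the function name `build_parser` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options listed for get_hotel_list.py."""
    parser = argparse.ArgumentParser(prog="get_hotel_list.py")
    # Required option: maximum number of hotels to collect per location.
    parser.add_argument("--limit", type=int, required=True)
    # Optional output path; None is used here to mean "write to stdout".
    parser.add_argument("--output", default=None)
    # BooleanOptionalAction (Python 3.9+) creates both --headless and --no-headless.
    parser.add_argument("--headless", action=argparse.BooleanOptionalAction, default=True)
    # One or more location names.
    parser.add_argument("locations", nargs="+")
    return parser


if __name__ == "__main__":
    # Parse a fixed argument list so the example runs without a real command line.
    args = build_parser().parse_args(["--limit", "40", "--no-headless", "London", "Manchester"])
    print(args.limit, args.output, args.headless, args.locations)
```

Declaring `--headless` with a default of `True` keeps unattended runs headless while still allowing `--no-headless` when a visible browser helps with debugging.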

- Then, from the list of URLs acquired in the previous step, we collect hotel details and reviews in separate flows: one for hotel details and one for hotel reviews.

To retrieve a list of hotel details (in CSV format), use the `get_hotel_details.py` script:

    python3 get_hotel_details.py [-h] [--output OUTPUT] [--headless | --no-headless] input

List of parameters:

| Parameter | Required? | Description |
|---|---|---|
| `-h` | No | Displays the help message and exits |
| `input` | Yes | Path to the file containing the list of URLs collected in the previous step |
| `--output OUTPUT` | No | Path to the output file; defaults to a CSV file named with the timestamp at which the script starts |
| `--headless`, `--no-headless` | No | Same as for `get_hotel_list.py` |

If your computer/runner is powerful enough, it is advisable to run a few batches at a time; the processes should not interfere with one another.

To retrieve a list of hotel reviews (in CSV format), use the `get_hotel_reviews.py` script:

    python3 get_hotel_reviews.py [-h] [--output OUTPUT] --limit LIMIT [--headless | --no-headless] input

List of parameters:

| Parameter | Required? | Description |
|---|---|---|
| `-h` | No | Displays the help message and exits |
| `input` | Yes | Path to the file containing the list of URLs collected in the previous step |
| `--output OUTPUT` | No | Path to the output file; defaults to a CSV file named with the timestamp at which the script starts |
| `--limit LIMIT` | Yes | Maximum number of reviews per hotel |
| `--headless`, `--no-headless` | No | Same as for `get_hotel_list.py` |

1.3 Data understanding

In the data understanding phase of this project, our focus is on gaining insights into the collected data, particularly the hotel reviews. This involves exploring, examining, and comprehending the structure and content of the dataset. The primary objectives are to identify patterns, trends, and potential challenges within the data, paving the way for more informed analysis and interpretation.

Overview of Ratings Distribution:

We begin by examining the distribution of ratings across all hotel reviews. Understanding the distribution helps us identify whether there is a skew towards positive or negative sentiments.

Text Length Analysis:

Analyzing the length of review texts can provide insights into customers' engagement levels. We explore the distribution of text lengths to understand whether there is a correlation between review length and the assigned rating.

Hotel-wise Analysis:

We conduct a detailed analysis of ratings, review lengths, and sentiments for each hotel individually. This allows us to identify specific patterns and variations unique to each hotel.

Sentiment Analysis:

Utilizing sentiment analysis models, we aim to categorize each review as positive, negative, or neutral. This step is crucial in understanding the overall sentiment of customers towards the hotels.

Topic Modeling:

Applying topic modeling techniques, we extract key themes and topics present in the reviews. This helps in understanding the major factors influencing customer opinions.

Handling Unstructured Text:

Given that the data primarily consists of unstructured text, we address challenges related to natural language processing (NLP), including tokenization, stemming, and lemmatization.
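The text length analysis mentioned above can be sketched in a few lines of pandas. This is only an illustration on invented toy reviews; the column names `rating` and `review_text` follow the report, while `text_length` and the choice of counting words rather than characters are assumptions.

```python
import pandas as pd

# Toy reviews, invented for illustration only (not taken from the project dataset).
reviews = pd.DataFrame({
    "rating": [5.0, 4.0, 1.0, 2.0, 5.0],
    "review_text": [
        "Great stay",
        "Nice room and friendly staff",
        "Terrible. The room was dirty, the staff were rude and the heating did not work at all",
        "Noisy at night and the breakfast was cold",
        "Lovely",
    ],
})

# Length of each review in words.
reviews["text_length"] = reviews["review_text"].str.split().str.len()

# Pearson correlation between review length and rating.
length_rating_corr = reviews["text_length"].corr(reviews["rating"])
print(reviews[["rating", "text_length"]])
print(f"correlation: {length_rating_corr:.2f}")
```

On this toy sample the correlation is negative because the invented low ratings come with longer texts; whether the same holds for the collected reviews has to be checked on the real data.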


2 Data cleaning and preprocessing

2.1 Handling missing values and duplications

The table below shows the count and percentage of null values in the columns of the dataset. The trip_type column has a significant number of missing values, roughly 49.83% of the data. The trip_companions column also has a high percentage of null values, around 45.51%.

Given that almost half of the data for trip_type is missing, the strategy for handling these null values is crucial. Removing such a large portion of the dataset is not advisable, as it would result in a significant loss of data. Imputation would also be challenging unless there are strong predictors for trip type within the data. One-hot encoding the trip_type column with an additional category for nulls is therefore the most suitable approach here: it allows us to retain all the data and treat the missing values as a separate unknown category.

For trip_companions, the approach is similar due to the high percentage of missing values. Including an unknown category can be beneficial for any predictive modeling or analysis, as it lets the model account for the fact that the information is missing, which might itself be a pattern of interest.

The review_text column has a relatively small percentage of missing values (0.166%), so it can be handled differently from the trip_type and trip_companions columns with their substantially higher percentages of missing data. Since we plan to perform text analysis, especially sentiment analysis, the quality and completeness of the text data are paramount, so we decided to drop all records that have null values in this column.

Moreover, there are no duplicated records in the dataset.
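The checks described in this section can be sketched with pandas as follows. The data frame is a toy example invented for illustration; only the column names come from the report, and the variable names `data` and `null_report` are assumptions.

```python
import pandas as pd

# Toy data, invented for illustration; column names follow the report.
data = pd.DataFrame({
    "rating": [5.0, 3.0, 4.0, 1.0],
    "review_text": ["Great stay", None, "Good location", "Too noisy"],
    "trip_type": ["Vacation", None, None, "Business"],
    "trip_companions": ["Couple", "Family", None, None],
})

# Count and percentage of nulls per column.
null_report = pd.DataFrame({
    "null_count": data.isna().sum(),
    "null_percent": (data.isna().mean() * 100).round(2),
})
print(null_report)

# Drop the few records without review text.
data = data.dropna(subset=["review_text"]).reset_index(drop=True)

# Keep rows with missing trip information by mapping nulls to an explicit category.
data[["trip_type", "trip_companions"]] = data[["trip_type", "trip_companions"]].fillna("unknown")

# Number of fully duplicated rows.
print("duplicates:", data.duplicated().sum())
```

Filling with the literal string "unknown" before encoding is one way to obtain the separate unknown category; the authors may have produced it differently.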

2.2 Datatypes

We use the `data.dtypes` attribute to list the data type of each column in the dataset, as shown in the figure below.

The rating column is of type float64, indicating numerical values with decimal points, which is common for rating data. The images_count column is an integer (int64), which is appropriate for count data.

Other columns are of type object, which typically means they hold strings or mixed types. It is worth noting that the `review_timestamp` column is also listed as an `object`, indicating that it has not been interpreted as a date or time format by Pandas. Converting review_timestamp to a datetime type is a necessary step for any time-series analysis, as it enables functions such as resampling, time-based indexing, and extracting components of the date like the month or day of the week. We therefore convert the review_timestamp column from object to datetime, which makes temporal analyses and visualizations much easier and more efficient.
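A minimal sketch of this conversion with `pandas.to_datetime` is given here. The timestamps are invented, and the actual string format in the collected data may differ; the derived columns `month` and `day_of_week` are hypothetical names used only for this example.

```python
import pandas as pd

# Toy timestamps, invented for illustration; the real string format in the dataset may differ.
data = pd.DataFrame({
    "rating": [5.0, 3.0, 4.0],
    "review_timestamp": ["2023-01-15 10:30:00", "2023-01-20 08:00:00", "2023-03-02 17:45:00"],
})
print(data.dtypes)  # review_timestamp starts out as a plain string column

# Parse the strings into datetime64 values; errors="coerce" turns unparseable entries into NaT.
data["review_timestamp"] = pd.to_datetime(data["review_timestamp"], errors="coerce")

# Date components become available through the .dt accessor.
data["month"] = data["review_timestamp"].dt.month
data["day_of_week"] = data["review_timestamp"].dt.day_name()

# Time-based resampling: mean rating per calendar month (months without reviews give NaN).
monthly_mean = data.set_index("review_timestamp")["rating"].resample("MS").mean()
print(monthly_mean)
```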

2.3 One-hot encoding

One-hot encoding is a common technique used in data preprocessing to convert categorical data into a numerical format that can be provided to machine learning algorithms. We perform one-hot encoding on three categorical columns: trip_type, trip_companions, and popular_amenities.

The trip_type column is one-hot encoded into three columns: type_Business, type_Vacation, and type_unknown. There were originally two known types of trips (Business and Vacation), and an additional column has been created for the unknown type, which represents the null values in the original data as discussed earlier.

The trip_companions column has been transformed into five columns: companions_Couple, companions_Family, companions_Friends, companions_Solo, and companions_unknown. There were four known categories for trip companions and, as with trip_type, an additional category has been added to represent unknown or missing values.
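The encoding of the two trip columns can be sketched with `pandas.get_dummies`. This is an illustration on invented rows, assuming nulls are first replaced by the string "unknown"; the prefixes `type` and `companions` are taken from the column names listed above.

```python
import pandas as pd

# Toy data, invented for illustration.
data = pd.DataFrame({
    "trip_type": ["Vacation", None, "Business"],
    "trip_companions": ["Couple", "Solo", None],
})

# Treat missing values as their own category before encoding.
filled = data[["trip_type", "trip_companions"]].fillna("unknown")

# One indicator column per category, prefixed as in the report (type_*, companions_*).
encoded = pd.get_dummies(
    filled,
    columns=["trip_type", "trip_companions"],
    prefix=["type", "companions"],
    dtype=int,
)
print(encoded.columns.tolist())
```

Because the toy rows only contain the Couple and Solo companions, fewer companions_* columns appear here than in the full dataset.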

The popular_amenities column has also been one-hot encoded into several columns, each representing a particular amenity such as Air conditioning, Airport shuttle, Breakfast, and so on. Since this column is not a simple categorical value but a list stored as a string, several operations are needed to convert it into a one-hot representation:

1. Clean the popular_amenities data by removing list delimiters.
2. Convert the list of amenities into one-hot encoded format.
3. Remove any leading or trailing spaces from the new column names.
4. Combine any duplicate columns resulting from similar amenities.
5. Merge the one-hot encoded amenities back into the original dataset.
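The five steps above can be sketched with pandas string methods. The rows and the `name` column are invented for illustration, and the exact string representation of the list in the real data is an assumption (a Python-style list literal is used here).

```python
import pandas as pd

# Toy data, invented for illustration; the list is stored as a string, as described above.
data = pd.DataFrame({
    "name": ["Hotel A", "Hotel B"],
    "popular_amenities": ["['Air conditioning', ' Breakfast']", "['Breakfast', 'Airport shuttle']"],
})

# 1. Remove list delimiters (brackets and quotes).
cleaned = data["popular_amenities"].str.replace(r"[\[\]'\"]", "", regex=True)

# 2. Split on commas and one-hot encode each amenity.
amenities = cleaned.str.get_dummies(sep=",")

# 3. Strip leading/trailing spaces from the new column names.
amenities.columns = amenities.columns.str.strip()

# 4. Combine columns that now share a name (e.g. " Breakfast" and "Breakfast").
amenities = amenities.T.groupby(level=0).max().T

# 5. Merge the encoded amenities back into the original dataset.
data = pd.concat([data.drop(columns="popular_amenities"), amenities], axis=1)
print(data)
```

Taking the maximum in step 4 means a hotel is flagged for an amenity if any of the duplicate columns had it set.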

2.4 Cleaning text data

We implemented a Cleaner class with different methods for cleaning the review text in the dataset. Each of them is discussed below.

remove_usernames: This method is designed to eliminate potential privacy concerns and irrelevant information by removing user mentions that could be present in the review text. These mentions are typically prefixed with an '@' symbol and are not useful for analyzing the sentiment or content of the review itself.

clean_text: By removing digits and special characters, this method focuses on the textual content of the reviews. It helps standardize the text for analysis by ensuring that only alphabetic characters are considered, which is particularly useful for NLP tasks where numerical values and special characters are often noise.

lower_text: Converting text to lowercase is a fundamental step in text normalization, reducing the complexity of the text data and ensuring that words are treated the same regardless of their case (e.g., "Hotel" and "hotel" are recognized as the same word).

remove_empty: This method cleans the dataset by removing any rows where the review text is missing. Such rows cannot contribute to text-based analysis and could potentially skew the results if not addressed.

preprocessing: This overarching method applies a sequence of tokenization, stopword removal, and lemmatization to prepare the text for further analysis. The goal is to distill the text down to its most informative elements.

tokenize: Breaking the text into tokens (typically words) is a preparatory step for many NLP tasks. It allows the application of further processing, such as removing stopwords or applying lemmatization, on a word-by-word basis.

remove_stopwords: Stopwords are commonly used words that usually have little lexical content and often do not contribute to the overall meaning of a sentence (e.g., "the", "is", "and"). Removing them helps to focus on the more meaningful content of the text.

lemmatize: Lemmatization is a more context-aware approach than stemming to reducing words to their base or dictionary form. It uses morphological analysis and takes the part of speech of a word into account, which can improve the quality of subsequent analysis by ensuring that words are not incorrectly shortened.

summarize: Although not directly used in the cleaning process, this method provides a way to condense longer texts into a more manageable form while retaining the most informative content. It is particularly useful when very long reviews are used as input to the transformer-based sentiment classification model that we use later in the EDA.
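To make the method descriptions above concrete, here is a hypothetical sketch of part of such a class using only the standard library. It is not the authors' implementation: the regular expressions, the tiny `STOPWORDS` set, and whitespace tokenization are assumptions, and `remove_empty`, `lemmatize`, and `summarize` are left out because they depend on the data frame and on external NLP libraries.

```python
import re

# Tiny illustrative stopword list; an assumption, a real pipeline would use a fuller list.
STOPWORDS = {"the", "is", "and", "a", "was", "to", "of"}


class Cleaner:
    """Hypothetical sketch of some of the text-cleaning steps described above."""

    def remove_usernames(self, text: str) -> str:
        # Drop @mentions such as "@john_doe".
        return re.sub(r"@\w+", "", text)

    def clean_text(self, text: str) -> str:
        # Replace everything except letters and whitespace with a space.
        return re.sub(r"[^A-Za-z\s]", " ", text)

    def lower_text(self, text: str) -> str:
        return text.lower()

    def tokenize(self, text: str) -> list[str]:
        # Whitespace tokenization; also collapses repeated spaces.
        return text.split()

    def remove_stopwords(self, tokens: list[str]) -> list[str]:
        return [t for t in tokens if t not in STOPWORDS]

    def preprocessing(self, text: str) -> list[str]:
        # Tokenize, then remove stopwords (lemmatization is omitted in this sketch).
        return self.remove_stopwords(self.tokenize(text))


cleaner = Cleaner()
raw = "@anna The Hotel was GREAT, 10/10 and close to the station!"
normalized = cleaner.lower_text(cleaner.clean_text(cleaner.remove_usernames(raw)))
tokens = cleaner.preprocessing(normalized)
print(tokens)
```

Lowercasing before stopword removal matters here, since the stopword set contains only lowercase entries.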


3 Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a crucial step in understanding the datasets at hand. It involves summarizing the main characteristics of the data, often with visual methods. The objective of EDA is to see what the data can tell us beyond the formal modeling or hypothesis testing task. This phase is preparatory in nature, intending to uncover the underlying structure, detect outliers, test assumptions, and generate hypotheses.

3.1 Univariate analysis

In our endeavor to explore the hotel reviews dataset, we began with univariate analysis, which focuses on individual variables. This approach is particularly useful when we aim to summarize and find patterns in the data without considering interactions between variables.

We initiated our analysis by examining the 'rating' variable. Ratings are the crux of customer feedback and serve as a direct indicator of customer satisfaction.

Distribution of ratings:


This chart illustrates the frequency of ratings given on a 1 to 5 scale. Observing the distribution, it is evident that the rating of 5 is the most frequent, suggesting a high level of satisfaction among the raters. The frequency drops for lower ratings, with 2 being the least frequent rating. This pattern indicates a skew towards higher ratings and reflects a positive reception of the hotels being evaluated. It suggests that the hotels from which we collected the reviews generally provide good service and that their guests are satisfied.
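The counts behind such a chart can be obtained with `value_counts`. The sketch below uses invented ratings purely for illustration and does not reproduce the project's figures; the variable names are assumptions.

```python
import pandas as pd

# Toy ratings, invented for illustration only.
ratings = pd.Series([5, 5, 4, 5, 3, 1, 4, 5, 2, 5], name="rating")

# Frequency of each rating value, ordered from 1 to 5.
counts = ratings.value_counts().sort_index()

# Share of each rating in percent.
shares = (counts / counts.sum() * 100).round(1)
print(pd.DataFrame({"count": counts, "percent": shares}))
```

With matplotlib installed, `counts.plot(kind="bar")` would draw the corresponding bar chart.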

Trip companions:

In this bar chart, we can see the frequency of trips taken with different companions. It is apparent that the majority of the data falls under 'unknown', indicating that for a large number of trips, the companion type is not specified. Among the known categories, trips taken by couples are the most frequent, followed by those taken with family, suggesting that the hotels are particularly popular with these groups. Trips taken with friends and solo trips are less frequent, with solo trips being the least common of the identified categories.

From our perspective, we might infer that the destination or hotel appeals more to couples and families, potentially due to the amenities, activities, or atmosphere that resonate well

Title: Hotel Customer Reviews Analysis
Authors: Ngo Quang Viet, Tran Tung Lam, Yannic Elias Hanel, Le Dam Quan, Ayoub Ala Mostafa
Supervisor: Tran Viet Trung
University: Hanoi University of Science and Technology
Major: Information and Communication Technology
Type: Project Report
Year of publication: 2023
City: Hà Nội
