Vietnam National University
Ho Chi Minh City University of Technology
Faculty of Computer Science and Engineering
GRADUATION THESIS
A MICROSERVICE-BASED DATA CRAWLING AND ANALYZING FOR REAL ESTATE WEBSITES IN VIETNAM
USING MACHINE LEARNING
Major: Computer Science
Council: Computer Science 2 (English Program)
Instructor: Assoc. Prof. Quan Thanh Tho
Reviewer: Assoc. Prof. Bui Hoai Thang
Students: Pham Thi Mai - 1752335
Nguyen Ngo Chi Khang - 1752275
Nguyen Huu Nguyen - 1752036
Ho Chi Minh City, October 2021
Declaration of Authenticity
We declare that this research is our own work, conducted under the supervision and guidance of Assoc. Prof. Quan Thanh Tho. The results of our research are legitimate and have not been published in any form prior to this. All materials used within this research were collected by ourselves from various sources and are appropriately listed in the references section.
In addition, within this research, we also used the results of several other authors and organizations. They have all been aptly referenced.
In any case of plagiarism, we stand by our actions and will be responsible for them. Ho Chi Minh City University of Technology is therefore not responsible for any copyright infringement conducted within our research.
Ho Chi Minh City, July 2021
Authors
Pham Thi Mai
Nguyen Ngo Chi Khang
Nguyen Huu Nguyen
We are using this opportunity to express our gratitude to everyone who supported us during our study and life. We are thankful for their aspiring guidance, invaluably constructive criticism and friendly advice.
We offer our sincerest and deepest gratitude to our supervisor, Assoc. Prof. Quan Thanh Tho, for his support and guidance. We would also like to thank Ho Chi Minh City University of Technology for giving us the opportunity to learn great lessons of theory and practical experience, as well as the many teachers and professors accompanying us during the curriculum. Finally, we recognize that this research would not have been possible without the support of our families, and from the bottom of our hearts we must acknowledge our parents, without whose love, encouragement and sacrifice we would not have finished this thesis.
First of all, transactional applications have been regarded as a profound facilitator of the success of many websites. A typical application contains three main components: frontend, backend and database. The database is designed for recording transactional data and represents some elements of the real world. The logic is then performed through mappings, data objects, to name but a few. Finally, a visual interface is placed on top of the application to illustrate necessities. Apart from that, for the purpose of analysis, a Data Warehouse is intended to be used as the solution. The mission of a Data Warehouse is to integrate, transform, extract, etc. the operational data. It is a business decision support system aimed at helping users gain comprehensive knowledge about business-affecting factors through business reporting. In the problem of analyzing data, query performance is one of the most important metrics. As the historical data grows, there is a host of techniques used to perform complex queries with a small response time.
Another fact taken into account is that in a decision-making system, prediction and recommendation are exerted to give users the best estimation regarding the specific provided data. Algorithms and mathematics provide the principal knowledge to implement any machine learning model, and data and its quality are the key factors in the output of any prediction concept. Various well-known models have been tried and compared to achieve the highest accuracy. Regarding this thesis and the above motivation, we decided to select the topic A Microservice-based Data Crawling and Analyzing for Real Estate Websites in Vietnam using Machine Learning, implementing a web application aimed at reporting real estate status, offering real estate forecasts and showing real estate items currently emerging in the market.
List of Figures
2.1 ETL from sources to Data Warehouse 4
2.2 Three processes in ETL 5
2.3 Data Warehouse Concepts 11
2.4 Example of Dimension table 13
2.5 Example of Fact table 13
2.6 Example of Star schema 14
2.7 Step in Dimensional Modeling 15
2.8 Example of point outliers in a time series 18
2.9 Example of contextual outliers in a time series 19
2.10 Example of collective outliers in a time series 19
2.11 Z-score in the normal distribution 20
2.12 Symmetric distribution and two types of skewed data 21
2.13 Example of Label encoding and One-hot encoding in Food Name column 23
2.14 Example of a decision tree for regression of playing hours based on weather 25
2.15 Bootstrap and Aggregation 26
2.16 Bootstrap and Aggregation 27
2.17 ANN architecture 35
2.18 Example of K-fold cross validation with test data as validation data in our project 36
2.19 Microservices Architecture 39
3.1 Use-case diagram for the whole system 46
3.2 Get real estate dashboard activity diagram 52
3.3 Search Post Activity Diagram 53
3.4 Get Predicted Real Estate Price Activity Diagram 54
4.1 High-level of system design 57
4.2 Data flow in Scrapy 58
4.3 Real estate posts 59
4.4 Metadata after integrated 60
4.5 Class diagram for integration process 61
4.6 Dashboard System 62
4.7 Data Warehouse schema design 63
4.8 Class Diagram for Dashboard service 64
4.9 Sequence Diagram for Dashboard service 65
4.10 Listings Workflow 65
4.11 Entity Relationship Diagram 66
4.12 Class Diagram for Listings Service 67
4.13 Sequence Diagram for Listings Service 68
4.14 Price Prediction Service Workflow 68
4.15 Price Prediction Service Class Diagram 70
4.16 Price Prediction Service Sequence Diagram 71
4.17 Summary of dataset used for evaluation 72
4.18 Missing values in columns of dataset 72
4.19 Example of label encoded column after applying label encoding 73
4.20 Positive distribution and skewness in target in housing data, land data, renting data respectively 74
4.21 Distribution and skewness after log transformation in price column in housing data, land data, renting data respectively 75
4.22 Positive distribution and skewness in area feature in housing data, land data, renting data respectively 77
4.23 Distribution and skewness after log transformation in area column in housing data, land data, renting data respectively 78
4.24 Example of extracting features from created_date 79
4.25 10 rows with some columns of one-hot encoding 79
4.26 Data splits for training, validation and testing 80
4.27 Component Diagram of Frontend Layer 81
5.1 Asynchronous Programming in Vertx 85
5.2 Structure of Vue component 86
5.3 The use of declarative rendering in template syntax of Vue.js 87
5.4 Code example of Vuejs 88
5.5 Code example of TypeScript in Vuejs 89
5.6 Warning in VSCode 89
5.7 Docker Engine 90
5.8 Docker Architecture 91
5.9 Layered Image 92
5.10 Kubernetes Components in Node 94
5.11 Kubernetes Replication Mechanism 95
5.12 Kubernetes Architecture 96
6.1 Dashboard service web interface 103
6.2 Real estate listing service web interface 105
6.3 Price prediction service web interface 106
List of Tables
2.1 Comparison between Full load and Incremental load 8
2.2 Comparison between Database and Data Warehouse 10
2.3 Comparison between Microservices and Monolithic Architecture 41
4.1 Table of skewness and kurtosis before and after pre-processing of price column 76
4.2 Table of skewness and kurtosis before and after pre-processing of area column 78
6.1 Result of K-fold cross validation with k=10 for house data 99
6.2 Result of K-fold cross validation with k=10 for land data 99
6.3 Result of K-fold cross validation with k=10 for renting data 100
6.4 Performance of XGBoost finalized model on testing data 100
6.5 Performance of Star Schema versus Flat Table in BigQuery 101
6.6 Performance of Spring Boot versus Flask 102
1.1 Problem Statement 1
1.2 Goals and Scopes 1
1.3 Scientific Significance 2
1.4 Practical Significance 2
2 Theoretical Background 4
2.1 Extract Transform Load (ETL) 4
2.2 Data Warehouse 8
2.2.1 Introduction to Data Warehouse 8
2.2.2 Data Warehouse versus Database 9
2.2.3 Components and Architecture of Data Warehouse 10
2.3 Dimensional Modeling 12
2.3.1 Introduction to Dimensional Modeling 12
2.3.2 Elements in Dimensional Data Model 12
2.3.3 Star Schema from Dimensional Modeling 14
2.3.4 Steps of Dimensional Modeling 14
2.4 Feature Engineering 16
2.4.1 Numerical imputation 17
2.4.2 Outliers removal 18
2.4.3 Log transformation 20
2.4.4 Label encoding and One-hot encoding 23
2.4.5 Date extraction 23
2.4.6 Binning 24
2.5 Machine Learning 24
2.5.1 Decision Tree Regression 24
2.5.2 Random Forest Regression 26
2.5.3 Gradient Tree Boosting 28
2.5.4 Extreme Gradient Boosting (XGBoost) 29
2.5.5 K-Nearest Neighbors (KNN) 31
2.5.6 Bayesian Ridge Regression 31
2.5.7 Linear Regression 32
2.5.8 Lasso Regression 33
2.5.9 Ridge Regression 33
2.5.10 Artificial Neural Networks 34
2.6 Cross validation and evaluation metrics 36
2.6.1 K-fold cross validation 36
2.6.2 Evaluation metrics 37
2.7 Database normalization 38
2.8 Microservices 39
3 System analysis and requirement 42
3.1 General System Features 42
3.1.1 Real Estate Dashboard Service 42
3.1.2 Listings Service 42
3.1.3 Price Prediction Service 43
3.2 Functional requirements 43
3.2.1 Real Estate Dashboard Service 43
3.2.2 Listings service 44
3.2.3 Price prediction service 44
3.3 Nonfunctional requirements 44
3.3.1 Real Estate Dashboard Service 45
3.3.2 Listings service 45
3.3.3 Price prediction service 45
3.4 Diagrams 45
3.4.1 Use case description 47
3.4.2 Activity diagrams 52
4 System implementation 56
4.1 Overall System Structure 56
4.2 Data crawling layer 58
4.3 Data integration layer 60
4.4 Service Real Estate Dashboard 62
4.5 Service Real Estate Listings 65
4.6 Service Real Estate Price Prediction 68
4.7 Data Modeling 71
4.7.1 Data used 71
4.7.2 Data pre-processing 71
4.7.3 Data split strategy 80
4.8 Frontend layer 81
5 Technology 82
5.1 Flask 82
5.2 Feature engineering and machine learning libraries 82
5.3 Spring boot 84
5.4 Vertx 84
5.5 Vuejs and Typescript 85
5.6 Docker 89
5.7 Kubernetes 93
6 Result and Evaluation 98
6.1 General System Information 98
6.2 Models Evaluation 98
6.3 Data Warehouse Evaluation 101
6.4 Listing Service Evaluation 101
6.5 Web application 102
7 Summary 107
7.1 Achievement 107
7.2 Thesis Assessment 107
7.3 Future Development 108
Bibliography 109
Of those businesses that have always been beneficial to investors, real estate is a big one due to its incentives. There are several reasons why it is regarded as a good investment, among which the primary one is the profit gained from rental income, appreciation, and business activities that depend on property.
Being aware of these two problems, we decided to launch this project with the purposes of gathering data from the most prominent websites and equipping users with a well-rounded perspective on the real estate market in Vietnam through visualization, estimation and up-to-date information.
The main objective of this project is to develop a system that allows users to keep track of the latest real estate status in the Vietnamese market. The system needs to meet the demand of coping with a huge amount of data from various data sources. At the same time, our system should also guarantee the overall requirements of a standard product, including scalability, high performance, high quality of data and an easy-to-use application, to name but a few. Apart from that, with the data collected, we also provide some decision-making mechanisms derived from Machine Learning models as well as analytics.
Within the scope of this thesis, we have implemented the following work:
• Investigate the real estate market and its current business requirements, and conduct online research on some popular real estate websites.
• Study methodologies and tools related to data processing like Scrapy, Machine Learning, Data Warehouse, etc.
• Build a real estate dataset from websites over a long period.
• Improve and integrate raw data for higher quality, and perform extraction, transformation and loading processes to meet the application database requirements.
• Build a Data Warehouse and apply modeling for fast querying.
• Research a host of Machine Learning models and apply them in a pragmatic way to predict real estate prices.
• Build a real estate dashboard page for statistics.
• Build a transactional page to list the newest real estate items.
• Study and make full use of leading-edge platforms to allow the whole process to run periodically.
1.3 Scientific Significance
• The thesis is not only the result of the synthesis, analysis and experimentation of a wide range of Big Data techniques, but also the application of a variety of data mining approaches.
• The topic singles out the potential of data sources in terms of decision making, the e-commerce sector and further leveraging of data in the future.
• The project is a combination of various advanced technologies which are new concepts and still not widely used in industry nowadays.
1.4 Practical Significance
• The topic provides solutions for a complete system for the collection and analysis of accumulated data. Currently, the topic is only encapsulated in the field of real estate, but it can be expanded to many other areas in the future, such as health and education, to name but a few.
• The topic also researches market operations and provides reasonable analysis towards a collection and statistics application that gives entrepreneurs more aggregated information to shape the trends and understand the current market state.
• The topic assists in the development of applications that support users in making business decisions in the field of real estate.
Chapter 2
Theoretical Background
2.1 Extract Transform Load (ETL)
ETL (Extract, Transform, Load) is a procedure of three functions aimed at migrating data from one source to another. It is also known as the data integration process that imposes data quality, data consistency and data standards, hence data from different sources can be synthesized and put in one place to further discover business insights.
Figure 2.1: ETL from sources to Data Warehouse
ETL comprises three functions, Data Extraction, Data Transformation and Data Loading, which are explained in more detail as follows:
Data Extraction
This step plays an essential role in the whole procedure, because the correctness of this stage directly affects the input of the subsequent stage. The goal of Data Extraction is to retrieve all required data from the sources. Understanding the business requirements is critical in deciding which sources or fields need to be extracted. Depending on the requirements, data sources can be of various types and formats such as Databases, Files, Web Services, etc.
After determining the sources and fields to be extracted, the next step is to design the way data is extracted and the data repository that stores it for the next transformation process. Data validation is also involved to guarantee that the expected data is pulled; otherwise, the next step will fail or output wrong reporting data.
Figure 2.2: Three processes in ETL
There are three types of Data Extraction:
(a) Update notification
In this case, every time a record changes, the source system issues a notification. Many databases now provide this trigger mechanism, and applications support it through webhooks. This keeps the Data Extraction process always up-to-date with the sources.
(b) Incremental extraction
This happens when some sources are not able to provide notifications of modified records; thus, changes should be detectable programmatically. The disadvantage of this approach is that we cannot track the modification in case a record is deleted.
(c) Full extraction
The first extraction requires a full extraction to get all the data. If the two approaches above cannot be applied, full extraction may be the only choice. When the data volume is not high it can be lightweight; however, it should be avoided when the amount of data is large, as it puts a high workload on the network.
Data Transformation
Data Transformation is the process of converting data from its original form into another particular format. This key step plays an important role in data integration and data management. A series of functions and rules are applied to clean, map and transform the data to serve business purposes. During this process, users can compare datasets and make some data-driven decisions. Data transformation comprises several basic sub-processes such as:
• Cleansing: one of the most principal stages, which aims to obtain proper data for the target. Inconsistencies and missing values in the data are removed, for instance removing NULL values, enforcing date-time consistency, etc.
• Standardization: also known as the format revision stage; formatting rules are applied to all values, for example: character set, units of measurement, etc.
• Deduplication: resolve redundant data, discard duplicate rows.
• Verification: unusable data or anomalies are removed or flagged.
Furthermore, other advanced transformations are also used:
• Derivation: based on business rules, create calculations to generate new values from existing data, for example: sale amount = quantity times unit price.
• Filtering: only necessary rows/columns are selected to load.
• Joining: linking data from multiple sources and deduplicating them.
• Splitting: splitting one column into multiple columns if necessary.
• Data validation: applying forms of validation; depending on the exception handling and design rules for the values in each column, it can lead to fully rejecting the data for the next processing.
• Summarization and Aggregation: values from multiple columns or sources are summarized or aggregated to be stored at multiple levels of detail or metrics, for instance summarizing total sales for each store.
• Sorting: data is organized in an order to improve search performance.
• Transposing (aka pivoting): turning multiple rows into columns or vice versa.
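As a concrete illustration of a few of these sub-processes (cleansing, standardization, deduplication and derivation), here is a minimal pandas sketch; the column names and values are made up for illustration and are not taken from the thesis implementation.

```python
import pandas as pd

# Hypothetical raw extract with a missing value, inconsistent formats and a duplicate row.
raw = pd.DataFrame({
    "posted_date": ["2021-01-01", " 2021-01-02 ", None, "2021-01-01"],
    "quantity": [2, 3, 1, 2],
    "unit_price": ["1,000", "2000", "1500", "1,000"],
})

# Cleansing: remove rows with NULL dates.
df = raw.dropna(subset=["posted_date"]).copy()

# Standardization: unify date and numeric formats.
df["posted_date"] = pd.to_datetime(df["posted_date"].str.strip())
df["unit_price"] = df["unit_price"].str.replace(",", "", regex=False).astype(float)

# Deduplication: discard duplicate rows.
df = df.drop_duplicates()

# Derivation: sale_amount = quantity * unit_price.
df["sale_amount"] = df["quantity"] * df["unit_price"]
print(df)
```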
Due to the fact that Data Transformation is often time-consuming and costly, it is critical to choose a solution that can expedite the process. Architecturally speaking, there are a couple of ways to handle the transformation process:
• Multistage data transformation: extracted data is moved to a staging area; the transformation takes place there and is completed before loading into the next data store.
• In-warehouse data transformation: the process flow differs from the above approach in that data is extracted and loaded into the warehouse first, and the transformations are operated inside this warehouse. This method has been preferred in recent years.
Data Loading
Data Loading is the last process in ETL, responsible for writing the newly transformed data into the destination. After completing this stage, the data is ready so that:
• Business Intelligence or analytics tools can be layered on top of the warehouse.
• Search tools may be created from here.
• Fraud or outlier detection can be built with Machine Learning algorithms.
• A real-time alerting system can be implemented to discover unusual events.
In some cases, a large volume of data needs to be loaded in a relatively short period; thus, this process should be optimized for performance. Moreover, not every loading process succeeds; hence, there should be a recovery mechanism, which saves a checkpoint when the loading fails. Later, the process can resume loading from this checkpoint, so the system avoids loading the whole data again and prevents the chance of data loss.
Data Loading can be divided into two primary types:
• Full load: the entire dataset is dumped into the new store; this usually takes place the first time ETL runs.
• Incremental load: the last updated time is stored, and only new data is loaded at regular intervals (see the sketch after this list). Based on the volume of data and how data arrives, it can be:
– Streaming incremental load: data is streamed record by record.
– Batch incremental load: data is grouped into batches to load.
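The following is a minimal sketch of a batch incremental load; the posts table, the `updated_at` column and the checkpoint file are hypothetical illustrations of the idea, not the thesis implementation.

```python
import sqlite3
from pathlib import Path

CHECKPOINT = Path("last_loaded_at.txt")  # high-water mark kept between runs

def incremental_load(src: sqlite3.Connection, dst: sqlite3.Connection, batch_size: int = 1000) -> None:
    # Read the checkpoint (last updated time already loaded); default to the epoch.
    last = CHECKPOINT.read_text().strip() if CHECKPOINT.exists() else "1970-01-01 00:00:00"

    # Pull only rows modified after the checkpoint, ordered so the checkpoint can advance safely.
    cur = src.execute(
        "SELECT id, title, price, updated_at FROM posts "
        "WHERE updated_at > ? ORDER BY updated_at", (last,))
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        dst.executemany(
            "INSERT OR REPLACE INTO posts (id, title, price, updated_at) VALUES (?, ?, ?, ?)",
            rows)
        dst.commit()
        # Advance the checkpoint only after the batch is committed (simple recovery point).
        CHECKPOINT.write_text(rows[-1][3])
```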
Table 2.1: Comparison between Full load and Incremental load
Difficulty: a full load is easy to implement; an incremental load is more difficult, since additional rows must be checked for and a recovery mechanism is hard to implement.
Maintenance: a full load carries a high risk when data grows exponentially; an incremental load is less expensive to maintain and manage.
2.2 Data Warehouse
2.2.1 Introduction to Data Warehouse
(a) What is Data Warehouse
Data Warehouse (DWH) is a central repository of data from multiple sources, used for analysis and decision making. The term Data Warehouse is not a new terminology; it emerged to handle an increasing amount of information. A Data Warehouse is used as the architectural construction of an information system and can be known under the following names: Decision Support System (DSS), Executive Information System, Business Intelligence Solution, etc.
End users access the DWH to select the information needed for analytics. Data from system sources or external information is periodically pulled, goes through ETL processes and is dumped into the DWH. Therefore, it is not loaded every time new data is generated, but is loaded at time intervals (daily, monthly, etc.) instead. The DWH is maintained separately from the operational database system.
(b) Advantages of Data Warehouse
A DWH brings an organization many benefits:
• Data is queried with good performance and better accuracy. Users can access data from multiple sources in one place, thereby minimizing the time to query them as well as reducing the workload on production systems; it also provides consistent information and keeps stakeholders from overestimating the quality of the data. Moreover, when analytics processing and transactional processing are separated, the performance of both systems obviously improves.
• Strategic questions are answered and smarter decisions are made. The DWH allows storing a huge amount of data, including historical data, to analyze and predict trends, powering reports and dashboards that give users a broad view of the business market.
(c) Data Warehouse characteristics
“A Data Warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process.” - Bill Inmon, Father of Data Warehousing
A Data Warehouse has four main properties:
• Subject-oriented: Data is categorized and stored by business subjects (or themes) instead of organization operations. Business subjects can be sales, real estate, marketing, etc. The Data Warehousing process focuses on demonstrating and analyzing data related to a specific theme by eliminating information which is not required to make the decisions.
• Integrated: Similar data from disparate sources are built into a shared entity. They all follow one reliable naming convention, column scaling, encoding structure, etc. Due to that, it benefits the analysis of data.
• Time-variant: Data is updated at periods of time such as weekly, monthly, etc. Therefore, the time limits of the data warehouse are wider than those of operational databases. The data residing in the data warehouse is predictable with a specific interval of time and delivers information from the historical perspective. Another feature of time-variance is that once data is stored in the data warehouse, it cannot be modified, altered, or updated.
• Non-volatile: Once the data is loaded into the DWH, it stays there permanently. In the warehouse, data is read-only and reloaded at particular intervals. Some operations of normal databases, including delete, update, and insert, are invalid here; instead, data loading and data accessing are the two operations. Transaction processing, recovery and concurrency control mechanisms are not necessary.
2.2.2 Data Warehouse versus Database
A Data Warehouse is not a Database. There are some key differences between them.
Table 2.2: Comparison between Database and Data Warehouse
Usage: a Database is used for transactional purposes with quick writes/reads; a Data Warehouse is used for analytical purposes and is optimized for aggregating transactional data.
Data modification: data stored in the Database is up to date; a Data Warehouse stores current and historical data and may not be up to date.
Availability: Database data is available in real time; Data Warehouse data is refreshed from the source systems at specific intervals.
Query type: simple transactional queries are used on a Database; complex queries are used on a Data Warehouse for analysis purposes.
2.2.3 Components and Architecture of Data Warehouse
There are mainly five components in a Data Warehouse:
(a) Central Database
The central database, which is implemented on a Relational Database, is the foundation of the data warehousing environment. Data collected from various sources resides here to make it manageable.
(b) ETL Tools
As mentioned above, ETL tools are responsible for conversions, summarization, etc., putting all data into a suitable arrangement and loading it into the Data Warehouse. These tools handle database and data heterogeneity challenges.
(c) Metadata
Metadata is the data about the data warehouse and offers a framework for the data. It is a facilitator in building, maintaining, managing and making use of the data warehouse, and it is closely connected to it. Metadata is categorized into two kinds:
• Technical Metadata: contains information about the warehouse used by developers and managers when executing related tasks.
• Business Metadata: contains details that help end users understand the information stored.
(d) Data Warehouse Access Tools
These tools act as an interface for users to interact with the Data Warehouse system. There are some familiar tools, including:
• Query and reporting tools: provide interactive visuals or sheets for reporting; an alternative is regular report generation.
• Application development tools: custom reporting tools are developed in the form of an application in case the built-in tools do not meet user requirements.
• Data mining tools: insights like models, patterns, trends, etc. are discovered by the data mining process.
• OLAP tools: a multi-dimensional data warehouse is constructed, and users are allowed to analyze data using elaborate and complex multidimensional views.
(e) Data Warehouse Bus
The Data Warehouse bus defines the data flow of the warehouse. It can be divided into some types: Inflow, Upflow, Downflow, Outflow and Meta flow. The Data Bus has to be designed properly because shared dimension and fact tables are used across data marts. A data mart is an access layer created for a group of users; it is a small version of the Data Warehouse that handles only one subject. There are three types of Data Mart:
• Dependent Data Mart: the data is ETLed from the OLTP sources and populates the central Data Warehouse; from the DWH, the data is driven into the Data Mart.
• Independent Data Mart: the data travels directly from the source systems. This happens in the case of small organizations.
• Hybrid Data Mart: the data is provided from both OLTP sources and the DWH.
Figure 2.3: Data Warehouse Concepts
First of all, data from source systems or external files are extracted, transformed and loaded into one place called the Staging Database. The Staging Database is the combination of all enterprise data; it then goes through the ETL process into the Data Warehouse and becomes Raw Data. At this step, Metadata and Aggregate Data are constructed based on the Raw Data. Once the data is stable in the Data Warehouse, a Data Mart related to each subject is generated and is ready for applying reporting and mining tools.
Nowadays, the three-tier Data Warehouse architecture is the most widely used, consisting of three tiers:
• Top tier (or Analytics layer): the front-end client layer. It holds the data warehouse access tools that let users interact with data, create dashboards and reports, monitor KPIs, mine and analyze data, build apps, etc.
• Middle tier (or Semantics layer): where OLAP and OLTP servers restructure the data for fast, complex queries and analytics. This layer also acts as an intermediate layer between the end user and the database.
• Bottom tier (or Data layer): Data Warehouse servers are at this layer, which means it consists of the database server and data marts. Metadata and data aggregation are created in this tier by data integration tools.
2.3 Dimensional Modeling
2.3.1 Introduction to Dimensional Modeling
After ETL of raw data into the Data Warehouse, the Data Bus architecture is designed; Data Marts can then be joined across it for fast, complex retrieval. For this, Ralph Kimball developed a technique called Dimensional Modeling (DM). “Dimensional Modeling includes a set of methods, techniques and concepts for use in data warehouse design”, aimed at making information easy to access, consistent, adaptable and receptive to change; information is presented in a timely, authoritative and trustworthy way for making decisions. The concept of a Dimensional Model consists of conformed “fact” and “dimension” tables; OLTP databases are transformed into this concept by the DM technique.
2.3.2 Elements in Dimensional Data Model
Dimensional Data Model (DDM) consists of two elements:
(a) Dimension
• A dimension is a table that describes a business event, like product, sale, etc.
• Dimensions are what users would want to sort, group and filter on, like customer, date, etc.
• Fields in a dimension table contain element descriptions.
• A dimension table can be referenced by multiple fact tables.
Figure 2.4: Example of Dimension table
(b) Fact
• A fact in the DDM is a table containing measurements: a measurable metric which is described by the dimensions, such as sale amount, order quantity, etc.
• The grain of detail in a fact table is defined by the related dimensions.
• A fact table includes a set of dimension keys and measures.
• A measure in a fact table can be summed, averaged or aggregated.
• Every dimension table is linked to a fact table.
Figure 2.5: Example of Fact table
In this example, FK_Date, FK_Location and FK_Product reference the dimension tables Date, Location and Product respectively, while Quantity and Amount are measurements.
2.3.3 Star Schema from Dimensional Modeling
When designing a Dimensional Data Model, a type of schema is chosen; the schema is a logical description of the entire Data Warehouse. There are many types of schema: Star schema, Snowflake schema and Fact Constellation schema. However, this study focuses on the Star schema, which is a popular solution for warehouse design and is the one used in this thesis. In a star schema:
• Each dimension is represented with a one-dimension table which contains a set of attributes.
• The fact table is at the center and contains keys to every dimension table as well as the measurable attributes.
Figure 2.6: Example of Star schema
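To illustrate how the fact table and its dimensions are combined at query time, the following pandas sketch joins a small fact table with two dimension tables and aggregates a measure; the table and column names are simplified assumptions inspired by Figures 2.4-2.6, not the schema used in this thesis.

```python
import pandas as pd

dim_date = pd.DataFrame({"date_key": [1, 2], "month": ["2021-01", "2021-02"]})
dim_location = pd.DataFrame({"location_key": [10, 11], "city": ["Ha Noi", "Ho Chi Minh"]})
fact_sales = pd.DataFrame({
    "date_key": [1, 1, 2],
    "location_key": [10, 11, 11],
    "amount": [100.0, 250.0, 300.0],
})

# A typical star-schema query: join the fact table to its dimensions,
# then group by descriptive attributes and aggregate the measure.
report = (fact_sales
          .merge(dim_date, on="date_key")
          .merge(dim_location, on="location_key")
          .groupby(["month", "city"], as_index=False)["amount"].sum())
print(report)
```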
2.3.4 Steps of Dimensional Modeling
When building a Data Warehouse, the accuracy of the Dimensional Modeling determines the success of the Data Warehouse implementation. It requires a deep understanding of business processes and, from that, a deliberate approach to design. There are five main steps: identify the business process, identify granularity, identify Dimensions, identify Facts and build the Star schema.
(a) Identify the Business Process
The first step is to gather requirements and identify a business process to be covered. This could be Marketing, Sales, HR, etc., as per the data analysis needs of the organization. Designers usually have discussions with business users to understand their reporting requirements, how they perceive the business process, what data metrics they would want to use in their reports, etc. The selection of the business process also depends on the quality of data available for that process. After that, it is also necessary to consult with the system experts who work with the data sources. This is the most important step of the Data Modeling process, and a failure here would have cascading and irreparable defects.
Figure 2.7: Step in Dimensional Modeling
(b) Identify granularity (level of detail)
Identifying granularity refers to the process of identifying the lowest level of information for any table in the Data Warehouse, as well as the level of detail in reports for the business problem. If a table contains sales data for every day, then it has daily granularity. If a table contains total sales data for each month, then it has monthly granularity.
(c) Identify Dimensions
Dimensions are objects like sales, product, store and employee dimensions, etc. Dimension tables are denormalized and support meaningful answers to business questions. The columns of dimension tables are descriptive attributes in the Data Warehouse. These dimensions are where all the data should be stored. For example, the date dimension may contain data like year, month and day.
(d) Identify Facts
The fact table holds measurable data and contains foreign keys to each of the dimensions. This step is co-associated with the business users of the system, because this is where they get access to data stored in the data warehouse. Most of the fact table rows are numerical values like price or cost per unit, etc.
(e) Build Schema
This step is the implementation of the Dimensional Model. A schema is nothing but the database structure, i.e. the arrangement of tables. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables.
Rules for Dimensional Modelling
When designing a Dimensional Model, some rules must be followed:
• Load atomic data into dimensional structures
• Build dimensional models around business processes
• Need to ensure that every fact table has an associated date dimension table
• Ensure that all facts in a single fact table are at the same grain or level of detail
• It’s essential to store report labels and filter domain values in dimension tables
• Need to ensure that dimension tables use a surrogate key
• Continuously balance requirements and realities to deliver a business solution that supports decision-making.
2.4 Feature Engineering
Feature engineering is the technique of preparing a proper input dataset, compatible with the machine learning algorithm requirements, and improving the performance of machine learning models by transforming their feature space; it is the practice of constructing suitable features from the given features of the dataset. Several techniques can be applied to the dataset for better performance and prediction results, such as: numerical imputation, date extraction, binning, outlier removal, log transformation, one-hot encoding and label encoding.
2.4.1 Numerical imputation
(a) Missing values
Missing values are one of the most common problems and biggest challenges encountered by data scientists when preparing data for machine learning models. The reasons for missing values might be human errors, interruptions in the data flow, privacy concerns, and so on. Whatever the reason, most machine learning algorithms are not powerful enough to handle missing data. Missing data can lead to ambiguity, misleading conclusions and results, or even the inability to train models.
There are two types of missing values; the first type is called missing completely at random (MCAR), where the probability that a value is missing does not depend on either the observed or the unobserved data.
The imputation of missing data has two types: single imputation and multiple imputation. Single imputation contains several approaches, such as zero imputation, mean imputation, median imputation and regression imputation. Mean imputation, median imputation and zero imputation are mostly used by data scientists; they replace the missing data with the sample mean, the median or zero. A limitation of zero imputation is that it is only a sensible solution when the feature is the count of something. Mean imputation has the disadvantage that if a large amount of data is missing, all of it is replaced with the same mean, which changes the shape of the distribution; it is also sensitive to outliers, whereas the median is more robust in this respect. Regression imputation is a technique based on the assumption of a linear relationship between the attributes. The advantage of regression imputation over mean imputation is that it is able to preserve the distribution shape.
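A short scikit-learn sketch of the single-imputation strategies described above, applied to a hypothetical numeric column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature with missing values.
area = np.array([[50.0], [np.nan], [80.0], [65.0], [np.nan]])

# Mean, median and zero (constant) imputation.
for strategy, kwargs in [("mean", {}), ("median", {}), ("constant", {"fill_value": 0.0})]:
    imputer = SimpleImputer(strategy=strategy, **kwargs)
    print(strategy, imputer.fit_transform(area).ravel())
```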
2.4.2 Outliers removal
(a) Outliers
Outliers are noisy data points that lie at an abnormal distance from the other values in a random sample from a population, or that have abnormal behaviour compared with the rest of the data in the same dataset.
Outliers can be of two kinds: univariate and multivariate. Univariate outliers can be found when looking at a distribution of values in a single feature space. Multivariate outliers can be found in an n-dimensional space (of n features). Outliers can also come in different flavours, depending on the environment:
• Point outlier (global outlier): an individual data instance that can be considered odd with respect to the rest of the data, for example in intrusion detection in computer networks.
Figure 2.8: Example of point outliers in a time series
• Contextual outlier (conditional outlier): an instance of data that can be regarded as odd in a specific context or condition but not otherwise. An example of context is the longitude of a location. In time series, the “context” is almost always temporal, because time series data are records of a specific quantity over time. So, contextual outliers are common in time series data: values that are not outside the normal global range, but are abnormal compared to the seasonal pattern.
Figure 2.9: Example of contextual outliers in a time series
• Collective outlier: if a collection of data points is anomalous with respect to the entire data set, it is termed a collective outlier. For example, many individual data points may look normal on their own, but when combined together they are anomalous with respect to the entire data set.
Figure 2.10: Example of collective outliers in a time series
There are many common causes of outliers in a dataset: data entry errors (human errors), measurement errors (instrument errors), experimental errors (data extraction or experiment planning/execution errors), intentional outliers (dummy outliers made to test detection methods), data processing errors (data manipulation or unintended dataset mutations), sampling errors (extracting or mixing data from wrong or various sources), and natural outliers (not an error, novelties in data).
Some of the most popular methods for outlier detection are: Z-score or extreme value analysis (parametric), probabilistic and statistical modeling (parametric), linear regression models (PCA, LMS), proximity-based models (non-parametric), information theory models, and high-dimensional outlier detection methods (for high-dimensional sparse data).
(b) Outlier detection using Z-score
The z-score or standard score of an observation is a metric that indicates how many standard deviations a data point is from the sample's mean, assuming a gaussian distribution. For example, a Z-score of 2 indicates that a data point is two standard deviations above the average, while a Z-score of -2 signifies it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean. The Z-score is a simple and powerful method to get rid of outliers in data when dealing with gaussian distributions in a low-dimensional feature space. Very strange data points cannot be described by a gaussian distribution; this problem can be solved by applying transformations to the data (the log transformation in our project).
Z = X− µ
σThe further away an observation’s Z-score is from zero, the more unusual it is A standardcut-off value for finding outliers are Z-scores of +/-3 or further from zero
Figure 2.11: Z-score in the normal distribution
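A minimal sketch of the Z-score rule with the standard +/-3 cut-off, on synthetic data with one injected outlier:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = np.append(rng.normal(loc=1.5, scale=0.2, size=200), 50.0)  # 50.0 is the injected outlier

z = (prices - prices.mean()) / prices.std()   # Z = (X - mu) / sigma
outliers = prices[np.abs(z) > 3]              # more than 3 standard deviations from the mean
cleaned = prices[np.abs(z) <= 3]
print(len(prices), len(cleaned), outliers)
```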
2.4.3 Log transformation
(a) Skewness
If one tail is longer than the other, the distribution is skewed. These distributions are sometimes called asymmetrical distributions. Symmetry means that one half of the distribution is a mirror image of the other half. For example, the normal distribution is a symmetric distribution with no skew: the tails are exactly the same.
There are two types of skewed data: left-skewed and right-skewed. A left-skewed distribution has a long left tail. Left-skewed distributions are also called negatively-skewed distributions because there is a long tail in the negative direction on the number line; the mean is also to the left of the peak. In contrast, when there is a long tail in the positive direction and the mean is to the right of the peak, it is a right-skewed (positively-skewed) distribution.
Figure 2.12: Symmetric distribution and two types of skewed data
In the figure, with a positively skewed distribution, the value of the mean is the greatest one, followed by the median and then by the mode (mode < median < mean). On the contrary, the mean is the smallest one in a negatively skewed distribution (mean < median < mode). Skewness is the measure of how much the probability distribution of a random variable deviates from the normal distribution, which is the probability distribution without any skewness. There are several ways to measure skewness; Pearson's first and second coefficients of skewness are two common ones:

Sk1 = (X̄ − Mo) / s
Sk2 = 3 · (X̄ − Md) / s

where:
• Sk1 and Sk2 are Pearson's first and second coefficients of skewness
• s is the standard deviation of the sample
• X̄ is the mean value
• Mo is the modal (mode) value
• Md is the median value
With left-skewed data, the value of the skewness is negative, and the contrary holds for right-skewed data. If the data has a weak mode or multiple modes, Pearson's second coefficient may be preferable, as it does not rely on the mode as a measure of central tendency.
If there is too much skewness in the data, then many machine learning models do not work well, because the tail region may act as outliers for the statistical model, and outliers adversely affect a model's performance, especially for regression-based models. Hence, there is a need to transform the skewed data to be close enough to a gaussian (normal) distribution, and the log transformation method is the solution we use.
In our project, skewness is measured with the SciPy library as the Fisher-Pearson coefficient g1 = m3 / m2^(3/2), where mk = (1/n) · Σi (xi − x̄)^k is the biased sample central moment and x̄ is the sample mean.
Apart from skewness, we also measure kurtosis using the SciPy library; kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution. In other words, it identifies whether the tails of a given distribution contain extreme values.
(b) Log transformation method
A log transformation is a method used to address skewed data. It is used to make data conform to normality; it reduces the impact of outliers, due to the normalization of magnitude, while still preserving the relative distances between data points in the dataset; and it reduces the variability of the data, because high variability means that the values are less consistent, which makes it harder to make predictions.
The formula for the log transformation is:

Y = log10(X)

with:
• Y is the transformed feature
• X is the original feature
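The sketch below measures skewness and kurtosis with SciPy before and after the log transformation, on a synthetically generated right-skewed column; the data is made up, while the thesis applies the same idea to the price and area columns:

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(42)
price = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed synthetic data

log_price = np.log10(price)  # Y = log10(X)

print("before:", round(float(skew(price)), 2), round(float(kurtosis(price)), 2))
print("after :", round(float(skew(log_price)), 2), round(float(kurtosis(log_price)), 2))
```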
2.4.4 Label encoding and One-hot encoding
Label encoding is a feature engineering technique used to convert labels into numeric form so as to make them machine-readable. It converts a column with n observations of d distinct values, which is in an inappropriate form, into a single column with d numeric labels.
One-hot encoding is one of the most common encoding methods in machine learning. It is used to convert categorical features into a suitable format which is understandable by machine learning algorithms. This technique transforms a single categorical variable with n samples and d distinct values into d binary variables with values of 0 and 1, where 1 indicates presence and 0 indicates absence.
For example, the Food Name column is converted into one categorical column by label encoding, and into 3 variables from the 3 distinct values of Food Name by one-hot encoding.
Figure 2.13: Example of Label encoding and One-hot encoding in Food Name column
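A brief pandas/scikit-learn sketch of both encodings on a hypothetical categorical column, mirroring the Food Name example of Figure 2.13:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"food_name": ["Apple", "Chicken", "Broccoli", "Apple"]})

# Label encoding: one column with d distinct values becomes one numeric column.
df["food_label"] = LabelEncoder().fit_transform(df["food_name"])

# One-hot encoding: one column with d distinct values becomes d binary columns.
one_hot = pd.get_dummies(df["food_name"], prefix="food")

print(pd.concat([df, one_hot], axis=1))
```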
2.4.5 Date extraction
Date columns come in various formats which are nonsensical for machine learning algorithms; for example, a simple date like '1/1/2020' cannot be used directly in training models. Furthermore, building an ordinal relationship between the values is very challenging for a machine learning algorithm if the date columns are left without manipulation. There are three types of pre-processing for dates:
• Extracting the parts of the date into different columns: year, month, day, etc
• Extracting the time period of the date in year, month, week like: season, weekday, etc
• Extracting some specific features from the date: weekend or not, holiday or not, start of a month/year or not, end of a month/year or not, etc.
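A small pandas sketch of the three kinds of date extraction, using a hypothetical created_date column:

```python
import pandas as pd

df = pd.DataFrame({"created_date": pd.to_datetime(["2020-01-01", "2021-07-24", "2021-10-30"])})

# Parts of the date as separate columns.
df["year"] = df["created_date"].dt.year
df["month"] = df["created_date"].dt.month
df["day"] = df["created_date"].dt.day

# Time periods within the year / week.
df["quarter"] = df["created_date"].dt.quarter
df["weekday"] = df["created_date"].dt.dayofweek

# Specific boolean features.
df["is_weekend"] = df["created_date"].dt.dayofweek >= 5
df["is_month_start"] = df["created_date"].dt.is_month_start
print(df)
```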
2.4.6 Binning
Binning is a feature engineering technique that can be applied to both categorical and numerical data. Observed values can be combined into groups by splitting them into smaller sub-intervals or by assigning a general category to less frequent values. Binning can also be considered as data discretisation, which is a technique to cut a continuous value range into a finite number of sub-ranges, where a categorical value is associated with each of them. The main motivation of binning is to reduce the impact of statistical noise, reduce overall complexity, make the model more robust and prevent overfitting. For instance, for categorical columns, the labels with low frequencies probably affect the robustness of statistical models negatively, so replacing less frequent values with a general one helps to keep the model robust. However, it has a cost in performance, because binning sacrifices information; this is exactly the trade-off between performance and overfitting.
Although there are several binning methods, equal-width, equal-size, and multi-interval discretisation binning are common techniques. Equal-width binning is an approach where the whole range of predictor values is divided into a pre-specified number of equal-width intervals. Equal-size binning is an approach where the range of predictor values is split into intervals in such a way that the bins contain an equal number of observations; the width of the bins then depends on the density of observations. Regarding multi-interval discretisation binning, for each continuous-valued attribute we select the best cut point T from its range of values. After sorting by increasing value of the attribute A, the midpoint between each successive pair of examples is a potential cut point. Once a cut point T is found for the complete interval S, the process is repeated recursively for the sub-intervals until there is no substantial improvement in entropy, by minimizing the information entropy of the partition induced by T: S1 ⊂ S, S2 = S − S1, where

Ent(S) = − Σ_{i=1..k} P(Ci, S) · log(P(Ci, S))

with Ci being the classes in the input dataset S.
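A small sketch of equal-width and equal-size binning with pandas, plus the grouping of infrequent categorical labels into a generic value; the data and the 5% frequency threshold are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
area = pd.Series(rng.uniform(20, 500, size=1000))

equal_width = pd.cut(area, bins=5)   # 5 intervals of equal width
equal_size = pd.qcut(area, q=5)      # 5 bins with a roughly equal number of observations

# Categorical binning: replace labels with low frequency by a generic "Other".
district = pd.Series(rng.choice(["D1", "D2", "D3", "Rare1", "Rare2"],
                                size=1000, p=[0.4, 0.3, 0.25, 0.03, 0.02]))
freq = district.value_counts(normalize=True)
district_binned = district.where(district.map(freq) >= 0.05, other="Other")

print(equal_width.value_counts().sort_index())
print(equal_size.value_counts().sort_index())
print(district_binned.value_counts())
```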
2.5 Machine Learning
2.5.1 Decision Tree Regression
A decision tree is a non-parametric supervised learning algorithm which is used in data mining and machine learning. The structure of a tree includes many nodes which correspond to features, and the edges between nodes are values of those attributes. In addition, the leaves are labels (classification) or values of the target variable (regression).
The core algorithm for building decision trees, called ID3 and proposed by J. R. Quinlan, employs a top-down, greedy search through the space of possible branches with no backtracking. The ID3 algorithm can be used to construct a decision tree for regression by replacing Information Gain with Standard Deviation Reduction.
Figure 2.14: Example of a decision tree for regression of playing hours based on weather
The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction, which is the difference between the standard deviation of the parent node and the weighted sum of the standard deviations of all children nodes:

S(T, X) = Σ_{c ∈ X} P(c) · S(c)

where X is a feature used for splitting a parent node into several children nodes, P(c) is the proportion of the number of samples in child node c to the number of samples in the parent node, and S(c) is the standard deviation of the target in the dataset of child node c. So, the standard deviation reduction is measured by:

SDR(T, X) = S(T) − S(T, X)

where S(T) is the standard deviation of the target variable in the dataset of the parent node.
A decision tree has the following upsides and downsides:
• It is easy to understand, and visualization of the decision tree is also supported for explainability.
• Data processing and preparation are simple, and a decision tree can deal with missing values well.
• A decision tree can overfit when the size of the tree is big: it learns and fits the training data closely, but does not perform well on testing data.
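To make the standard deviation reduction concrete, the following sketch computes S(T), S(T, X) and SDR(T, X) for a toy weather/playing-hours dataset; the numbers are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy dataset: playing hours (target) and a candidate splitting feature "outlook".
df = pd.DataFrame({
    "outlook": ["sunny", "sunny", "rain", "rain", "overcast", "overcast"],
    "hours":   [3.0, 2.5, 1.0, 1.5, 4.0, 4.5],
})

s_parent = np.std(df["hours"])  # S(T): standard deviation of the target in the parent node

# S(T, X) = sum over children c of P(c) * S(c)
s_children = sum(len(group) / len(df) * np.std(group["hours"])
                 for _, group in df.groupby("outlook"))

sdr = s_parent - s_children     # SDR(T, X) = S(T) - S(T, X)
print(round(s_parent, 3), round(s_children, 3), round(sdr, 3))
```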
2.5.2 Random Forest Regression
A Random Forest is an ensemble technique capable of performing both regression and classification tasks by using multiple decision trees and a technique called Bootstrap and Aggregation, commonly known as bagging.
Figure 2.15: Bootstrap and Aggregation
Bootstrap refers to random sampling with replacement. Each decision tree selects a subset of features and many random data samples, which are chosen using the bootstrap technique, for training. Combining multiple decision trees to determine the final output, rather than relying on individual decision trees, is called aggregation. In particular, in the case of a classification problem, the final output is taken using a majority vote (the mode of the classes), and in the case of a regression problem, the final output is the mean of all the outputs (mean prediction).
Figure 2.16: Bootstrap and Aggregation
In random forest regression, the whole process is as follows:
• Each tree is created from a different data sample, and at each node a different sample of features is selected for splitting.
• Each of the trees makes its own individual prediction.
• These predictions are then averaged to produce a single result:

f(x) = (1/M) · Σ_{m=1}^{M} f_m(x)

where M is the number of trees and f_m(x) is the prediction of the m-th tree, trained on N training examples, for the data point x.
Every decision tree has high variance, but when we combine all of them together in parallel, the resultant variance is low, as the aggregated result of many trees generally outperforms any individual decision tree's output.
The number of features used at each node of a tree is limited to a part of the total (not all features are used). This ensures that the ensemble model does not rely too heavily on any individual feature and makes fair use of all potentially predictive features. Each tree is also trained on data samples which are drawn randomly, which prevents overfitting. Together, these mechanisms help prevent the trees from being too highly correlated.
Advantages of random forest:
• It is one of the best learning algorithms available for both classification and regression.
• It is easy to use; there are not so many parameters.
• Data preparation is simple or unnecessary.
• It runs efficiently on large databases, because each node selects only a percentage of the total features and the trees are learned in parallel.
• It can handle both continuous and categorical features.
• It is effective when a large proportion of the data is missing.
• It can handle thousands of input variables without variable deletion.
Disadvantages of random forest:
• Random forest is a black-box model; the result is difficult to explain.
• Random forests have been observed to overfit on some datasets with noisy classification/regression tasks.
• While bagging gives us more accuracy, it is computationally expensive and may not be desirable depending on the use case.
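A minimal scikit-learn sketch of random forest regression as described above, on synthetic data (the hyper-parameter values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each tree is fit on a bootstrap sample and a random subset of features;
# the final prediction is the mean of the individual tree predictions.
model = RandomForestRegressor(n_estimators=200, max_features="sqrt", random_state=0)
model.fit(X_train, y_train)
print("R^2 on test data:", round(model.score(X_test, y_test), 3))
```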
2.5.3 Gradient Tree Boosting
Gradient boosting came from AdaBoost, where weak learners are converted into strong learners. Weak and strong learners are just trees, and each new tree is fit on a modified version of the original data set. AdaBoost and the Gradient Boosting Algorithm have different ways of identifying the shortcomings of the weak learners (e.g. decision trees): AdaBoost uses high-weight data points, whose weights are increased after evaluating the previous tree (a punishment for observations that are difficult to predict), while gradient boosting uses gradients of the loss function.
On each iteration, the gradient is first computed in order to fit a new base learner function. Once the best gradient descent step size is found, the function estimate is updated. Here is the formula of gradient descent:

θ_{n+1} = θ_n − η · L′(θ_n)

where η is the learning rate and L′(θ_n) is the gradient.
The new model is fitted using information about the errors of the previous model. The resulting predictor is the combination of all models. GBMs build trees one at a time, and new trees help to correct the errors made by previous trees. Typically, the number of trees, the depth of the trees and the learning rate are the most important parameters. The recommendation is to choose smaller trees rather than larger ones. Other regularization methods include stochastic gradient boosting, which is boosting with sub-sampling per split, and regularized gradient boosting, which is boosting with L1 or L2 regularization. At each iteration, the data is sampled without replacement and the tree is built using only that sample of the data.
Compared to random forest, GBM tends to overfit more, but at the same time, if the hyperparameters are set correctly, GBM can make strong predictions on the data.
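A brief scikit-learn sketch of gradient tree boosting with the key hyper-parameters mentioned above (number of trees, tree depth, learning rate and row sub-sampling); the values are illustrative only:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=300,    # number of trees added sequentially
    max_depth=3,         # small trees, as recommended
    learning_rate=0.05,  # shrinkage applied to each tree's contribution
    subsample=0.8,       # stochastic gradient boosting: sample rows for each tree
    random_state=0,
)
model.fit(X_train, y_train)
print("R^2 on test data:", round(model.score(X_test, y_test), 3))
```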
2.5.4 Extreme Gradient Boosting (XGBoost)
Extreme Gradient Boosting (XGBoost) is the most popular variant of the gradient tree boosting proposed by Friedman. Gradient tree boosting is an ensemble boosting method that combines a set of weak learners to create a strong learner, and both XGBoost and gradient tree boosting follow this principle. The key differences between them lie in the implementation details: XGBoost is based on function approximation by optimizing specific loss functions and also applies several regularization techniques. Moreover, XGBoost achieves better performance by controlling the complexity of the trees using different regularization techniques.
Let (x1, y1), (x2, y2), ..., (xn, yn) be a set of inputs and corresponding outputs. The tree ensemble algorithm uses K additive functions, each representing a CART, to predict the output. The predicted output is given by the sum of the individual function predictions:

ŷi = Σ_{k=1}^{K} fk(xi)
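In practice, an XGBoost regressor can be trained through its scikit-learn interface; the following minimal sketch shows the regularization knobs mentioned above, with illustrative hyper-parameter values rather than the ones tuned in this thesis:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBRegressor(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    reg_alpha=0.1,    # L1 regularization on leaf weights
    reg_lambda=1.0,   # L2 regularization on leaf weights
    subsample=0.8,
)
model.fit(X_train, y_train)
print("R^2 on test data:", round(model.score(X_test, y_test), 3))
```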