
Applying machine learning methods to predict real estate prices

DOCUMENT INFORMATION

Basic information

Title: Applying machine learning methods to predict real estate prices
Author: Nguyen Thi Thuy Dung
Supervisor: Dr. Truong Cong Doan
Institution: Vietnam National University, Hanoi International School
Major: Management Information System
Document type: Graduation project
Year of publication: 2024
City: Hanoi
Pages: 73
File size: 2.09 MB

Structure

  • CHAPTER I: INTRODUCTION
  • CHAPTER II: MATERIAL AND METHODS
    • I. Introduction to Data
      • 1. Concept
      • 2. Why use Kaggle for data science research
      • 3. Kaggle App
    • II. What is the data?
    • III. About Machine Learning
      • 2. Workflow in Machine Learning
      • 3. Machine Learning Methods
      • 4. Common Algorithms of Machine Learning
        • 4.2 Linear Regression
        • 4.3 Decision Tree
        • 4.4 Support Vector Regression (SVR)
      • 5. Practical Applications of Machine Learning
      • 6. Data Source and Data Preparation
      • 7. Difference Between Machine Learning and Deep Learning
      • 8. Why Machine Learning?
    • IV. Tools used in the problem
    • V. Model Evaluation
    • VI. How to tune the model
    • VII. Problem?
  • CHAPTER III: RESULTS
    • I. Overview
    • II. Solving the problem
    • III. Results of model improvement
  • CHAPTER IV: CONCLUSIONS AND DEVELOPMENT DIRECTIONS

Contents

Applying machine learning methods to predict real estate prices

INTRODUCTION

Chapter II (Material and Methods) introduces the formation and development of machine learning, its methods and practical applications, and the Kaggle platform, its competitions, and its data.

Chapter III presents the findings of the study, focused on predicting specific real estate prices through the application of machine learning models. The results demonstrate the effectiveness of these models in accurately estimating property values, highlighting the potential of machine learning in the real estate market.

Chapter IV presents the achieved results.

Chapter V (Conclusion and development directions) summarizes the work, draws lessons, and proposes future development directions.

Overview of real estate, artificial intelligence, and the application of artificial intelligence in the real estate business

Real estate encompasses land and permanently attached structures, including homes and commercial properties like offices, shops, factories, and warehouses. As a vital component of the economy, it significantly contributes to the GDP of numerous countries and plays a crucial role in enhancing the quality of life for individuals.

Real estate can be classified into many different types. Some main classifications include:

● Residential real estate: Includes single houses, apartments, villas, and other types of housing

● Commercial real estate: Includes offices, shops, shopping malls, hotels, and other types of real estate for business purposes

● Industrial real estate: Includes factories, warehouses, workshops, and assets related to manufacturing and logistics activities

● Agricultural real estate: Includes farmland, livestock land, forests, and other types of real estate for agricultural purposes

Real estate has unique characteristics that no other type of property shares, including:

● Location fixation: Real estate cannot be moved; it is tied to a specific geographical location

● Durability: Real estate has a long lifespan, which can last for generations

● Scarcity: Land is a limited resource that cannot be expanded according to demand

● Uniqueness: Each real estate property has its own unique characteristics, no property is exactly the same

1.3 Real estate value and the factors affecting it

The value of real estate depends on many factors, including:

● Location: This is the most important factor, greatly affecting the value of real estate. Prime locations close to the center with convenient transportation often command higher values

● Size and area: A large area and a size suited to usage needs also increase real estate value

● Legal status: Real estate with complete, clear legal documents is usually worth more

● Surrounding utilities: Being close to utilities such as schools, hospitals, parks, and supermarkets also increases the value of real estate

● Market: Market trends, supply and demand also greatly affect real estate value

2. Artificial intelligence and the application of artificial intelligence in the real estate business

Artificial Intelligence (AI) is a field within computer science focused on creating machines that can perform tasks typically requiring human intelligence. These tasks encompass learning from experience, recognizing voices, solving problems, making decisions, and forecasting outcomes. AI can be categorized into several distinct types:

● Narrow AI: Specializes in performing a specific task, such as playing chess

● Strong AI (General AI): Able to understand, learn, and apply knowledge in a general way like a human

● Superintelligent AI: Far beyond human intelligence in all fields

2.2 Application of AI in the real estate business

The application of artificial intelligence in the real estate sector enhances property sales, rentals, and management by streamlining processes, improving services, and informing decision-making.

1. Real estate price prediction: AI can analyze market data, economic trends, and other factors to predict the future value of real estate assets

2. Optimize the buying and selling process: Use AI to automate processes such as document checks, contract processing, and credit checks, saving time and money

3. Enhance customer experience: AI can provide personalized services through chatbots, virtual consulting systems, and advanced asset search engines

4. Efficient asset management: AI assists in asset maintenance, incident forecasting, and management cost optimization

5. Market analysis: AI is capable of analyzing data from a variety of sources to provide detailed and accurate market reports, assisting investors and buyers in decision-making

The real estate prediction problem exemplifies a key application of AI in the industry, aiming to forecast the value or development trends of real estate assets by analyzing historical and current data. To effectively address this challenge, various techniques, including machine learning, deep learning, and big data analytics, are frequently employed.

The main steps in the real estate prediction problem include:

1. Data collection: Gathering data from various sources such as transaction information, asset records, economic and social data, and more

2. Data preprocessing: Cleansing, normalizing, and transforming data to fit AI models

3. Model building: Using ML algorithms such as linear regression, decision trees, and artificial neural networks to build predictive models

4. Model evaluation: Using indicators such as MSE, RMSE, and MAE

5. Deployment and tracking: Applying the model in practice and continuously monitoring and updating it to improve accuracy

The application of this problem not only helps investors and buyers make smarter decisions but also supports real estate companies in planning business strategies.

MATERIAL AND METHODS

Introduction to Data

The data used in this project were taken from Kaggle.

Kaggle, owned by Google, is a premier online platform for data science and machine learning enthusiasts, serving as a vibrant community that connects a vast network of professionals in the field. It offers opportunities to engage with industry experts, tackle real-world challenges, and enhance your skills. Kaggle is an ideal environment for building innovative ideas, competing in data-driven contests, and gaining practical experience in data science and machine learning.

2. Why use Kaggle for data science research

Kaggle, owned by Google, serves as a vital online community and resource for data science and machine learning enthusiasts. It connects a diverse network of professionals and students, making it a premier hub for collaboration and learning in the field.

Kaggle stands out for its extensive collection of diverse datasets, offering over 50,000 options for users to access and utilize in model training. These datasets facilitate the practical application of theoretical knowledge, significantly enhancing the efficiency of the learning process.

Kaggle offers a comprehensive library of code snippets and templates tailored for various learning objectives, making it an invaluable resource for aspiring programmers and data scientists. Beginners are encouraged to start with Python, the most widely used programming language in data science, to familiarize themselves with essential coding patterns. For those at a more advanced level, Kaggle also provides code resources in R, Julia, and SQLite, catering to a broader range of programming needs.

More importantly, Kaggle presents sample code in a customizable Jupyter Notebook format, allowing you to edit the file and make the necessary changes.

Data science, while often perceived as complex, is more accessible than many realize, with various theories that can be better understood through practical courses available on Kaggle. These free courses not only cover essential data science concepts but also offer accredited certificates upon completion. For those seeking quicker learning options, Kaggle provides concise resources that eliminate the need for lengthy online courses. As a comprehensive online community, Kaggle allows data scientists to learn from peers, network, and showcase their work, which is crucial for building a professional reputation and enhancing job search prospects.

Participating in competitions is a great way to evaluate your skills and gain essential experience in data science. By successfully completing more tests, you can boost your confidence throughout your research journey. Kaggle offers numerous contests that allow you to challenge yourself against others while enhancing your resume's appeal. Additionally, many of these competitions feature cash prizes, making them even more enticing.

Kaggle.com offers free access to a variety of datasets, allowing users to engage in contests, explore sample projects, and showcase their work without any cost. Sign up today to create your account and start leveraging these valuable resources.

Kaggle, acquired by Google in 2017, is a dynamic community platform for data scientists and analysts, providing a wealth of resources. It hosts numerous data competitions and projects, fostering an engaging environment where analysts can tackle real-world challenges faced by various businesses and organizations.

Here are some of the main applications of Kaggle:

1. Data competitions: Kaggle organizes data competitions ranging from stock price predictions to predictions of pathologies based on medical data

2. Learning and practice: The platform provides learning materials and practice exercises with real-world datasets, helping you learn and apply data science skills

3. Data discovery and presentation: Kaggle allows you to explore public datasets and share your data with the community

4. Data model development: Data scientists can use Kaggle to develop and test data models on real-world datasets

5. Knowledge and experience sharing: The Kaggle community is large and active, with forums and Q&A pages where you can share your knowledge and experience and ask questions about data science

With these applications, Kaggle is not only a data problem-solving platform but also a strong learning and knowledge-sharing community.

What is the data?

Dataset description: Zillow Prize: Zillow's Home Value Prediction (Zestimate)

- The data includes the complete list of real estate data in three counties (Los Angeles, Orange, and Ventura, California) in 2016

- The training data includes all transactions before October 15, 2016, plus some transactions after October 15, 2016

properties_2016.csv - all properties with their home features for 2016. Note: some new 2017 properties don't have any data yet except for their parcels; those data points will be populated in properties_2017.csv.

'airconditioningtypeid' Type of cooling system present in the home (if any)
'architecturalstyletypeid' Architectural style of the home (i.e. ranch, colonial, split-level, etc.)
'basementsqft' Finished living area below or partially below ground level
'bathroomcnt' Number of bathrooms in home including fractional bathrooms
'bedroomcnt' Number of bedrooms in home
'buildingqualitytypeid' Overall assessment of condition of the building from best (lowest) to worst (highest)
'buildingclasstypeid' The building framing type (steel frame, wood frame, concrete/brick)
'calculatedbathnbr' Number of bathrooms in home including fractional bathroom
'decktypeid' Type of deck (if any) present on parcel
'threequarterbathnbr' Number of 3/4 bathrooms in house (shower + sink + toilet)
'finishedfloor1squarefeet' Size of the finished living area on the first (entry) floor of the home
'calculatedfinishedsquarefeet' Calculated total finished living area of the home
'finishedsquarefeet6' Base unfinished and finished area
'finishedsquarefeet50' Size of the finished living area on the first (entry) floor of the home
'fips' Federal Information Processing Standard code - see https://en.wikipedia.org/wiki/FIPS_county_code for more details
'fireplacecnt' Number of fireplaces in a home (if any)
'fireplaceflag' Is a fireplace present in this home
'fullbathcnt' Number of full bathrooms (sink, shower + bathtub, and toilet) present in home
'garagecarcnt' Total number of garages on the lot including an attached garage
'garagetotalsqft' Total number of square feet of all garages on lot including an attached garage
'hashottuborspa' Does the home have a hot tub or spa
'heatingorsystemtypeid' Type of home heating system
'latitude' Latitude of the middle of the parcel multiplied by 10e6
'longitude' Longitude of the middle of the parcel multiplied by 10e6
'lotsizesquarefeet' Area of the lot in square feet
'numberofstories' Number of stories or levels the home has
'parcelid' Unique identifier for parcels (lots)
'poolcnt' Number of pools on the lot (if any)
'poolsizesum' Total square footage of all pools on property
'pooltypeid10' Spa or Hot Tub
'pooltypeid2' Pool with Spa/Hot Tub
'pooltypeid7' Pool without hot tub
'propertycountylandusecode' County land use code, i.e. its zoning at the county level
'propertylandusetypeid' Type of land use the property is zoned for
'propertyzoningdesc' Description of the allowed land uses (zoning) for that property
'rawcensustractandblock' Census tract and block ID combined - also contains blockgroup assignment by extension
'censustractandblock' Census tract and block ID combined - also contains blockgroup assignment by extension
'regionidcounty' County in which the property is located
'regionidcity' City in which the property is located (if any)
'regionidzip' Zip code in which the property is located
'regionidneighborhood' Neighborhood in which the property is located
'roomcnt' Total number of rooms in the principal residence
'storytypeid' Type of floors in a multi-story house (i.e. basement and main level, split-level, attic, etc.)
'typeconstructiontypeid' What type of construction material was used to construct the home
'unitcnt' Number of units the structure is built into (i.e. 2 = duplex, 3 = triplex, etc.)
'yardbuildingsqft26' Storage shed/building in yard
'yearbuilt' The year the principal residence was built
'taxvaluedollarcnt' The total tax assessed value of the parcel
'structuretaxvaluedollarcnt' The assessed value of the built structure on the parcel
'landtaxvaluedollarcnt' The assessed value of the land area of the parcel
'taxamount' The total property tax assessed for that assessment year
'assessmentyear' The year of the property tax assessment
'taxdelinquencyflag' Property taxes for this parcel are past due as of 2015
'taxdelinquencyyear' Year for which the unpaid property taxes were due

Table 1: Dataset properties_2016

train_2016.csv - the training transactions from 1/1/2016 to 31/12/2016

With this dataset, there will be 3 main features: 'parcelid', 'logerror', 'transactiondate'

- Feature 'parcelid': the ID that links each transaction to the corresponding record in the properties file

- Feature 'logerror': the log of the predicted price minus the log of the actual sale price: logerror = log(Zestimate) − log(SalePrice)

- Feature 'transactiondate': the date of the transaction

The correlation matrix displayed in the image illustrates the relationships between pairs of attributes within the dataset, with correlation values ranging from -1 to 1. These values signify the strength and direction of linear relationships between attributes, providing insights into how closely related they are:

● 1: Perfect positive linear correlation

● -1: Perfect negative linear correlation

● 0: No linear correlation

Dark red cells indicate a strong correlation, while blue cells indicate a weak or negative correlation

- finishedsquarefeet13 and finishedsquarefeet15: Very high correlation, indicating that the two attributes may measure the same factor or are closely related to each other

- yardbuildingsqft17 and yardbuildingsqft26: Very high correlation, indicating a strong association between the yard areas (a sketch of computing such a matrix follows).
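
To illustrate, a correlation matrix like the one described can be computed with pandas and drawn with seaborn; this is a minimal sketch, and the file name is an assumption rather than the thesis's actual code.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the properties file (path is an assumption for this sketch)
props = pd.read_csv("properties_2016.csv")

# Correlation matrix over the numeric columns only
corr = props.select_dtypes(include="number").corr()

# Heatmap: dark red for strong positive, blue for weak/negative correlation
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix of property attributes")
plt.show()
```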

About Machine Learning

Machine Learning (ML) is a subset of Artificial Intelligence (AI) that enables computers to enhance their performance through training data and experience. This technology empowers machines to independently make predictions and decisions without explicit programming.

Deep Learning is a subset of Machine Learning, but the two differ in several important respects (see Section 7 below).

In general, Machine Learning has 5 important steps:

1. Data collection: In order for a computer to learn, you need a dataset, which you can collect yourself or take from previously published datasets

2. Preprocessing: this step is used to standardize the data, remove unnecessary attributes, assign data labels, encode some features, extract features, and reduce the data while still preserving results. This step is the most time-consuming and is proportional to the amount of data you have. Steps 1 and 2 typically account for more than 70% of the total implementation time

3. Train model: this is the step where you train the model, letting it learn from the data you collected and processed in the first two steps

4. Evaluate model: after training the model, we need to use measurements to evaluate it; depending on the metric used, the model may be judged good or not. A model whose accuracy exceeds 80% is generally considered good

5. Improve: after evaluation, models with poor performance are retrained or reconfigured, and the cycle repeats until the results are satisfactory

3. Machine Learning Methods

a) Supervised Machine Learning:

The majority of actual machine learning applications use supervised learning [6].

Supervised learning (SL) is where you have input variables (X) and output variables (Y), and you use algorithms to learn the mapping function from the input to the output.

The goal is to approximate the mapping function as well as possible so that when you have new input data (X), you can predict the output (Y) for that data.

Supervised learning involves an algorithm that learns from an input dataset, resembling a "teacher" who oversees the learning process. The two primary approaches to supervised learning are classification and regression.

- Classification: Classification occurs when the output variable is a certain category, such as "red" or "blue" or "disease" and "no disease"

- Regression: Regression occurs when the output variable is a real value, such as a price or a weight

Some common types of problems built on top of classification and regression include recommendation and time series prediction.

Some common examples of supervised machine learning algorithms are:

- Linear regression for regression problems

- The "Random Forest" principle for classification and regression

- Support vector systems for classification problems b, Unsupervised Machine Learning:

Unsupervised learning is a group of algorithms or techniques that allow machines to learn on their own and find a model or structure hidden in an unlabeled dataset

Unsupervised learning aims to analyze and model the underlying structure of a dataset, enhancing our understanding of its composition. This approach is designed to uncover and reveal valuable insights and patterns that may be hidden within the data. It is particularly useful for transactional data, where unsupervised methods are frequently employed.

Unsupervised ML is where you only have input data (X) and no corresponding output variables

There is no right answer and there is no "teacher" at all; algorithms are created just to discover and represent useful structures inside the data.

c) Semi-supervised Learning:

Figure 8: Semi-supervised learning model

Semi-supervised learning involves constructing a model using a substantial dataset (X) where only a portion of the data is labeled (Y). This approach sits between supervised and unsupervised learning.

Labeling data for machine learning typically necessitates the expertise of a skilled technician to manually classify training examples. This manual labeling process can be prohibitively expensive, making it impractical to label large datasets, whereas unlabeled data is generally more affordable.

In that situation, semi-supervised learning has great practical value.

Co-training is a notable example of semi-supervised machine learning, where multiple sets of learners are trained on the same dataset. Each learner utilizes a distinct set of features that are ideally independent from one another, enhancing the model's ability to generalize and improve performance. This approach effectively leverages the strengths of different characteristics to optimize learning outcomes.

d) Reinforcement Learning:

Reinforcement learning, a key subset of machine learning, focuses on selecting optimal actions to maximize rewards in specific scenarios. This approach is utilized by various software and machine learning applications to determine the most effective behaviors or paths to follow in given situations.

The environment is typically modeled as a finite Markov Decision Process (MDP), where reinforcement learning algorithms are closely linked to dynamic programming methods. In this framework, both the state transition probabilities and the reward probabilities are generally stochastic.

Unlike supervised learning, in reinforcement learning, there are no correct input/outcome data pairs, and near-optimal actions are not clearly evaluated for right and wrong

In this project, I employed Supervised Machine Learning (SML) to develop a predictive model that classifies data using labeled input-output pairs. This approach allowed me to harness the model's comprehensive understanding, enabling accurate and reliable predictions for new data.

4. Common Algorithms of Machine Learning

4.1 Random Forest

Random forest is a supervised learning algorithm that uses decision trees as its foundation. A random forest is a set of decision trees, each of which is built according to an algorithm based on randomness.

Decision Tree is the name for a family of algorithms built on tree-structured models.

In a decision tree, each node represents an attribute, while the branches indicate the possible values for that attribute. By navigating through the attribute values, the decision tree predicts the outcome effectively.

Random Forests is a highly accurate and robust machine learning method, with its performance significantly influenced by the number of trees utilized in the model. This algorithm effectively mitigates overfitting issues, making it suitable for both classification and regression tasks. There are two approaches to manage the output values generated by the Random Forests algorithm:

● Use averages to replace continuous variables

● Calculate the approximate average of the missing values

In the Random Forest algorithm, individual decision trees are built using subsets of the training data and features, which can lead to each tree not performing optimally. As a result, while these trees are designed to avoid overfitting, they may instead suffer from underfitting, impacting their overall predictive accuracy.

When considering a purchase on Tiki, it's essential to read multiple product reviews rather than relying on a single opinion, as one review may reflect a subjective experience or an isolated defect. To gain a comprehensive understanding of the product's quality, potential buyers should examine all available reviews before making a final decision.
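
As a minimal sketch of how such a forest is trained with scikit-learn (using synthetic stand-in data rather than the thesis's dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the thesis uses the Zillow features instead
X, y = make_regression(n_samples=1000, n_features=10, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A forest is a set of decision trees, each trained on random subsets of rows
# and features; predictions are averaged, like reading many product reviews
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("R^2 on the test set:", model.score(X_test, y_test))
```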

Tools used in the problem

Python is a versatile programming language widely utilized in web applications, software development, data science, and machine learning. Its efficiency and ease of learning make it a popular choice among developers, and it supports multiple platforms, enhancing its accessibility and usability.

Python enables programmers to create concise and easily understandable code. While machine learning (ML) and artificial intelligence (AI) often introduce complexity through intricate algorithms and adaptable workflows, Python's straightforward nature provides a solution for developing more reliable systems.

In this problem, I use Python for most of the steps in the process of building a Machine Learning project: fetching data, processing data, and running demos.

Scikit-learn (sklearn) is the most powerful library for machine learning algorithms written in the Python language [11].

The library provides a set of tools for machine learning and statistical modeling problems, including "classification, regression, clustering, and dimensionality reduction."

The library is distributed under a BSD-style license and runs on many platforms, including Linux. Scikit-learn is widely used as a learning resource.

Scikit-learn also provides strong support for building production systems.

The library focuses on modeling data. It does not focus on loading, transforming, or summarizing data [11]; those jobs are left to the NumPy and Pandas libraries.

Here are some of the groups of algorithms built by the scikit-learn library:

● Clustering: Grouping unlabeled data with clustering algorithms, for example the KMeans algorithm

● Cross Validation: Cross-validation, evaluating the effectiveness of supervised learning algorithms on held-out validation data during model training

● Datasets: Includes a group of datasets that are built into the library. Almost all datasets have been standardized and bring high performance during training, such as iris, digits, etc.

● Dimensionality reduction: A key technique aimed at minimizing the number of significant attributes in a dataset through methods like aggregation, data representation, and feature selection. One prominent example is Principal Component Analysis (PCA), which effectively simplifies data while retaining essential information

● Ensemble methods: Combining the predictions of multiple learning algorithms

● Feature selection: Selecting meaningful features in the data for building supervised models

● Parameter Tuning: Fine-tuning parameters; algorithms for selecting appropriate parameter values to optimize the model

● Manifold Learning: Algorithms for summarizing and representing complex multi-dimensional data

● Supervised Models: Supervised learning, a large share of today's machine learning algorithms (see the sketch below)
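
To make a few of these groups concrete, here is a small sketch using the built-in iris dataset; it touches Datasets, Dimensionality reduction (PCA), Clustering (KMeans), and Cross Validation, and is illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)           # Datasets: a built-in, standardized dataset

X2 = PCA(n_components=2).fit_transform(X)   # Dimensionality reduction with PCA

labels = KMeans(n_clusters=3, n_init=10).fit_predict(X2)  # Clustering on unlabeled data

# Cross Validation: 5-fold evaluation of a supervised model
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=5)
print("Mean accuracy:", scores.mean())
```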

Developers utilize Matplotlib to create high-quality two-dimensional (2D) and three-dimensional (3D) graphics for data visualization. This powerful library allows for the simultaneous display of multiple charts, with the added flexibility of moving graphic details across various platforms.

● Figure: A window frame that contains everything we draw on it

● Axes: Essential components of a figure, serving as smaller frames for drawing. While figures act as containers, it is the axes that hold the actual drawings, and a figure can consist of one or multiple axes

● Axis: Number-line-like objects that define the limits of the chart

● Artist: Everything you can see on a figure is an artist, such as Text objects, Line2D objects, collection objects

Utilizing the Matplotlib library enables effective data visualization, allowing programmers to gain insights into data distribution, which aids in selecting appropriate methods for processing and model building.
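
A short sketch of this Figure/Axes/Axis/Artist hierarchy, with made-up data:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))  # one Figure containing two Axes
axes[0].plot(x, np.sin(x))           # a Line2D artist drawn on the first Axes
axes[0].set_title("sin(x)")          # a Text artist
axes[1].hist(np.random.randn(500))   # histogram on the second Axes
axes[1].set_xlim(-4, 4)              # the Axis objects define the chart limits
fig.suptitle("One figure, multiple axes")
plt.show()
```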

Pandas is a powerful and versatile Python library designed for data analysis and manipulation, making it an essential resource for processing, filtering, and aggregating various types of data. Built on the Python programming language, Pandas facilitates the extraction of insights from data and is recognized as one of the top tools in data science research. Its fast, flexible, and user-friendly open-source code enhances the efficiency of data processing and analysis tasks.

Components: The two main components of Pandas are Series and DataFrame. A Series is basically a column, and a DataFrame is a multidimensional table made up of a set of Series.

Figure 17: Series and DataFrames in Pandas

There are many ways to create a new DataFrame; one good option is to use a dict, as sketched below.
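
A minimal sketch; the column names echo the Zillow fields, but the values are invented for illustration.

```python
import pandas as pd

# Each dict key becomes a column (a Series); values are illustrative only
data = {
    "bedroomcnt": [3, 2, 4],
    "bathroomcnt": [2.0, 1.0, 3.0],
    "taxvaluedollarcnt": [450000, 320000, 780000],
}
df = pd.DataFrame(data)
print(df.head())          # first rows of the table
print(df["bedroomcnt"])   # a single column is a Series
```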

Introduction: NumPy is a popular library that developers use to easily create and manage arrays, manipulate array shapes and logical operations, and perform linear algebra operations

NumPy supports integration with multiple languages such as C and C++

NumPy, short for Numeric Python, is a highly regarded and robust mathematical library in Python. It offers optimized functions that facilitate efficient manipulation of matrices and arrays, significantly enhancing processing speeds, particularly for large datasets, compared to standard Python operations. Its capabilities include the following (a brief sketch follows the list):

● Mathematical and logical operations on the array

● Fourier transforms and processes for manipulating shapes

● Linear algebra operations. NumPy has built-in functions for linear algebra and random number generation

● NumPy – A Good Alternative to MATLAB
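
A brief sketch of these capabilities (array math, shape manipulation, linear algebra, Fourier transforms, and random numbers):

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])

print(a * 2 + 1)                  # elementwise mathematical operations
print(a.reshape(4))               # shape manipulation
print(np.linalg.inv(a))           # linear algebra: matrix inverse
print(np.fft.fft([1, 0, 1, 0]))   # Fourier transform
rng = np.random.default_rng(0)
print(rng.normal(size=3))         # random number generation
```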

Introduction: Seaborn is one of the most highly rated Python libraries in the world, built with the aim of creating beautiful visualizations. Its main features include the following (a short sketch follows the list):

● Pre-built themes for styling matplotlib graphics

● Tool to select color palettes to create beautiful graphs that represent patterns in your data

● Functions to display single and two-variable distributions or to compare them between subsets of data

● Tools to fit and visualize linear regression models for different types of independent and dependent variables

● Functions represent data matrices and use clustering algorithms to explore the structure in those matrices

● Functions to plot time series statistics with flexible estimation and representation of the uncertainty around the estimate

● Allows you to easily construct complex images through high-level abstractions for the grid structure of plots.
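
As a small illustration, the sketch below uses seaborn's bundled 'tips' example dataset to show a pre-built theme, a distribution plot, and a regression plot; it is not the thesis's actual code.

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")   # one of the pre-built themes
tips = sns.load_dataset("tips")    # small example dataset shipped with seaborn

sns.histplot(tips["total_bill"], kde=True)        # single-variable distribution
plt.show()

sns.regplot(x="total_bill", y="tip", data=tips)   # linear regression view
plt.show()
```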

Model Evaluation

Model evaluation is a crucial phase in developing a machine learning model, as it enhances our understanding of the model's performance and informs necessary adjustments to improve prediction accuracy. Common metrics used for model evaluation include accuracy, precision, recall, and the F1 score, which collectively provide insights into the model's effectiveness and reliability.

Accuracy measures the proportion of correct predictions to the total predictions made, but it may not provide an accurate assessment of a model's performance, particularly when dealing with imbalanced datasets.

The confusion matrix offers a comprehensive analysis of a model's performance by juxtaposing predicted values against actual outcomes. It comprises four key elements: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

This matrix helps us better understand the errors encountered by the model and identify areas for improvement

Precision is the ratio of the number of correct positive predictions to the total number of predictions the model considers positive. Precision focuses on minimizing the number of false positives.

Recall is the ratio of the number of correct positive predictions to the total number of cases that are actually positive. Recall focuses on minimizing the number of false negatives [3].

The F1-Score is a crucial metric that represents the harmonic average of Precision and Recall, offering a balanced assessment when both factors need to be optimized. It is particularly beneficial in scenarios involving skewed data, as it ensures that both Precision and Recall are taken into account for a comprehensive evaluation.

6. AUC-ROC (Area Under the Receiver Operating Characteristic Curve):

AUC-ROC is a crucial metric for assessing the effectiveness of binary classification models, representing the area under the ROC curve. An AUC value nearing 1 signifies strong model performance, whereas an AUC around 0.5 suggests the model's predictive ability is comparable to random guessing.

- False Positive Rate (FPR) formula: FPR = FP / (FP + TN)

G-Mean is a measure used to evaluate the model on unbalanced datasets. It calculates the geometric mean of the True Positive Rate and the True Negative Rate, G-Mean = sqrt(TPR × TNR), which helps balance performance between the classes.

Explained variance measures the extent to which the model can explain the variability of the data.

- Formula: Explained Variance = 1 − Var(y − ŷ) / Var(y), where y is the actual value and ŷ is the predicted value

The Mean Absolute Error (MAE) quantifies the average absolute differences between predicted and actual values, serving as a visual indicator of the model's average error.

- Formula: MAE = (1/n) Σ |yᵢ − ŷᵢ|, where n is the number of samples, yᵢ is the actual value, and ŷᵢ is the predicted value

MSE is the average of the squared errors between the predicted and actual values: MSE = (1/n) Σ (yᵢ − ŷᵢ)². This metric is sensitive to outliers, so large errors have more influence.

R² is a statistical metric that measures the extent to which independent variables account for the variability of dependent variables in a regression model

- Formula: R² = 1 − Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)², where ȳ is the average of the actual values

To comprehensively evaluate the performance of regression models, key metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared (R²), and Explained Variance are utilized. In my research on regression, MAE and MSE offer valuable insights into prediction errors, while R² and Explained Variance assess how well the model explains the data. This project employed these measurements to compare various models, ultimately aiding in the selection of the most optimal model for predicting real estate prices.
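
All of these regression metrics are available in sklearn.metrics; a minimal sketch with invented values:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, explained_variance_score)

y_true = np.array([250000, 310000, 180000, 420000])   # actual prices (made up)
y_pred = np.array([245000, 330000, 200000, 400000])   # model predictions (made up)

mse = mean_squared_error(y_true, y_pred)
print("MAE :", mean_absolute_error(y_true, y_pred))
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("R^2 :", r2_score(y_true, y_pred))
print("Explained variance:", explained_variance_score(y_true, y_pred))
```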

How to tune the model

Model tuning involves adjusting the parameters and hyperparameters of a machine learning model to enhance its performance. This crucial process ensures that the model accurately fits the training data while maintaining strong generalization on unseen data. The following steps outline an effective approach to model tuning.

1. Select the parameters to tune:

- Parameters: These are values learned from data during training, such as weights in a neural network

- Hyperparameters: These are parameters that are set before the training process starts and are not learned from the data

2. Determine the range and values of the hyperparameters:

- Grid Search: Create a grid of possible values and test each possible combination. This is a comprehensive method but can be computationally expensive

- Random Search: Test random combinations of hyperparameters. This method can be faster than Grid Search and sometimes gives better results

- Bayesian Optimization: Uses Bayesian optimization techniques to pick the best hyperparameters based on previous tests

3. Use cross-validation:

- K-fold cross-validation is a robust technique that involves splitting the dataset into k subsets. In this method, k-1 subsets are used for training the model, while the remaining subset serves as the test data. This process is repeated k times, ensuring that each subset is used once as the test data, which improves the reliability of the performance evaluation.

- Leave-One-Out Cross-Validation (LOOCV): A special case of k-fold cross- validation with k equal to the number of data samples

4. Evaluate the performance of the model:

- Use metrics such as MAE, MSE, and R-squared to evaluate and compare the performance of different models

- Explained Variance: Measures how well the model can explain the data

5. Select, verify, and deploy the best model:

- After performing the tuning process, select the model with the best performance based on the evaluation indicators

- Retest the selected model on an independent test dataset to ensure that it generalizes well

- After selecting the optimal model, deploy the model into the actual system

- Gather feedback from users and new data to continue refining and improving the model over time

In a real estate price prediction project, the model tuning process can include:

● Experiment with different algorithms such as linear regression, random forests, and neural networks

● Use the k-fold cross-validation technique to evaluate the performance of each model and select the one with the best results (see the sketch below)

These detailed steps help ensure that the final model is not only accurate, but also stable and reliable on new data.
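
A sketch of this tuning loop, combining Grid Search with k-fold cross-validation on synthetic data; the parameter grid here is an assumption for illustration, not the thesis's actual search space.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_regression(n_samples=500, n_features=8, noise=0.3, random_state=0)

# Hypothetical grid; real ranges depend on the data
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),  # k-fold cross-validation
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV MAE:", -search.best_score_)
```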

RESULTS

Overview

Zillow's Zestimate has transformed the U.S real estate landscape over the past 11 years by providing consumers with free, detailed insights into home values and market trends, enabling them to make informed buying and selling decisions.

Provide more accurate home value estimates: Enhance the accuracy of Zestimate so that consumers have more accurate information about real estate values

Enhanced Predictability: Develop algorithms capable of accurately predicting future home selling prices

Improve useful tools for consumers: Create a reliable and free tool for consumers to monitor and evaluate the value of their homes

Promoting the integration of machine learning (ML) and advanced data analytics in the real estate sector enhances market transparency and efficiency. By leveraging these innovative technologies, the industry can improve decision-making processes and optimize property management, ultimately benefiting both buyers and sellers.

Solving the problem

The "poolcnt" feature serves as an index indicating the number of swimming pools present in a residential property When the "poolcnt" value is missing (NaN), it typically suggests that the property lacks a swimming pool To enhance clarity, it is advisable to replace the NaN value with 0, thereby clearly indicating the absence of a swimming pool in the property.

The "hashottuborspa" feature identifies the presence of a hot tub or spa in a home If the value is NaN, it typically indicates that there is no hot tub or spa, which can be updated to 0 for clarity Conversely, if the value is "true," it should be changed to 1 to confirm the existence of a hot tub or spa.

The features 'pooltypeid10', 'pooltypeid2', and 'pooltypeid7' denote specific types of pools or hot tubs associated with a property. Information is only available when these values are not null; thus, we can substitute NaN values with 0 to signify the absence of a corresponding pool or hot tub for that house.

Also, since "pooltypeid10" provides the same information as "hashottuborspa", we can remove it from the dataset

❖ The feature 'fireplacecnt' indicates the number of fireplaces in a house, and the 'fireplaceflag' indicates whether the house has a fireplace To process data:

1. Replace the NaN value of 'fireplacecnt' with 0 to indicate that there is no fireplace in that house

2. For 'fireplaceflag', replace the 'True' value with 1 and the NaN value with 0 to indicate whether the house has a fireplace

3. If the number of fireplaces ('fireplacecnt') of the house is at least 1, set the 'fireplaceflag' to 1 to indicate that the house has at least one fireplace

❖ Here's how the data is handled for the tax-related features (a pandas sketch follows this list):

1. The 'taxdelinquencyflag' indicates the tax debt status of a property parcel: 'NaN' signifies the absence of tax debt information, while 'Y' denotes that property taxes have been overdue since 2015

2. 'structuretaxvaluedollarcnt' refers to the assessed value of the building structure on the lot, where 'NaN' indicates no information

3. 'landtaxvaluedollarcnt' describes the assessed value of the land area of the parcel, with 'NaN' also indicating no information

● Replace the 'NaN' values of 'taxdelinquencyflag', 'structuretaxvaluedollarcnt', and 'landtaxvaluedollarcnt' with '0' to indicate that there is no information for these features

This helps to clean and standardize the data, making it easier to analyze and process the data later
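
In pandas, the handling described above might look roughly like this; the file name is an assumption, and the code is a sketch of the described rules rather than the thesis's exact script.

```python
import pandas as pd

props = pd.read_csv("properties_2016.csv")  # path is an assumption

# Pools: NaN means no pool / no matching pool type
for col in ["poolcnt", "pooltypeid2", "pooltypeid7"]:
    props[col] = props[col].fillna(0)

# Hot tub flag: True -> 1, NaN -> 0; pooltypeid10 duplicates it, so drop it
props["hashottuborspa"] = props["hashottuborspa"].fillna(0).replace(True, 1)
props = props.drop(columns=["pooltypeid10"])

# Fireplaces: NaN count -> 0; the flag is 1 whenever at least one fireplace exists
props["fireplacecnt"] = props["fireplacecnt"].fillna(0)
props["fireplaceflag"] = (props["fireplacecnt"] >= 1).astype(int)

# Tax features: NaN -> 0 to mean "no information"
for col in ["taxdelinquencyflag", "structuretaxvaluedollarcnt", "landtaxvaluedollarcnt"]:
    props[col] = props[col].fillna(0)
```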

Here's how the data is handled for the features:

1. 'finishedsquarefeet6': This is the base unfinished and finished area. Since it rarely appears, and when it does it equals 'calculatedfinishedsquarefeet', we will remove it

2. 'finishedsquarefeet12': This is the area of the finished living area. We will remove it because a lot of its data is missing

3. 'finishedsquarefeet13': This is the perimeter of the finished living area. We will also remove it due to a lot of missing data

4. 'finishedsquarefeet15' represents the total area of the house. It has the lowest occurrence rate, with most values matching 'calculatedfinishedsquarefeet', so we will fill its NaN values with the corresponding value from 'calculatedfinishedsquarefeet'

5. 'finishedsquarefeet50' represents the finished living area on the first floor of a house. Where the house has only one floor, NaN values are substituted with 'calculatedfinishedsquarefeet'; any remaining NaN values are filled with the average value of this feature

The bar chart illustrates the quantity of missing values across each column in the dataset. To address columns with a significant number of missing values, one effective strategy is to eliminate those that surpass a predetermined threshold.

For columns with a missing value rate greater than 99%, we remove them: 'storytypeid', 'basementsqft', and 'yardbuildingsqft26'. A sketch of this step follows.
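
As referenced above, a sketch of dropping the columns whose missing-value rate exceeds the threshold; the file name is an assumption.

```python
import pandas as pd

props = pd.read_csv("properties_2016.csv")   # path is an assumption

missing_ratio = props.isnull().mean()        # fraction of missing values per column
to_drop = missing_ratio[missing_ratio > 0.99].index
props = props.drop(columns=to_drop)          # e.g. storytypeid, basementsqft, yardbuildingsqft26
print("Dropped:", list(to_drop))
```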

Feature creation involves creating new features or modifying existing features to improve the model's performance

The 'taxamount' attribute, representing the total property tax for the assessment year, can often be omitted to avoid redundancy, mitigate multicollinearity, and simplify the model By removing this attribute, the model becomes more stable and efficient, ultimately conserving processing resources, especially when faced with numerous missing values.

Model initiation: Several models (Linear Regression, Decision Tree Regressor, and Random Forest Regressor) are initialized

Data division: The dataset is divided into a training set and a test set with train_test_split; in this code, I use 80% of the data for training and 20% for testing (test_size=0.2)

Model training: The model is trained on the training data

Prediction: Prediction is made on test data

Metric calculation: The performance of the model is evaluated using MSE and the other metrics described above (MAE, R², explained variance)

Results: The trained model and its evaluation metrics are returned and printed. A sketch of this pipeline follows.
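
The pipeline just described might look like the following sketch in scikit-learn; synthetic data stands in for the prepared Zillow features.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Stand-in data; the thesis uses the cleaned Zillow features and price target
X, y = make_regression(n_samples=2000, n_features=20, noise=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Linear": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)     # train on the 80% split
    pred = model.predict(X_test)    # predict on the 20% split
    print(name,
          "MAE:", mean_absolute_error(y_test, pred),
          "MSE:", mean_squared_error(y_test, pred),
          "R^2:", r2_score(y_test, pred))
```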

III. Run the prediction models and the prediction accuracy results

The results of evaluating three different models on the "prop 2016 data" dataset are shown below, with the number of rows and columns provided.

- prop 2016 data: This is the name of the dataset used for training and evaluating the models. This dataset has 2,985,217 rows and 43 columns.

[Table: evaluation results for the Linear Regression, Decision Tree, and Random Forest models]

The results demonstrate the models' performance on the evaluated dataset, providing insights into their predictive capabilities. This assessment is crucial for identifying the most suitable model for the specific problem at hand.

● MAE: The average error between the predicted value and the actual value

● MSE: The average squared error between the predicted value and the actual value

● ME: The largest error between the predicted value and the actual value

● R^2: The coefficient of determination is a measure to evaluate the suitability of the model

● Explained Variance: A measure that indicates the extent to which the model explains the variability of the data

● RMSE (Root Mean Squared Error) is one of the two main performance indicators for a regression model

Linear regression demonstrates the lowest Mean Absolute Error (MAE), yet its low R^2 and Explained Variance suggest it may not adequately represent the data. In contrast, the Decision Tree Regressor exhibits higher MAE and Mean Squared Error (MSE), along with a negative R^2, indicating a lack of consistency with the dataset. Meanwhile, the Random Forest model shows relatively low MAE and MSE but also presents a negative R^2, suggesting that additional effort is needed to enhance its performance.

Results of model improvement

To optimize model performance, I tune the "n_estimators" and "random_state" parameters. Adjusting these values effectively leads to improved model outcomes, as sketched below.
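
A sketch of that tuning step, varying n_estimators with a fixed random_state on synthetic stand-in data; the values tried are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=1000, n_features=10, noise=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Try several forest sizes; random_state keeps the runs reproducible
for n in [50, 100, 200, 400]:
    model = RandomForestRegressor(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"n_estimators={n}: MAE={mae:.3f}")
```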

In the course of this thesis, I have achieved important results, from proactively collecting and analyzing data to applying preprocessing techniques and developing a real estate price prediction model.

The first and most important step in developing an ML model is data collection. I have proactively approached different data sources to gather information on real estate prices. Data sources include:

Using web scraping tools, I gathered data from popular real estate websites, which offer comprehensive details about properties, including their size, age, and location.

- Data from public reports and databases: I also use data from real estate market reports and public databases to supplement and validate data collected from the web

Proactively collecting data helps me get a diverse and rich dataset that ensures high accuracy and reliability

After collecting the data, the next step is to analyze and preprocess it. This process includes:

- Initial data analysis: I use data visualization tools such as matplotlib and seaborn to analyze the characteristics of the data, identify trends, and relationships between variables

- Handling missing values: Missing values in the data can bias the model. I used techniques such as replacing the missing value with a suitable substitute (0, the mean, or the value of a related column), depending on each specific data column

- Data normalization and standardization: To ensure all variables have the same range of values and do not affect the model, I have applied data normalization and standardization techniques

- Categorical variable encoding: Categorical variables are converted into numerical variables through techniques such as one-hot encoding so that machine learning models can process them

4.3 Understanding and Applying Machine Learning Algorithms

In this section, I have researched and gained a good understanding of the common ML algorithms used to predict real estate prices. The algorithms include:

- Linear Regression: Simple but effective

- Decision Tree: A powerful, easy-to-understand, and explainable nonlinear model

- Logistic Regression: Used for binary classification problems, extended to handle real estate price prediction problems by groups

- Artificial Neural Networks: Complex and powerful models capable of learning from big and non-linear data

4.4 Parameter Calibration and Model Development

Calibrating model parameters is crucial for achieving optimal performance in machine learning. Techniques like Grid Search and Random Search are employed to identify the best parameters for these models. This systematic approach ensures that the model is fine-tuned for accuracy and efficiency.

- Select the parameters to calibrate: For each model, I define the key parameters that affect the model's performance

- Search for optimal parameters: Use Grid Search and Random Search to test different parameter combinations and pick the best set of parameters

- Model evaluation: Use measurements such as MAE, MSE, RMSE, and R² to evaluate the performance of the regression models; for the classification models, metrics such as accuracy, precision, recall, and F1-score are used

4.5 Best Model Review and Selection

Based on the model evaluation measurements, I choose the model with the best performance to apply to the prediction of real estate prices. Selection criteria include:

- Accuracy: The model has the highest accuracy in predicting real estate prices

- Explainability: The model has the ability to explain its predictions in an understandable way

- Stability: The model has good generalization, is not overfitting, and has stable performance across different test data sets

4.6 Applying the real estate price prediction model

I successfully implemented an optimal model for predicting real estate prices. The findings indicate that the model effectively forecasts property values by analyzing key factors, including the area, number of bedrooms, house age, and location.

During my dissertation, I developed skills in proactive data collection and analysis, applied preprocessing techniques, and gained a deep understanding of machine learning algorithms. I also learned how to calibrate parameters and evaluate models comprehensively. As a result, I created a highly accurate real estate price prediction model that delivers reliable predictions and aids decision-making in the real estate industry.

4.7 Using machine learning to predict real estate prices raises a number of ethical issues and challenges

Bias and fairness: Machine learning models can unintentionally learn and perpetuate biases present in historical data, leading to unfair predictions that can disadvantage certain groups or regions

Transparency: The black-box nature of some machine learning models can make interpretation of predictions difficult, leading to a lack of transparency and trust from stakeholders

Privacy: Using detailed personal or transactional data to train models raises privacy concerns, requiring strict data protection measures

Market Manipulation: Accurate predictions can be exploited by investors to manipulate market prices, which can lead to bubbles or market crashes

Regulatory compliance: Ensuring machine learning models comply with local and international regulations, such as data protection laws and fair housing regulations, is important

Addressing these ethical issues requires applying fairness-aware machine learning techniques, enhancing model explainability, ensuring strong data protection practices, and compliance with regulatory standards

CONCLUSIONS AND DEVELOPMENT DIRECTIONS

In this project, I utilized machine learning techniques to predict real estate prices. The research and implementation process was thorough, allowing me to systematically achieve the set objectives. The findings demonstrate that machine learning models can effectively forecast real estate prices, particularly when employing data preprocessing methods and selecting appropriate models.

The main steps I took include:

I have gathered data from multiple sources, such as real estate websites and public databases, and subsequently processed and standardized it to guarantee quality and reliability.

- Model development: I have applied a variety of machine learning models, from simple models such as linear regression to more complex models such as neural networks and decision trees

- Model evaluation: I used performance metrics such as MAE, RMSE, and R² to compare and select the most optimal model

Research indicates that advanced models like Random Forest and Gradient Boosting outperform simpler models in predictive accuracy. Moreover, incorporating various data sources and key factors influencing real estate prices, such as geographic location, available utilities, and market trends, significantly enhances model precision.

My real estate prediction machine learning model is currently limited to terminal-based predictions, necessitating users to possess programming skills and familiarity with execution environments This complexity makes it challenging for many individuals to access and utilize the model effectively.

In the future, I aim to transform this model into a user-friendly web application, enabling users to effortlessly input data, view predicted outcomes, and engage with the system without needing technical expertise Transitioning to a web platform will not only broaden the user base but also enhance the convenience and efficiency of utilizing the real estate prediction model.

This will contribute to improving the user experience and wider application in the real estate sector

Based on the results achieved, I propose some future development directions to improve the predictive performance and expand the scope of application of this research:

To enhance the breadth of my research, I am currently concentrating on the California real estate market (Los Angeles, Orange, and Ventura counties). In the future, I plan to extend my analysis to additional markets, such as New York, San Francisco, and international regions. This approach will allow for the validation of the model's applicability and enable necessary adjustments to cater to the unique characteristics of each area.

- Testing advanced algorithms: I will research and apply more advanced machine learning algorithms such as Deep Learning and Reinforcement Learning to explore the potential to improve predictive efficiency

To enhance model accuracy and reliability, it is essential to integrate a diverse range of data sources, including economic trends, housing policies, infrastructure information, and various social factors By enriching datasets with this comprehensive information, we can significantly improve the overall quality of the analysis.

I am developing an online software tool that enables users to input real estate information and receive instant value predictions This innovative system provides users with a comprehensive overview of property values, aiding them in making informed buying and selling decisions.

To ensure the system operates efficiently and meets user needs, I will gather feedback for ongoing evaluation and improvement Regular maintenance and updates of the model will enable the system to adapt effectively to shifts in the real estate market.

This research not only fulfills its initial objectives but also advances the field of real estate price prediction, paving the way for numerous future opportunities and applications.

References

[1] Phạm H. (2020, May 28). Machine learning là gì? Deep learning là gì? Sự khác biệt giữa AI, machine learning và deep learning. Quantrimang.com.

[2] Vu, T. (2017, February 6). Bài 11: Giới thiệu về Feature Engineering. Tiep Vu's Blog.

[3] Vu, T. (2018, January 3). Bài 33: Các phương pháp đánh giá một hệ thống phân lớp. Tiep Vu's Blog.

[4] Improve, G. (2017, October 15). Confusion matrix in machine learning. GeeksforGeeks.

[5] What is machine learning (ML)? (2021, September 22). Ibm.com.

[6] Supervised machine learning: Regression and classification. (n.d.). Coursera. Retrieved June 16, 2024.

[7] Machine Learning Là Gì? Phân Loại & Ứng Dụng Của ML - IPC247. (2023, June 7). Nhà cung cấp Máy tính công nghiệp IPC247.

[8] Data: Zillow Prize: Zillow's home value prediction (Zestimate). (n.d.). Kaggle.com. Retrieved June 16, 2024.

[9] Introduction to NumPy. (n.d.). W3schools.com. Retrieved June 16, 2024.

[10] pandas. (n.d.). Pydata.org. Retrieved June 16, 2024.

[13] [Kaggle là gì?] Những thông tin về Kaggle bạn không nên bỏ qua! (n.d.). Kênh tuyển dụng việc làm. Retrieved June 16, 2024.

[14] Python, R. (2024, March 13). Visualizing data in Python with Seaborn. Realpython.com; Real Python.

[15] Xuân D. (2020, December 17). Tự học ML. Cafedev.vn.

[16] (N.d.). Researchgate.net. Retrieved July 8, 2024.
