
Thesis: Apply Random Forest Technique for Prediction Problem



Structure

  • 1.1 Overview
  • 1.2 Content and purpose
    • 1.2.1 Content
    • 1.2.2 Purpose of Demand Forecast
  • 1.3 Scope
  • 1.4 Structure of report
  • 2.1 Theoretical foundations
    • 2.1.1 Decision Tree
    • 2.1.2 Random Forest Algorithm
  • 2.2 Related research
    • 2.2.1 K-Means Algorithm
    • 2.2.2 Prophet Algorithm
    • 2.2.3 Naive Bayes Algorithm
    • 2.2.4 K-Nearest Neighbors Algorithm
  • 3.1 Problem specification
    • 3.1.1 Basic issues about securities
    • 3.1.2 Some information on stock price
  • 3.2 Proposed methods
    • 3.2.1 Methods used in Thesis
    • 3.2.2 The output of experiments
  • 4.1 Hardware and Dataset
    • 4.1.1 Hardware
    • 4.1.2 Dataset
  • 4.2 Experiment and evaluate the results
    • 4.2.1 The structure of Random Forest
    • 4.2.2 The working of model
    • 4.2.3 Split data
    • 4.2.4 Evaluate the results
    • 4.2.5 Comparison with Random Forest Regressor of Scikit-learn
  • 4.3 Random Forest Model Improvement
    • 4.3.1 Result after improving model
    • 4.3.2 Evaluation
  • 4.4 Build dashboard that visualizes results
  • 5.1 Advantages and Disadvantages of the Proposed Method
    • 5.1.1 Advantages
    • 5.1.2 Disadvantages
  • 5.2 Future expansion direction
  • 5.3 Source Code


1.1 Overview

When it comes to customer demand, Steve Jobs famously expressed the need for Apple to anticipate customers' desires before they even knew it themselves, by "reading things that are not yet on the page." Although predicting the future may seem impossible, anticipating demand is crucial for supply chain planning. Demand forecasters aim to achieve this by utilizing predictive analysis techniques to extract insights from sales data to predict customers' future needs.

However, generating predictions is only the first step. Successful demand forecasting requires incorporating those insights into decision-making about product direction, pricing, company expansion, and hiring, and avoiding the pitfall of merely striving for faster horses.

This topic will delve into demand forecasting, including techniques, benefits, and examples, with one such technique being the random forest algorithm.

1.2 Content and purpose

1.2.1 Content

A prediction problem is a type of problem in machine learning and data science where the goal is to predict an output variable or a response variable based on one or more input variables or features.

In a prediction problem, the model is trained on a set of input-output pairs called a training set, and the goal is to use this model to predict the output for new, unseen inputs that the model has not been trained on. The quality of the predictions is evaluated using a set of metrics that measure the accuracy or the goodness of fit of the model.

Examples of prediction problems include:

- Predicting the price of a house based on its features such as location, size, number of rooms, etc.

- Predicting a customer’s probability of churning based on their behavior and demographics.

- Predicting the likelihood of a patient developing a disease based on their medical history and lifestyle factors.

- Predicting the next word in a sentence based on the previous words.

- Predicting the sentiment of a text review (positive or negative) based on its content.

Prediction problems are common in many fields, including finance, healthcare, marketing, and natural language processing, among others.

Following this concept, the Introduction will give an overview of one type of prediction problem: demand forecasting.

Demand forecasting is the process of predicting the future demand for a product or service. Businesses utilize it as a crucial component of supply chain management to predict the quantity and timing of the goods or services that customers will purchase in the future.

Demand forecasting is based on historical data and statistical methods, and it takes into account various factors such as past sales, market trends, seasonal patterns, customer behavior, and external factors such as economic conditions and competitors.

The main goal of demand forecasting is to help businesses make informed decisions about production planning, inventory management, pricing, and marketing. By correctly forecasting customer demand, businesses can manage their supply chains and prevent stockouts or overstocks, which can lead to lost sales, waste, and higher expenses. Demand forecasting uses a variety of methods, including time-series analysis, regression analysis, and machine learning algorithms like neural networks and decision trees. The type of data, the necessary level of precision, and the complexity of the issue all influence the technique selected.

Some common types of demand forecasting methods include:

- Qualitative methods: These rely on subjective inputs and expert opinions to make a forecast. Examples of qualitative methods include surveys, market research, and the Delphi method.

- Time-series methods: These use historical data to identify patterns and trends in the data and make predictions based on the historical patterns. Examples of time-series methods include moving averages, exponential smoothing, and ARIMA (Autoregressive Integrated Moving Average) models.

- Machine learning methods: These use advanced statistical techniques and algorithms such as neural networks, random forests, and support vector machines to make predictions based on the historical data.

- Causal methods: These use the relationships between the demand and other factors such as price, marketing campaigns, economic indicators, and demographics to make predictions. Examples of causal methods include regression analysis, econometric models, and multiple regression analysis.

The type of data, the amount of precision needed, and the complexity of the issue all influence the approach selection. It is crucial to combine various techniques in order to increase forecast accuracy and reduce the possibility of mistakes.

1.2.1.3 Implementation process of Demand Forecast

Demand forecasting implementation procedures vary depending on the type of method employed, the industry, and the business requirements. Nonetheless, some typical milestones in the demand forecasting implementation process include:

- Define the problem: The first step is to clearly define the problem and the business needs. This involves identifying the scope of the forecasting, the data requirements, and the objectives of the forecasting.

- Data collection and analysis: The next step is to collect and analyze the relevant data. This involves identifying the sources of data, cleaning and preprocessing the data, and analyzing the data to identify patterns and trends.

- Model selection: Based on the data analysis, the appropriate forecasting method is selected. The choice of method depends on the nature of the data, the level of accuracy required, and the complexity of the problem.

- Model development: The selected method is then used to develop a forecasting model. This involves specifying the parameters of the model, training the model on the historical data, and validating the model using cross-validation techniques.

- Forecast generation: Once the model is developed and validated, it is used to generate forecasts. The forecast can be generated for different time horizons, such as daily, weekly, or monthly.

- Performance evaluation: The performance of the forecasting model is evaluated using various metrics such as mean absolute percentage error (MAPE), mean squared error (MSE), and correlation coefficient (R); a minimal sketch of these metrics appears after this list. The performance evaluation helps to identify the accuracy of the forecast and the areas for improvement.

- Continuous improvement: Based on the performance evaluation, the forecasting model is refined and improved. This involves incorporating new data, updating the parameters of the model, and testing the model on new data.

The implementation process of demand forecasting is an iterative process that requires continuous monitoring and improvement to ensure that the forecasts are accurate and reliable.
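As a concrete illustration of the evaluation metrics named above, the following is a minimal Python sketch using NumPy; the demand and forecast arrays are hypothetical.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (assumes no zero actuals)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100

def mse(actual, forecast):
    """Mean squared error."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean((actual - forecast) ** 2)

def corr(actual, forecast):
    """Pearson correlation coefficient R between actual and forecast."""
    return np.corrcoef(actual, forecast)[0, 1]

# Hypothetical weekly demand vs. forecast
actual = [120, 135, 150, 160, 155]
forecast = [118, 140, 145, 158, 160]
print(mape(actual, forecast), mse(actual, forecast), corr(actual, forecast))
```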

1.2.1.4 Advantages and disadvantages of Demand forecast

∗ Advantages:

- Helps in effective planning: Demand forecasting provides valuable insights into the future demand of products or services, which helps businesses to plan their production, inventory, and distribution accordingly.

- Better resource utilization: With accurate demand forecasting, businesses can optimize their resource utilization and minimize wastage of resources.

- Improved customer satisfaction: By predicting the demand for their products or services, businesses can ensure that they meet the customer’s needs and expectations, which improves customer satisfaction.

- Increased efficiency: Demand forecasting gives companies the ability to streamline operations and boost productivity, which lowers costs and increases profits.

- Competitive advantage: By accurately predicting the demand for their products or services, businesses can gain a competitive advantage over their competitors.

1.2.2 Purpose of Demand Forecast

The main purposes of demand forecasting are:

- Production planning: Demand forecasting helps in planning production schedules, inventory management, and procurement of raw materials. By forecasting demand, businesses can avoid stockouts or excess inventory, which can lead to increased costs or lost sales.

- Sales and marketing: Demand forecasting helps businesses to plan their sales and marketing strategies. It provides insights into consumer behavior and trends, which can be used to develop effective marketing campaigns and promotions.

- Financial planning: Demand forecasting helps businesses to plan their financial resources, such as capital investments, cash flow, and budgeting. It allows them to forecast revenue and expenses and plan accordingly.

- Resource allocation: Demand forecasting helps businesses to allocate their resources effectively. It helps in determining the amount of labor, machinery, and other resources required to meet the expected demand.

- Risk management: Demand forecasting helps in risk management by identifying potential risks and uncertainties that may affect demand. By anticipating changes in demand, businesses can take proactive measures to mitigate risks and minimize the impact on operations.

- Supply chain management: Demand forecasting helps businesses to manage their supply chain more effectively. By forecasting demand, they can plan their procurement and transportation schedules, optimize inventory levels, and improve overall supply chain efficiency.

Overall, demand forecasting is an important tool for businesses to make informed decisions, optimize operations, and improve their bottom line.

1.3 Scope

- This report focuses only on the theory of the prediction problem, specifically Demand Forecast, Decision Tree, and Random Forest.

- In Implementation and evaluation of results, we focus on building and improving a model with the Random Forest algorithm.

- Besides, the report also compares the model with the Scikit-learn Random Forest implementation; through this comparison, we find ways to improve the accuracy of the Random Forest algorithm.

1.4 Structure of report

The structure of the thesis is divided into 5 main chapters, including the following contents:

- Chapter I - Introduction: Overview, topic content and purpose, topic scope, and structure of the report.

- Chapter II - Theoretical foundations and related research: Introducing fundamental knowledge that will be included in the thesis, including Decision Tree, Random Forest, and other algorithms such as K-Means, Naive Bayes, KNN, and Prophet.

- Chapter III - Proposed methods for prediction problem: Providing an overview of stocks, as well as preliminary evaluations of the algorithms previously researched, and selecting the appropriate algorithm.

- Chapter IV - Implementation and evaluation of results: Presenting the experiments, the steps of data preprocessing, model training, and model evaluation methods, and comparing the tested results of each model.

- Chapter V - Conclusion: Summarizing the main ideas of the thesis, analyzing the strengths and weaknesses of the trained models, proposing a suitable model for predicting stock prices, and outlining feasible solutions that can be pursued in the future and applied well in practice.

2 Theoretical foundations and related research

2.1 Theoretical foundations

2.1.1 Decision Tree

In many circumstances in life, we observe, reflect, and decide by asking questions. From this comes a machine learning model that consists of questions grouped in the shape of a tree: the decision tree model.

A decision tree is a hierarchical model used for making decisions or predictions based on a set of conditions or rules. It is a type of supervised learning algorithm that is commonly used in machine learning and data mining.

Both categorical and continuous input features can be handled by decision trees, and they are simple to grasp and interpret. They can also deal with noisy data and missing values. Nonetheless, decision trees are prone to overfitting, particularly when the tree is very complicated or the input contains a lot of noise or irrelevant information. Techniques like pruning and ensemble methods, such as Random Forest, can be utilized to overcome this issue.

2.1.1.2 Input and Output of Decision Tree

The input and output of a decision tree depend on the type of task the tree is being used for:

- For regression tasks:

∗ Input: A collection of training examples, each associated with a set of input features and a matching numerical output value, serves as the decision tree’s input. This input is used by the decision tree method to create a tree structure that can forecast new numerical values depending on the input features.

∗ Output: For a regression task, the decision tree’s output is a predicted numerical value for a given collection of input features.

- For classification tasks:

∗ Input: A group of training examples containing a variety of input attributes and a corresponding class label is the decision tree’s input. This information is used by the decision tree algorithm to create a tree structure that can categorize fresh samples according to the input features.

∗ Output: A predicted class label for a specific collection of input features is the result of the decision tree for a classification problem.

In both classification and regression tasks, the input features can be categorical or continuous variables. The decision tree algorithm uses a set of rules or criteria to split the data into subsets based on the input features.

The tree structure is built recursively by splitting the data at each node based on the best feature and threshold that minimize a cost function such as information gain or mean squared error. Once the tree is built, it can be used to make predictions for new examples by traversing the tree from the root to a leaf node based on the input features.
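To make the splitting step concrete, the following is a minimal sketch of selecting the best threshold on a single feature by information gain; the entropy-based impurity and the toy data are illustrative, not the thesis's implementation.

```python
import numpy as np

def entropy(labels):
    """Impurity of a label array: -sum(p * log2(p)) over classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(feature, labels):
    """Scan candidate thresholds on one feature; return the split with the
    highest information gain (parent entropy minus weighted child entropy)."""
    best_gain, best_t = 0.0, None
    parent = entropy(labels)
    for t in np.unique(feature):
        left, right = labels[feature <= t], labels[feature > t]
        if len(left) == 0 or len(right) == 0:
            continue
        w = len(left) / len(labels)
        gain = parent - (w * entropy(left) + (1 - w) * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Hypothetical data: one numeric feature, binary class labels
x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # threshold 3.0 separates the classes perfectly
```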

Table 2: Types of Decision Tree

- Input:
  - Categorical, i.e. variables with discrete values such as yes/no, true/false, or red/green/blue.
  - Continuous, i.e. variables with numerical values such as height, weight, or temperature.

- Output:
  - Classification tree: the algorithm splits the data based on the values of the input features, and the goal is to maximize the separation of the classes at each node. The output of the tree is a predicted class label for a given set of input features.
  - Regression tree: the algorithm splits the data based on the values of the input features, and the goal is to minimize the variance of the output values within each subset. The output of the tree is a predicted numerical value for a given set of input features.

2.1.1.4 Important Terminology and Structure of Decision Trees

Here is some important terminology used in decision trees:

- Root Node: The decision tree’s root node is the node at the top. It serves as a representation of the whole population or sample and is subsequently broken down into sub-nodes using a splitting criterion.

- Splitting: The action of dividing a node into two or more sub-nodes according to a predetermined criterion. The input feature that provides the greatest separation between the target classes or reduces the variance of the target variable is used to determine the split.

- Branches: A branch represents a sub-space or sub-group of the parent node that is created after splitting.

- Decision Nodes: A decision node represents an input feature together with a decision rule for splitting the data.

- Leaf Nodes: A leaf node represents the final outcome of the decision process. It contains the predicted class label or numerical value for the given input features.

- Pruning: Pruning is the process of reducing the size of the decision tree by removing unnecessary branches or nodes that do not contribute to the accuracy of the predictions.

- Impurity: Impurity measures the homogeneity of the target variable in a set of examples. The goal of the decision tree algorithm is to minimize the impurity of the subsets at each node, using measures such as entropy or the Gini index for classification trees, and mean squared error or variance reduction for regression trees.

- Information Gain: Information gain measures the reduction in impurity achieved by a split. The decision tree algorithm selects the input feature that provides the highest information gain to create a new decision node.

- Overfitting: Overfitting occurs when the decision tree is too complex and captures noise or irrelevant patterns in the training data. It leads to poor performance on new data and can be avoided by pruning the tree or using regularization techniques.

Figure 1: Structure of Decision Tree

According to Figure 1, the structure of a decision tree consists of nodes and branches. The nodes represent the input features and decision rules, while the branches represent the decision outcomes. The tree structure starts with a root node that represents the entire dataset. The root node is then split into sub-nodes based on the values of a selected input feature. Each sub-node represents a subset of the data that shares similar characteristics. The splitting process continues recursively until a stopping criterion is met, such as a maximum depth or a minimum number of examples per node.

2.1.2 Random Forest Algorithm

Random Forest is a machine learning algorithm that combines multiple decision trees to create a powerful ensemble model. The algorithm creates a large number of decision trees on random subsets of the training data and features and then aggregates the predictions of all the trees to make a final prediction. By combining the predictions of multiple trees, Random Forest is able to improve the accuracy and stability of individual decision trees.

It is commonly used for classification and regression tasks and is known for its ability to handle high-dimensional data with many features. Random Forest also has the ability to estimate feature importance and requires minimal hyperparameter tuning, making it a popular choice among machine learning practitioners.

2.1.2.2 Input and Output of Random Forest

∗ Input: The input of a Random Forest algorithm is typically a dataset containing both input features and a target variable. The input features are used to train the model and make predictions, while the target variable represents the variable that the algorithm aims to predict.

- In supervised learning tasks, the target variable is known for each instance in the training data and is used to train the model.

- In unsupervised learning tasks, there is no target variable, and the algorithm aims to find patterns or relationships in the data.

∗ Output: The output of a Random Forest algorithm depends on the type of problem being solved.

- For classification tasks, the algorithm outputs a probability or a predicted class label for each instance in the input data.

- For regression tasks, the algorithm outputs a continuous numerical value for each instance in the input data.

In both cases, the output of the Random Forest algorithm is a prediction of the target variable based on the input features. Additionally, Random Forest can also provide estimates of feature importance, which can be used to identify the most relevant features for predicting the target variable.

2.1.2.3 Working of Random Forest Algorithm

The Random Forest algorithm works by combining multiple decision trees to create a powerful ensemble model. The steps involved in the working of the algorithm are as follows:

- Step 1: The data is randomly sampled with replacement from the original dataset, creating multiple subsets of the data.

- Step 2: For each subset of the data, a decision tree is constructed by selecting the best split at each node based on a randomly selected subset of the features.

- Step 3: The process of creating subsets of the data and constructing decision trees is repeated to create a large number of trees in the forest.

- Step 4: When making a prediction for a new instance, each tree in the forest independently predicts the target variable based on the input features. The predicted values from all the trees are then aggregated to make a final prediction.

- Step 5: The aggregation method depends on the type of problem being solved. For classification tasks, the mode of the predicted class labels is taken as the final prediction. For regression tasks, the average of the predicted values is taken as the final prediction.

- Step 6: The Random Forest algorithm improves on individual decision trees by reducing the risk of overfitting the training data and increasing the diversity of the forest. The random selection of features and data subsets helps to reduce the correlation between trees and improve the overall performance of the model. Additionally, Random Forest can estimate feature importance and handle high-dimensional data with many features, making it a popular choice for many machine learning applications.

Figure 2 below illustrates the working of the Random Forest algorithm described above; a minimal from-scratch sketch of the same steps follows.
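The sketch below assumes scikit-learn's DecisionTreeRegressor is available as the base learner; it is an illustration of the steps above, not the thesis's actual model.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleRandomForest:
    """Bagged decision trees with per-split feature randomness (max_features)."""
    def __init__(self, n_trees=100, max_features="sqrt", random_state=0):
        self.n_trees = n_trees
        self.max_features = max_features
        self.rng = np.random.default_rng(random_state)
        self.trees = []

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.n_trees):
            # Step 1: bootstrap sample (draw n rows with replacement)
            idx = self.rng.integers(0, n, size=n)
            # Step 2: fit a tree; max_features adds randomness at each split
            tree = DecisionTreeRegressor(max_features=self.max_features)
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
        return self

    def predict(self, X):
        # Steps 4-5: average the per-tree predictions (regression aggregation)
        return np.mean([t.predict(X) for t in self.trees], axis=0)

# Hypothetical usage on synthetic data
X = np.random.rand(200, 5)
y = X[:, 0] * 3 + np.random.randn(200) * 0.1
model = SimpleRandomForest(n_trees=50).fit(X, y)
print(model.predict(X[:3]))
```

The same structure handles classification by swapping in a classifier as the base learner and replacing the average with a majority vote.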

An ensemble technique is a machine learning approach that combines multiple models to improve the predictive accuracy and stability of the overall model. It is based on the idea that multiple models can work together to overcome the limitations of individual models, leading to better performance and more robust predictions.

There are three types of ensemble method:

- Bagging: This technique involves training multiple models on different subsets of the training data, sampled with replacement. The predictions of all the models are then averaged to make a final prediction. Bagging can improve the stability and reduce the variance of the model.

- Boosting: This technique involves training multiple models sequentially, where each model is trained on the errors of the previous model. The final model is a weighted combination of all the models, with more weight given to the better-performing models. Boosting can improve the accuracy and reduce the bias of the model.

- Stacking: This technique involves training multiple models and then using their predictions as input features for a meta-model. The meta-model then combines the predictions of all the models to make a final prediction. Stacking can improve the performance and generalization of the model.

The picture below illustrates how Bagging and Boosting work.

Ensemble techniques can be applied to various types of machine learning models, including decision trees, neural networks, and support vector machines. They are commonly used in industry and research to achieve state-of-the-art performance in many machine learning tasks, such as image classification, natural language processing, and recommendation systems.
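A minimal sketch of the three ensemble styles, assuming scikit-learn's ready-made estimators (BaggingRegressor, AdaBoostRegressor, StackingRegressor) and synthetic data:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor, StackingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

X = np.random.rand(300, 4)
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + np.random.randn(300) * 0.1

# Bagging: parallel trees on bootstrap samples, predictions averaged
bagging = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50)

# Boosting: trees fit sequentially, each focusing on the previous errors
boosting = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3), n_estimators=50)

# Stacking: base models' predictions feed a meta-model (linear regression here)
stacking = StackingRegressor(
    estimators=[("tree", DecisionTreeRegressor(max_depth=3)),
                ("bag", BaggingRegressor(n_estimators=10))],
    final_estimator=LinearRegression())

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    model.fit(X, y)
    print(name, model.score(X, y))  # R^2 on the training data
```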

However, this part will concentrate on Bagging because it relates to Random Forest.

Bagging, also known as Bootstrap Aggregation, is a technique used in machine learning to reduce the variance of an estimator by averaging or aggregating the results from multiple models trained on different samples of the data. In the context of Random Forest, bagging is used to build an ensemble of decision trees by training each tree on a different bootstrap sample of the original training data.

Random Forest is an extension of the bagging technique where each decision tree is trained on a random subset of features, which helps to decorrelate the trees and reduce overfitting. Specifically, at each node of the tree, a random subset of features is selected as candidates for splitting the data. This means that each tree in the forest will have different splits, which further helps to reduce the correlation between the trees and improve the overall performance of the model.

The combination of bagging and feature randomness in Random Forest makes it a powerful and widely used algorithm for classification and regression tasks. The key benefits of Random Forest include its ability to handle high-dimensional data with complex interactions, its robustness to outliers and missing values, and its ability to provide measures of feature importance for interpretability.

Figure 4 below shows how Bagging works.

2.1.2.4 Important Features of Random Forest

Random Forest has several features that make it a popular and effective machine learning algorithm, including its ability to handle high-dimensional data, its built-in estimates of feature importance, its robustness to outliers and missing values, and its resistance to overfitting through bagging and feature randomness.

2.2 Related research

2.2.1 K-Means Algorithm

The K-means algorithm is a popular unsupervised machine learning algorithm used for clustering tasks. It is used to group similar data points into clusters based on their similarity.

The algorithm works by dividing a dataset into K clusters, where K is a user-defined parameter. The algorithm then iteratively assigns each data point to the nearest cluster center, or centroid, based on a distance metric such as Euclidean distance. After all data points have been assigned to a cluster, the centroid of each cluster is updated based on the mean of the data points in that cluster.

This process is repeated until the centroids no longer change or a maximum number of iterations is reached. The result of the algorithm is a set of K clusters, each represented by its centroid, which can be used for further analysis or visualization.

2.2.1.2 Input and Output of K-Means Algorithm

∗ Input: The input to the K-means algorithm is a dataset containing N data points and a user-defined parameter K, which specifies the number of clusters to be formed. The dataset can be represented as a matrix X with dimensions N x D, where N is the number of data points and D is the number of features or attributes for each data point.

∗ Output: The output of the K-means algorithm is a set of K clusters, each represented by its centroid, and a set of assignments of data points to these clusters. Specifically, the output includes the following:

- K cluster centroids, which are the mean values of the data points assigned to each cluster.

- A set of assignments of each data point to one of the K clusters based on the closest centroid.

- The within-cluster sum of squares (WCSS), which is the sum of the squared distances between each data point and its assigned centroid.

The K-means algorithm can also produce visualizations of the resulting clusters, such as scatterplots or heatmaps, to aid in interpretation and analysis of the results.

∗ Figure 5 below illustrates the use of the K-means clustering algorithm:

Figure 5: K-means Clustering Algorithm in Machine Learning

2.2.1.3 How does the K-Means Algorithm work?

The working of the K-Means algorithm is explained in the following steps (a minimal sketch follows the flow chart below):

- Step-1: Select the number K to decide the number of clusters.

- Step-2: Select K random points as centroids (they need not come from the input dataset).

- Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

- Step-4: Calculate the variance and place a new centroid in each cluster.

- Step-5: Repeat the third step, reassigning each data point to the new closest centroid of its cluster.

- Step-6: If any reassignment occurs, go to Step-4; otherwise, FINISH.

- Step-7: The model is ready, meaning the assumed centroids are removed and the final clusters remain.

∗ Figure 6 shows the flow chart of the K-Means algorithm:

Figure 6: Flow Chart of K-Means Algorithm
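The following is a minimal NumPy sketch of the loop above on hypothetical 2-D data; it also reports the WCSS described earlier, and for brevity ignores the edge case of a cluster becoming empty.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means: random initial centroids, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # Step 2
    for _ in range(iters):
        # Step 3: assign each point to its nearest centroid (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):   # Step 6: stop when nothing changes
            break
        centroids = new
    wcss = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
    return centroids, labels, wcss

# Hypothetical 2-D data with two obvious groups
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels, wcss = kmeans(X, k=2)
print(centroids, wcss)
```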

2.2.1.4 Advantages and Disadvantages of K-Means Algorithm

∗ Advantages:

- Simplicity: K-means is easy to understand and implement, making it suitable for clustering large datasets.

- Efficiency: K-means is a fast algorithm that can handle large datasets, making it useful in real-time applications.

- Scalability: K-means can handle a large number of variables and observations.

- Flexibility: K-means can be adapted to different types of data and distance metrics.

∗ Disadvantages:

- Sensitivity to initialization: K-means is sensitive to the initial choice of centroids and can get stuck in local optima, producing different results for different initializations.

- Number of clusters: The number of clusters in K-means must be specified beforehand, which can be difficult to determine in advance.

- Assumption of spherical clusters: K-means assumes that the clusters are spherical and have similar sizes, which may not be true for all datasets.

- Outliers: K-means is sensitive to outliers, which can greatly affect the results of clustering.

Overall, the K-means algorithm is a useful and popular method for clustering large datasets, but it is important to be aware of its limitations and to carefully consider the appropriate use case for each application.

The K-Means algorithm is very popular and used in a variety of applications such as market segmentation, document clustering, image segmentation, and image compression:

- Image segmentation: K-means is used to segment images into different regions based on color or texture similarities.

- Market segmentation: K-means is used to segment customers into different groups based on their preferences, buying habits, demographics, and other factors.

- Anomaly detection: K-means can be used to detect outliers or anomalies in datasets that do not fit the general pattern.

- Recommendation systems: K-means can be used in collaborative filtering to recommend products or services based on the similarity of users’ preferences.

- Clustering in bioinformatics: K-means is used to cluster genes, proteins, or other biological data to identify patterns or relationships.

- Text clustering: K-means is used to cluster similar documents based on their content or topics.

- Traffic flow analysis: K-means can be used to cluster traffic patterns in order to optimize traffic flow or identify congestion hotspots.

- Machine learning: K-means is often used as a pre-processing step for machine learning algorithms such as neural networks or decision trees.

These are just a few examples of the many applications of the K-means algorithm. Its simplicity, efficiency, and flexibility make it a popular method for data analysis and machine learning.

∗ Comparison between K-Means and Random Forest

K-Means and Random Forest are both popular machine learning algorithms, but they are used for different purposes and have different characteristics.

Table 4: Comparison between K-Means and Random Forest

- Purpose:
  - K-Means: an unsupervised learning algorithm used for clustering data into groups or segments based on similarity.
  - Random Forest: a supervised learning algorithm used for classification and regression tasks.

- Input data:
  - K-Means: a numeric dataset with a specified number of clusters to be identified.
  - Random Forest: a dataset with features and labels for each instance.

- Output:
  - K-Means: the cluster assignments for each data point.
  - Random Forest: a prediction or classification for each instance.

- Performance:
  - K-Means: known for its scalability and speed, making it suitable for large datasets.
  - Random Forest: more computationally expensive and may not be suitable for very large datasets.

- Interpretability:
  - K-Means: easy to interpret and understand, as it groups data into clusters based on similarity.
  - Random Forest: a complex model with many decision trees, making it less interpretable.

In summary, K-Means and Random Forest are both useful machine learning algorithms, but they are used for different purposes and have different strengths and weaknesses. K-Means is suitable for clustering data and is easy to interpret, while Random Forest is more accurate for classification and regression tasks but is less interpretable.

2.2.2 Prophet Algorithm

Prophet is an open-source time series forecasting algorithm published by the Core Data Science team at Facebook in 2017.

It is designed to be fast and accurate for large-scale time series datasets. Prophet uses an additive model in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It also accounts for changes in trend over time, as well as outliers and other effects.

The algorithm is highly customizable, can handle missing data, and is robust to outliers.

2.2.2.2 Input and Output of Prophet Algorithm

∗ Input: The input to the Prophet algorithm is a historical time series dataset that includes a timestamp column and at least one numerical column representing the variable to be forecasted. The timestamp column should be in a specific format, typically YYYY-MM-DD for daily data. The data can be in either a CSV or Pandas DataFrame format.

∗ Output: The output of the Prophet algorithm is a forecast of the future values of the input time series, along with a range of uncertainty intervals. The forecast can be generated for any desired number of future periods, which can be specified as a parameter in the algorithm. Additionally, the output includes various visualizations of the forecast, such as a plot of the predicted values, trend, and seasonality components.
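A minimal usage sketch, assuming the open-source `prophet` package and a hypothetical daily_demand.csv file with the required `ds` and `y` columns:

```python
import pandas as pd
from prophet import Prophet  # pip install prophet

# Historical daily series: 'ds' (YYYY-MM-DD timestamps) and 'y' (values)
df = pd.read_csv("daily_demand.csv")  # hypothetical file with ds,y columns

model = Prophet()          # yearly/weekly seasonality detected automatically
model.fit(df)

# Forecast 30 periods beyond the end of the history
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)

# yhat plus lower/upper uncertainty intervals, as described above
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
model.plot(forecast)       # built-in visualization of the forecast
```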

Its core is the sum of three functions of time plus an error term: growth g(t), seasonality s(t), holidays h(t), and error ε_t. The core idea is based around the structural decomposition:

y(t) = g(t) + s(t) + h(t) + ε_t

• y(t): the time series value at time t

• g(t): a piecewise linear or logistic growth curve for modelling non-periodic changes in the time series

• s(t): periodic changes (e.g. weekly/yearly seasonality)

• h(t): effects of holidays (user provided) with irregular schedules

• ε_t: an error term that accounts for any unusual changes not accommodated by the model

∗ Figure 7 shows the flow chart of the Prophet algorithm:

Figure 7: Flow Chart of Prophet Forecasting Model

The growth function models the overall trend of the data. The old idea should be familiar to anyone with a basic knowledge of linear and logistic functions. The new idea incorporated into Facebook Prophet is that the growth trend can be present at all points in the data or can be altered at what Prophet calls "changepoints".

Changepoints are moments in the data where the data shifts direction. These changepoints are selected automatically; however, a user can also specify the changepoints manually if required. In the plot below, the dotted lines represent the changepoints for the given time series.

The seasonality function is simply a Fourier series as a function of time. Seasonal effects s(t) are approximated by the following function:

s(t) = Σ_{n=1}^{N} [ a_n cos(2πnt / P) + b_n sin(2πnt / P) ]

• P is the period (365.25 for yearly data and 7 for weekly data).

• The parameters [a_1, b_1, ..., a_N, b_N] need to be estimated for a given N to model seasonality.

• The Fourier order N, which defines whether high-frequency changes are allowed to be modelled, is an important parameter to set here. For a time series, if the user believes the high-frequency components are just noise and should not be modelled, N can be set to a lower value; if not, N can be tuned to a higher value and set using the forecast accuracy.

Holidays and events incur predictable shocks to a time series. Prophet takes a list of dates and, when each date is present in the forecast, adds or subtracts a value on top of the growth and seasonality terms, based on historical data for the identified holiday dates.

2.2.2.4 Advantages and Disadvantages of Prophet Algorithm

∗ Advantages:

- Prophet is a flexible model that can handle a wide range of time series patterns, including trends, seasonality, and holiday effects.

- It can also incorporate user-provided external regressors, allowing for more accurate predictions.

- Prophet is relatively easy to use and can generate forecasts quickly.

- The model provides built-in visualization tools that make it easy to explore and understand the data and the model’s predictions.

∗ Disadvantages:

- Prophet may not be as accurate as more complex models for certain types of time series data.

- The algorithm may require significant tuning of its hyperparameters to achieve the best results.

- It may not be suitable for very large datasets due to its computational requirements.

The Prophet algorithm can be applied in various fields, including:

- Financial forecasting: Prophet can be used to predict stock prices, market trends, and other financial indicators.

- Demand forecasting: The algorithm can be used to forecast demand for products and services, allowing companies to optimize their inventory and supply chain management.

- Healthcare: Prophet can be used to forecast patient volumes and bed occupancy rates, helping healthcare providers better manage their resources.

- Weather forecasting: The algorithm can be used to forecast weather patterns, providing valuable information to industries such as agriculture and transportation.

- Marketing and sales: Prophet can be used to forecast sales volumes and customer behavior, allowing companies to better target their marketing and sales efforts.

- Social media analysis: The algorithm can be used to analyze social media trends and forecast user behavior, providing valuable insights to businesses and marketers.

∗ Comparison between Prophet and Random Forest

Prophet algorithm and Random Forest algorithm are two different approaches to forecasting and prediction tasks.

Table 5: Comparison between Prophet and Random Forest

- Type:
  - Prophet: a time-series forecasting algorithm, specifically designed to work with business data.
  - Random Forest: a machine learning algorithm that can be used for various types of prediction tasks, including regression and classification.

- Approach:
  - Prophet: based on decomposing time-series data into trend, seasonality, and holiday components and then applying statistical models to forecast future values.
  - Random Forest: based on building multiple decision trees and aggregating their predictions to obtain more accurate results.

- Strengths:
  - Prophet: known for its ability to handle missing data, outliers, and sudden changes in trends.
  - Random Forest: known for its ability to handle high-dimensional data, complex relationships between variables, and noisy data.

- Diagnostics:
  - Prophet: provides intuitive visualizations and diagnostic tools to evaluate the model’s performance.
  - Random Forest: provides feature importance rankings to understand the relative importance of each input variable.

In summary, while Prophet algorithm is specifically designed for time-series forecasting tasks and has features tailored for business applications, Random Forest algorithm is a more general-purpose machine learning algorithm that can handle various types of prediction tasks.

2.2.3 Naive Bayes Algorithm

2.2.3.1 Definition of Naive Bayes Algorithm

Naive Bayes is a supervised learning algorithm based on Bayes' theorem, used for classification tasks. It assumes that the presence of a particular feature in a class is independent of the presence of any other feature.

This assumption is called "naive" because in the real world, features are often dependent on each other. Despite this simplification, the algorithm has been shown to perform well in practice, especially when working with text classification problems.

The Naive Bayes algorithm calculates the probability of a data point belonging to a particular class based on the probability of the features in that class.

2.2.3.2 Input and Output of Naive Bayes

∗ Input: The input of the Naive Bayes algorithm is a dataset containing a set of training examples, each of which consists of a set of features and their corresponding class labels. The features should be independent of each other, and their values can be categorical or continuous.

∗ Output: The output of the Naive Bayes algorithm is a probabilistic prediction of the class label for a new example based on its features. Specifically, the algorithm calculates the probability of each class label given the values of the features of the new example, and then selects the class label with the highest probability as the predicted class label.

Bayes’ theorem is also known as Bayes’ Rule or Bayes’ law, which is used to determine the probability of a hypothesis with prior knowledge It depends on the conditional probability.

The formula for Bayes’ theorem is given as:

• P(A|B) is the posterior probability: the probability of event A occurring given that event B has occurred.

• P(B|A) is the likelihood: the probability of event B occurring given that event A has occurred.

• P(A) is the prior probability: the probability of the hypothesis before observing the evidence.

• P(B) is the marginal probability: the probability of the evidence.

Bayes’ Theorem is used in a variety of applications, including spam filtering, medical diagnosis, and machine learning algorithms such as Naive Bayes.

2.2.3.4 Types of Naive Bayes Model

There are three types of Naive Bayes Model:

- Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if predictors take continuous values instead of discrete, then the model assumes that these values are sampled from the Gaussian distribution.

- Multinomial: The Multinomial Naive Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e. determining which category a particular document belongs to, such as sports, politics, or education. The classifier uses the frequency of words as the predictors.

- Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether or not a particular word is present in a document. This model is also well known for document classification tasks.

The Naive Bayes algorithm works by using Bayes' theorem to calculate the probability of a hypothesis (in this case, a class label) given a piece of evidence (the input features). Bayes' theorem can be represented as:

P(h|e) = P(e|h) · P(h) / P(e)

where:

• P(h|e): the probability of the hypothesis (h) given the evidence (e).

• P(e|h): the probability of the evidence (e) given the hypothesis (h).

• P(h): the prior probability of the hypothesis (h).

• P(e): the prior probability of the evidence (e).

In the context of classification, the hypothesis is the class label and the evidence is the input features. The algorithm assumes that the input features are conditionally independent given the class label, meaning that the presence or absence of one feature does not affect the presence or absence of another feature. This simplifies the calculation of P(e|h) and allows for faster model training and prediction.

The algorithm works by first calculating the prior probability of each class label based on the training data. Then, for a new instance with input features, it calculates the posterior probability of each class label given the evidence using Bayes' theorem. The algorithm assigns the new instance to the class label with the highest posterior probability.

The Naive Bayes algorithm is a probabilistic model and works well for datasets with a large number of features and a relatively small amount of training data. It is commonly used for text classification, spam filtering, and sentiment analysis.
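A minimal sketch of this workflow using scikit-learn's GaussianNB on the bundled Iris dataset; it illustrates the general algorithm, not the thesis's experiments.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Gaussian variant: continuous features assumed normally distributed per class
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)             # learns priors P(h) and per-class Gaussians

print(model.predict(X_test[:5]))        # most probable class label per sample
print(model.predict_proba(X_test[:5]))  # posterior P(h|e) for each class
print(model.score(X_test, y_test))      # accuracy on held-out data
```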

∗ Figure 8 displays the flow chart of the Naive Bayes algorithm:

Figure 8: Flow Chart of Naive Bayes

2.2.3.6 Advantages and Disadvantages of Naive Bayes

∗ Advantages:

- Naive Bayes is a simple and easy-to-understand algorithm. It is easy to implement and requires minimal training data.

- It is efficient and can work well even with a small dataset.

- Naive Bayes can be used for both binary and multiclass classification problems.

- It can handle both continuous and discrete data.

- Naive Bayes can handle missing data, which is often an issue with other algorithms.

- It is highly scalable and can work well with high-dimensional datasets.

∗ Disadvantages:

- Naive Bayes makes an assumption of independence between features, which is not always true in real-world scenarios.

- It can be negatively affected by irrelevant features in the dataset.

- Naive Bayes is known to be a bad estimator, meaning that it tends to overestimate or underestimate the probabilities.

- It cannot learn interactions between features, so it may not perform well on some complex datasets.

- Naive Bayes assumes that all features are equally important, which may not be true in some cases.

- It may not perform well when the training data is unbalanced or skewed.

Naive Bayes algorithm has a wide range of applications, including:

- Email filtering: Naive Bayes algorithm is used in spam filters to determine whether an email is spam or not.

- Text classification: Naive Bayes algorithm is used in natural language processing tasks such as text classification, sentiment analysis, and topic modeling.

- Medical diagnosis: Naive Bayes algorithm is used in medical diagnosis to classify a patient’s symptoms and determine the likelihood of a particular disease.

- Image recognition: Naive Bayes algorithm is used in image recognition tasks such as face detection, object recognition, and character recognition.

- Recommendation systems: Naive Bayes algorithm is used in recommendation systems to suggest products or services to users based on their preferences and past behaviors.

- Fraud detection: Naive Bayes algorithm is used in fraud detection systems to detect fraudulent activities such as credit card fraud, insurance fraud, and identity theft.

- Weather forecasting: Naive Bayes algorithm is used in weather forecasting to predict the likelihood of a particular weather condition based on historical data.

- Stock price prediction: Naive Bayes algorithm is used in stock price prediction to analyze market trends and determine the likelihood of a particular stock price movement.

∗ Comparison between Naive Bayes and Random Forest

Naive Bayes and Random Forest are two popular machine learning algorithms used for classification problems. Here are some differences between the two:

Table 6: Comparison between Naive Bayes and Random Forest

- Algorithm type:
  - Naive Bayes: a probabilistic algorithm that calculates the probability of each class based on the input features.
  - Random Forest: a decision-tree-based algorithm that creates multiple decision trees and combines their outputs to make a final prediction.

- Handling missing data:
  - Naive Bayes: handles missing data by assigning probabilities based on the available data.
  - Random Forest: handles missing data by imputing the missing values using methods like mean imputation.

- Feature importance:
  - Naive Bayes: does not provide a measure of feature importance.
  - Random Forest: provides an importance score for each feature used in building the trees.

- Training time:
  - Naive Bayes: generally faster to train, as it requires fewer iterations.
  - Random Forest: generally slower to train, as it must build many decision trees.

- Performance:
  - Naive Bayes: performs well when the assumption of independence between features holds true.
  - Random Forest: performs well when there are complex interactions between features and a large number of data points.

Overall, Naive Bayes is a simpler algorithm and performs well on small datasets with fewer features, while Random Forest is more complex and performs well on larger datasets with many features.

2.2.4 K-Nearest Neighbors Algorithm

2.2.4.1 Definition of the KNN Algorithm

K-Nearest Neighbors (KNN) is a non-parametric and instance-based machine learning algorithm used for classification and regression tasks.

It is called non-parametric because it doesn’t make any assumptions about the probability distributions of the data, and it’s instance-based because the algorithm does not learn a model or parameters from the training data but instead stores all the training data points in memory and uses them for classification or regression of new data points.

2.2.4.2 Input and Output of the KNN Algorithm

∗ Input: The input of the K-Nearest Neighbors (KNN) algorithm consists of:

- A set of N data points, each consisting of p features.

- An unlabeled query point, which is the point for which we want to predict the label.

∗ Output: The output depends on whether k-NN is used for classification or regression:

- In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor.

- In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.

K-Nearest Neighbors (KNN) is a simple and intuitive algorithm used for classification and regression problems. It works on the principle of finding the k nearest data points in the feature space to a given input data point and using the labels of these k nearest neighbors to predict the label of the input data point. The working of KNN can be explained in the following steps:

- Step-1: Choose the value of k, the number of neighbors to consider.

- Step-2: Compute the distance between the input data point and all the data points in the dataset using a distance metric such as Euclidean distance or Manhattan distance.

- Step-3: Select the k-nearest neighbors based on the calculated distance.

- Step-4: For classification problems, assign the label of the input data point to the majority label of the k-nearest neighbors. For regression problems, predict the value of the input data point based on the average of the values of the k-nearest neighbors.

- Step-5: Return the predicted label or value of the input data point.

This process can be repeated for each input data point in the dataset to obtain the complete set of predicted labels or values.

It is worth noting that the choice of distance metric and the value of k can have a significant impact on the performance of the KNN algorithm. In general, a larger value of k results in a smoother decision boundary, while a smaller value of k results in a more complex decision boundary that can capture more intricate patterns in the data.
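A minimal NumPy sketch of the classification steps above, with hypothetical 2-D data:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify one query point by majority vote of its k nearest neighbors."""
    # Step 2: Euclidean distance from the query to every training point
    dists = np.linalg.norm(X_train - query, axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Step 4: plurality vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical 2-D data: two classes around (0,0) and (5,5)
X_train = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
y_train = np.array([0] * 20 + [1] * 20)
print(knn_predict(X_train, y_train, np.array([4.5, 5.2]), k=3))  # likely 1
```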

∗ Figure 9 shows the workflow of the KNN algorithm:

Figure 9: Flow Chart of KNN

2.2.4.4 Advantages and disadvantages of the KNN algorithm

∗ Advantages:

- Simple and easy to understand and implement.

- Can be used for both classification and regression tasks.

- No assumptions about data distribution are required.

- KNN can perform well with small or large datasets.

∗ Disadvantages:

- Computationally expensive for large datasets, as it requires a lot of memory to store the entire dataset.

- KNN is sensitive to irrelevant features, which can negatively impact the accuracy of the model.

- The selection of the appropriate value for K is critical. An inappropriate value of K can lead to poor performance.

- KNN is not suitable for high-dimensional data, as it becomes difficult to calculate the distance between data points in a high-dimensional space (known as the "curse of dimensionality").

KNN algorithm is widely used in various fields, including:

- Image recognition: KNN can be used to recognize images based on their features.

- Text mining: KNN can be used for text classification, document retrieval, and content-based image retrieval.

- Recommender systems: KNN can be used to make personalized recommendations based on the user’s past behavior.

- Bioinformatics: KNN can be used to analyze gene expression data, protein-protein interaction, and other biological datasets.

- Marketing: KNN can be used to identify groups of customers with similar buying behavior.

- Fraud detection: KNN can be used to detect fraudulent transactions by identifying patterns of fraudulent behavior.

- Geographic information systems: KNN can be used to identify clusters of similar data points on maps.

- Medical diagnosis: KNN can be used to diagnose diseases based on a patient’s symptoms and medical history.

∗ Comparison between KNN and Random Forest

K-Nearest Neighbors (KNN) and Random Forest are both supervised machine learning algorithms used for classification and regression tasks. However, there are some key differences between them:

Table 7: Comparison between KNN and Random Forest

- Approach:
  - KNN: a lazy learning algorithm, meaning it does not build a model during training; instead, it stores all the training data and uses it during the prediction phase.
  - Random Forest: an eager learning algorithm, meaning it builds a model during training and uses it for predictions.

- Model complexity:
  - KNN: a relatively simple algorithm, with no parameters to tune during training.
  - Random Forest: has several hyperparameters that can be tuned to improve performance.

- Interpretability:
  - KNN: a transparent algorithm, meaning it is easy to understand and interpret how the predictions are made.
  - Random Forest: a black-box algorithm, meaning it is more difficult to understand how the predictions are made.

- Data requirements:
  - KNN: requires a larger amount of training data to make accurate predictions, as it relies on the similarity between data points.
  - Random Forest: can work well with smaller datasets, as it uses a combination of multiple decision trees.

In summary, KNN is a simple, transparent algorithm that works well with large datasets, while Random Forest is a more complex, black-box algorithm that can work well with smaller datasets and can provide better accuracy with the right hyperparameter tuning.

3 Proposed methods for prediction problems

3.1 Problem specification

3.2 Proposed methods

4 Implementation and evaluation of results

4.1 Hardware and Dataset

4.2 Experiment and evaluate the results

4.3 Random Forest Model Improvement

5 Conclusion

5.1 Advantages and Disadvantages of the Proposed Method
