
Implements an enhancement in the pre- and post-processing capabilities of the EFDC Explorer Modeling System = Phát triển nâng cao khả năng tiền xử lý và hậu xử lý của hệ thống mô hình hoá EFDC Explorer


DOCUMENT INFORMATION

Basic information

Title: Implements an enhancement in the pre- and post-processing capabilities of the EFDC Explorer Modeling System
Author: Auguste Dalby
Supervisor: Dr. Nghiem Tien Lam
University: Université Nationale du Vietnam, Hà Nội
Specialization: Systèmes Intelligents et Multimédia
Document type: Mémoire de fin d'études en master informatique (Master's thesis in computer science)
Year of publication: 2021
City: Hà Nội
Format
Number of pages: 44
Size: 0.97 MB


Structure

  • 1.1 General presentation of the company
  • 1.2 Motivation
  • 1.3 Context and Problem Statement
    • 1.3.1 Context
    • 1.3.2 Objectives
    • 1.3.3 Problem Statement
  • 2.1 Time series
    • 2.1.1 Time Series
  • 2.2 Time series prediction
    • 2.2.1 Different types of time series
    • 2.2.2 Univariate time series prediction models
    • 2.2.3 Multivariate time series prediction models
    • 2.2.4 Recurrent neural networks
    • 2.2.5 Other models
    • 2.2.6 Context
  • 2.3 State of the art in water level forecasting
  • 3.1 Hardware and software resources
  • 3.2 Data
  • 3.3 Data formatting
  • 3.4 Frameworks
  • 3.5 LSTM structure
    • 3.5.1 Input
    • 3.5.2 Layers
    • 3.5.3 Optimal hyperparameters
    • 3.5.4 Dropout
    • 3.5.5 Number of epochs
    • 3.5.6 Number of LSTM cells in each layer
  • 3.6 LSTM loss function
  • 3.7 LSTM evaluation
    • 3.7.1 Optimizer
    • 3.7.2 Error computation
    • 3.7.3 Actual prediction
  • 3.8 Architecture
  • 3.9 Algorithm
    • 3.9.1 Network input
    • 3.9.2 Hyperparameters and training parameters
  • 4.1 Performance measures
  • 4.2 Denormalized prediction on the test data
  • 4.3 Test score
  • 5.1 General conclusion
  • 5.2 Perspectives
  • 3.1 Visualization of the data before cleaning
  • 3.2 Z-normalization formula
  • 3.3 Visualization of the data after cleaning
  • 3.4 Architecture of the solution
  • 3.5 Implementation of the attention mechanism
  • 4.1 RMSE
  • 4.2 LSTM model water level prediction
  • 4.3 GRU model water level prediction
  • 4.4 Bi-directional LSTM model water level prediction
  • 4.5 Comparison of different water level prediction models
  • 4.6 Comparison of the MSE, MAE, RMSE tests for LSTM, GRU and BIDIRECTIONAL

Contents


General presentation of the company

Dynamic Solutions-International, LLC, known as DSI, is an engineering, planning, research, and software development company founded in the United States in 1999. Headquartered in Edmonds, Washington, DSI established its first representative office in Hanoi, Vietnam, in 2002 to offer targeted technical and management support to international and national development agencies, as well as international businesses.

DSI recognizes the critical importance of water and environmental policies and practices for sustainable development. The organization contributes to the formulation of integrated water and environmental policies, regulations, and capacity-building initiatives on a global scale. Through its holistic approaches, DSI assists clients in assessing project risks, potential impacts, and the likelihood of successful and sustainable implementation.

They offer training programs, seminars, and workshops designed to assist clients and partners in acquiring new skills related to the software and programs developed by DSI, specifically EFDC_Explorer and CVL Grid.

Presentation of the DSI team and the Vietnam offices

The DSI Vietnam team consists of around sixty experts, including twenty individuals from diverse nationalities assigned to various projects. Significant changes occurred in the internal organization from the beginning to the end of the project. Initially, we were divided into teams, each with a technical manager and specific roles. These teams largely reflected the company's divisions, including home, personal, consumer, and enterprise groups, as well as cross-functional technical groups focused on access core networks and platform middleware, where I was integrated.

In June 2014, Orange Silicon Valley underwent a significant internal transformation, shifting from team-based work to a project-oriented approach. This change involved the departure of technical directors who previously managed teams, allowing employees to take ownership of projects, which are now directly approved by the vice president and president.

All workstations are located in an open space, with only the CEO, VP, and HR director having personal offices. The work atmosphere is both focused and relaxed, featuring no dress code and fostering simple, non-hierarchical relationships among colleagues. A calm environment prevails, as noise in the open area can quickly become disruptive. Our team communicates through instant messaging, even while sitting face-to-face, allowing us to discreetly share links to interesting articles, discuss projects, or schedule meetings. When necessary, we move to one of the many meeting rooms or the communal kitchen, where free coffee and beverages are a staple, reflecting the culture throughout Vietnam.

Motivation

In the age of Big Data, time series data is generated in vast quantities due to the widespread presence of sensors and Internet of Things (IoT) devices that continuously produce information. While the data collected by these devices is valuable and can yield significant insights, there is an increasing demand for the development of algorithms that can efficiently process this data.

In various fields such as economics, robotics, energy, security, meteorology, healthcare, and hydrology, it is crucial to analyze and monitor collected data. This analysis enables short- and long-term forecasts, including investment returns, market penetration, production and inventory management, and predicting water levels during flood stages in pumping stations, reservoirs, lakes, and basins. Such insights facilitate informed decision-making and strategic actions.

This work focuses on predicting water levels at various hydrological stations. A key advantage of the EEMS water level forecasting system is the availability of monitoring data from these stations and the increasing number of sensors collecting information. The goal is to enhance the EEMS modeling system by employing deep learning techniques to provide improved predictions and analyze environmental events effectively.

Statistical data analysis and machine learning are widely studied topics with strong connections to various fields. The primary goal of machine learning is to detect and learn phenomena as humans do, or even better in some respects. Researchers have employed both simple and complex decision models to achieve this objective. Key aspects of machine learning studies include reduced human interaction and improved performance. For simpler algorithms, a good and simplified representation of data is often essential. However, without a deep understanding of the data and the model, this simplified representation can lead to significant challenges, particularly with neural networks, which are often viewed as black boxes. In such cases, it is more effective to design a model that can learn data characteristics by extracting abstract features or transforming the data into a more compact form. The simplest form of deep architecture, the perceptron, was defined in the early years of this field.

In the 1950s, multilayer structures with limited learning capabilities were introduced. For over 40 years, these architectures have sought to evolve in response to specific demands, driven by advancements in information technology, machine learning, and signal processing, particularly in the context of temporal sequences.


Context and Problem Statement

Context

The primary goal is to address the challenge of predicting water levels in parking areas using time series data. We aim to employ machine learning techniques, specifically representation learning, to effectively tackle these issues.

Water level forecasting addresses the challenge of predicting water levels to inform decision-making. EEMS is examined as a specific case of forecasting, aiming to estimate future observations based on historical knowledge. Generally, forecasting involves interpreting historical data, which consists of a series of observations recorded at fixed intervals and organized chronologically.

Objectives

The objectives of this internship include studying the current state of existing research on water level forecasting over time sequences. This understanding will facilitate the proposal of an architecture that utilizes recurrent neural networks.

We propose a novel architecture for water level prediction using time series sequences. The suggested architecture is a recurrent LSTM network designed specifically for forecasting time series data.

Problem Statement

Forecasting water levels using time series data is crucial in fields like energy, meteorology, and hydraulics. Advances in water level prediction primarily rely on supervised machine learning algorithms that require large labeled datasets for training. Consequently, data collection in practical applications is challenging, costly, and demands domain expertise from specialists.

1 https://www.360logica.com/blog/benefits-machine-learning-big-data-analytics/

2 https://www.qubole.com/blog/big-data-machine-learning/

In this section, we will explore deep learning techniques applied to time series problems, highlighting their advantages and disadvantages. Additionally, we will examine areas of interpretability related to various technical decisions.

Time series

Time Series

The study of time series, or chronological series, corresponds to the statistical analysis of observations regularly spaced in time.

A time series, also known as a chronological series, is a finite sequence of data points indexed by time, which can be measured in minutes, hours, days, years, etc. The length of the series refers to the number of data points it contains. It is often beneficial to represent the time series graphically, with time on the x-axis and the value of observations on the y-axis. Typically, a chronological series consists of data collected at evenly spaced intervals over time, forming a discrete time sequence.

The order of the data points is essential, as it influences their dependency and significance. Data points in a time series possess several predefined properties, one of which is that they must be collected through repeated measurements over equal time intervals. Additionally, the time interval of the data points must be continuous, and each time unit observation should contain at most one data point.

This sequence of observations of a family of real random variables, denoted (X_t)_{t∈Θ}, is called a chronological (or time) series, and can therefore be written (X_t)_{t∈Θ}, where the set Θ is called the time space, which can be:

A discrete time series is a finite sequence of real numbers (x_t) for 1 ≤ t ≤ n, representing time in units such as minutes, days, or years. This is applicable when observation dates are often equidistant, so Θ is a subset of the integers (Z). Examples include daily SNCF traveler counts or maximum temperatures. These equidistant dates are indexed by integers t = 1, 2, ..., T, where T represents the total number of observations. Consequently, we have observations of the variables X_1, X_2, ..., X_T derived from the family (X_t) for t in Θ.

Θ ⊂ Z (most often Θ = Z). Thus, if h is the time interval separating two observations and t_0 the instant of the first observation, we have the following scheme.

In continuous-time processes, such as those represented by radio signals or electrocardiogram results, the time index takes values within an interval of the real numbers (R). This allows for potentially infinite observations derived from the process (X_t), where t belongs to the interval Θ. Such processes are classified as continuous-time processes, as noted by César and Richard (2006) and Charpentier (2006).

Time series are utilized across various fields, including statistics, economics, pattern recognition, control engineering, signal processing, astronomy, meteorology, and entertainment. The analysis of time series has been a driving force in research for decades. Despite advancements, there remain numerous open topics related to time series across several domains:

— finance and econometrics: evolution of stock indices, prices, corporate economic data, sales and purchases of goods, agricultural or industrial production,

— medicine/biology: monitoring the evolution of pathologies, analysis of electrocardiograms,

— signal processing: communication, radar, and sonar signals, speech analysis,

— data processing: successive measurements of the position or direction of a moving object (trajectography),

1 http://eric.univ-lyon2.fr/~jjacques/Download/Cours/ST-Cours.pdf

— hydrology: predicting water levels at certain flood stages; data have been collected on typhoons, specific climates, seasonal rainfall, and water levels.

The main lines of study around time series proposed in the literature are the following:

1. Classification: given a time series X, the task is to assign it to one of (two or more) predefined classes [61, 55, 129]. Young-Seon Jeong (2011), Bing Hu and Keogh (2013), Yi Zheng and Zhao (2014).

2. Indexing: given a time series X and a similarity (or dissimilarity) measure D(X, X′) such that D(X, X′) is large if the series X and X′ are similar and small otherwise, the task is to find the time series most similar to X in a given database [65, 47]. Eamonn Keogh and Mehrotra (2001), Gunopulos and Das (2001).

3. Anomaly detection: given a time series X that is considered "normal", determine which series within a database contain an "anomaly" [48, 119]. Guralnik and Srivastava (2019).

4. Segmentation: given a time series X = x_1, x_2, ..., x_T with x_i ∈ R for all i, the task is to find an approximation X̂ = k_1, k_2, ..., k_K with k_i ∈ R for all i and K ≪ T, where X̂ is a good approximation of X [52, 66]. Johan Himberg and Toivonen (2001).

5. Prediction: given a time series X = x_1, x_2, ..., x_T containing T points, the task is to predict the next value or values, that is, x_{T+1}, x_{T+2}, x_{T+3}, ... [128, 23, 70, 112]. Takashi Kuremoto and Obayashi (2014).

We focus on the last of these, namely the prediction task. We will then propose a time series prediction method.

Time series prediction

Different types of time series

1. Univariate time series: a single variable varies over time. For example, data collected from a sensor measuring the temperature of a room every second; at each second you therefore have only a single one-dimensional value, the temperature.

2. Multivariate time series: comprises several time-dependent variables. Each variable depends not only on its own past values but also has some dependence on the other variables, and this dependence is used to forecast future values.

Univariate time series prediction models

Most real-world applications of univariate time series modeling utilize linear models, with autoregressive (AR) models being the most popular. One advantage of these models is their ability to provide a good first-order approximation of the underlying data dynamics. Theoretically, they can perfectly model data that is fully described by its first and second moments under a Gaussian distribution. These methods are also appealing due to their simplicity and relative efficiency, even for problems known to exhibit nonlinear dynamics. In many cases, nonlinearity is either not significant enough or not consistent over time for AR models to perform unacceptably. Consequently, autoregressive models remain the most widely used time series models, with parameters learned by minimizing the least squares error. A time series is modeled by an AR model if, at a given time t, certain conditions are met.

This corresponds to the autoregressive model of order p. The error term is typically defined as white noise, meaning it is uncorrelated over time, with constant variance and a mean of zero. Numerous extensions of these models have been proposed, though an exhaustive description is not feasible.
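The AR(p) equation itself did not survive extraction; as a minimal sketch of the model just described, fit by least squares as the text states (the function names are ours, not the thesis's):

```python
import numpy as np

def fit_ar(series, p):
    """Fit an AR(p) model x_t = c + a_1*x_{t-1} + ... + a_p*x_{t-p} + e_t
    by ordinary least squares."""
    n = len(series)
    # Column i holds lag i+1: x_{t-1}, ..., x_{t-p} for each target x_t.
    X = np.column_stack([series[p - i - 1 : n - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(len(X)), X])   # intercept term c
    y = series[p:]
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs                                # [c, a_1, ..., a_p]

def predict_next(series, coefs):
    """One-step-ahead forecast from the fitted coefficients."""
    p = len(coefs) - 1
    lags = series[-1 : -p - 1 : -1]             # x_{t-1}, ..., x_{t-p}
    return coefs[0] + np.dot(coefs[1:], lags)
```

The residuals of such a fit are the white-noise error term the paragraph above refers to.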

In the early 1990s, nonlinear time series prediction models gained popularity, primarily in finance and transportation, due to their enhanced performance requiring larger datasets for training. Various methods emerged, starting with statistical approaches like the ARCH and its extension GARCH models. Additionally, machine learning techniques, particularly Support Vector Regression (SVR) and neural networks, became the most favored models for sequence modeling and prediction, typically functioning as autoregressive models.

These models are trained using a sliding window, as is the case for AR models. The mechanism is illustrated in Figure 2.1.
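The sliding-window training scheme just mentioned can be sketched as follows (the helper name is illustrative): each input is a fixed-size window of past values and the target is the value that follows it.

```python
import numpy as np

def sliding_windows(series, window, horizon=1):
    """Turn a 1-D series into (input window, target) pairs, the
    sliding-window scheme used to train AR-style and neural models."""
    X, y = [], []
    for t in range(len(series) - window - horizon + 1):
        X.append(series[t : t + window])          # past values
        y.append(series[t + window + horizon - 1])  # value to predict
    return np.array(X), np.array(y)
```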

Multivariate time series prediction models

Time series prediction models have been adapted for multivariate time series forecasting, where multiple values change simultaneously, as shown in Figure 2.2. Among the various existing models, the Vector Autoregressive (VAR) model is one of the most widely used and recognized.

It is a natural extension of the univariate autoregressive AR model. It assumes that the value of a time series i depends on the previous values of that series during a given time window, but also on the values of the other series considered during that same time interval. This is written:

This model, due to its simplicity, has been widely used and has proven particularly effective in certain predictive tasks. However, the assumption of linear dependencies between series is a strong limitation that restricts the expressive power of these models. Additionally, the complexity associated with the number of parameters poses further challenges.
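As a minimal sketch of the VAR idea described above, assuming a one-lag model fit by least squares (the thesis gives no code; the names here are ours): each series at time t is regressed on all series at time t-1.

```python
import numpy as np

def fit_var1(data):
    """Fit a VAR(1) model x_t = c + A @ x_{t-1} + e_t by least squares.
    data: array of shape (T, k) holding k parallel series."""
    past, present = data[:-1], data[1:]
    Z = np.column_stack([np.ones(len(past)), past])  # [1, x_{t-1}]
    B, *_ = np.linalg.lstsq(Z, present, rcond=None)  # shape (k+1, k)
    c, A = B[0], B[1:].T
    return c, A

def forecast_var1(last, c, A):
    """One-step forecast: every series uses every series' last value."""
    return c + A @ last
```

The k×k matrix A is exactly where the parameter count grows quadratically with the number of series, the limitation the paragraph above notes.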

A neural network is trained to predict a time series by taking a window of past values as input to forecast the next value. However, the quadratic growth in the number of parameters with the number of time series to predict makes this approach challenging to implement when dealing with a large number of time series.

Recurrent neural networks

In recent years, there has been a resurgence in the use of recurrent neural networks (RNNs) for sequence processing, particularly in predictive tasks. RNNs have the ability to retain information from past computations, making them especially suitable for handling sequential data. In theory, these networks can effectively remember previous inputs, enhancing their performance in various applications.

In theory, these networks can retain information seen in an arbitrarily long sequence, but in practice they lose effectiveness on very long-term dynamics. It is in this sense that recent work has seen the emergence of recurrent neural network architectures equipped with "gate" mechanisms that considerably improve the memorization capacity of the models. More specifically, certain types of recurrent networks, LSTMs [53] and GRUs [30], have proven particularly effective for modeling sequences whose dynamics can stretch far back in time; this is not the case for the autoregressive approach, which at inference time relies only on a fixed-size time window into the past. Finally, in machine learning, most of the progress made in recent years in sequence modeling has come from gated recurrent architectures.
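To illustrate the gating mechanism credited here, a single LSTM time step can be sketched in NumPy. This is a didactic sketch of the standard LSTM equations, not the thesis's Keras implementation; the parameter layout is our own convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W (4H, D), U (4H, H) and b (4H,) stack the
    input-gate, forget-gate, output-gate and candidate blocks."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])        # input gate: how much new information enters
    f = sigmoid(z[H:2*H])      # forget gate: how much old memory is kept
    o = sigmoid(z[2*H:3*H])    # output gate: how much memory is exposed
    g = np.tanh(z[3*H:4*H])    # candidate cell state
    c = f * c_prev + i * g     # gated memory update: the long-term path
    h = o * np.tanh(c)         # gated output
    return h, c
```

The additive update of c is what lets gradients flow across many steps, which is why these gated cells handle long-term dynamics better than a fixed window.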

Other models

Numerous models have been proposed for time series prediction, making it impractical to provide a comprehensive overview. Many of these models are tailored for specific contexts, such as the Croston model, which is particularly effective for time series that frequently contain zero values. Exponential smoothing methods are also commonly employed in practice, although they closely resemble autoregressive methods. In this work, we focus on the models that are most familiar to the machine learning community.

Figure 2.2: Multivariate time series: three values evolve in parallel, and the multivariate prediction task consists of predicting the future evolution of each of the series considered.

Context

Forecasting is defined as the process of making predictions about future outcomes based on past and present data. This is typically achieved through the application of qualitative or quantitative methods, depending on the availability of historical data. Forecasting is crucial for various reasons and has been utilized across multiple fields, including business, science, and economics. While it can be challenging to determine the exact nature of future events, forecasting techniques enable us to gain insights into potential future occurrences.

Various statistical techniques can be employed in forecasting. Many foundational concepts of statistical learning were established long ago, with the method of least squares being introduced in the early 19th century by Legendre and Gauss.

The method implemented the first form of linear regression, known both as a statistical algorithm and a machine learning algorithm, capable of estimating relationships between variables and used for prediction and forecasting. In 1936, Fisher introduced linear discriminant analysis, followed by the proposal of logistic regression a few years later. Both methods compete in analyzing categorical response variables, with discriminant analysis assuming normally distributed independent variables, unlike logistic regression, which many statisticians consider more suitable for various modeling situations. In the early 1970s, Nelder and Wedderburn presented the class of generalized linear models, providing a framework for managing common statistical models, including multiple linear regression and logistic regression as specific cases. Later, Hastie and Tibshirani coined the term generalized additive models (GAM) to merge the properties of generalized linear models with additive models. During this period, most methods were linear, making it computationally challenging to address nonlinear relationships. However, advancements in computing around 1980 facilitated the handling of nonlinear problems, leading to the introduction of classification and regression trees by Breiman, Friedman, Olshen, and Stone.

There are various statistical methods available for forecasting, each with different levels of complexity. Forecasting methods can generally be categorized into three types: first, subjective or qualitative methods that rely on judgment and knowledge to create predictions; second, univariate methods, often referred to as naive methods, which base forecasts solely on past observations; and third, multivariate methods, which utilize the multiple time series included in the model for a more comprehensive analysis.

Some of the most effective statistical techniques for time series analysis include simple moving averages, exponential smoothing, autoregressive moving average (ARMA), and autoregressive integrated moving average (ARIMA), commonly referred to as the Box-Jenkins approach.

The Box-Jenkins method, introduced by George Box and Gwilym Jenkins in their book "Time Series Analysis: Forecasting and Control," posits that the process generating a time series can be effectively modeled using an ARMA model if the series is stationary, or an ARIMA model if it is non-stationary.

An ARIMA model is an extension of the ARMA model and has been integral to general time series forecasting methods in hydrology since the publication of Box and Jenkins' article in 1976. These methods involve applying autoregressive integrated moving average models to find the best fit for time series data based on its past values. They operate under the assumption that time series are generated from a linear process. The primary advantages of forecasting with linear statistics include the simplicity of linear models, their ease of explanation, and the ability to analyze them in great detail.
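The "integrated" part of ARIMA mentioned above amounts to differencing a non-stationary series before fitting an ARMA model, and then inverting the differencing to return forecasts to the original scale. A small sketch (helper names are ours):

```python
import numpy as np

def difference(series, d=1):
    """The 'I' in ARIMA: difference a series d times to remove trend,
    leaving a (hopefully) stationary remainder for the ARMA fit."""
    for _ in range(d):
        series = np.diff(series)
    return series

def undifference(original, diffs):
    """Invert one level of differencing: add the forecast increments
    back onto the last observed value of the original series."""
    return original[-1] + np.cumsum(diffs)
```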

However, these models are not well-suited for nonlinear systems, as they struggle to accurately estimate the underlying relationships between inputs and outcomes due to the complexity of such systems. Machine learning, a branch of artificial intelligence, focuses on creating algorithms that can learn from data without being explicitly programmed for specific tasks. The rise in popularity and application of machine learning can be attributed to societal changes, such as increased computing power that facilitates data processing and the vast amounts of data available globally. One of the earliest machine learning algorithms, initially known as the perceptron and later as artificial neural networks (ANN), gained significant traction for predictive modeling in hydrology. Although the concept of artificial neural networks was first introduced in 1943, it did not attract much attention until the backpropagation algorithm for feedforward neural networks was introduced by Rumelhart in 1986.

State of the art in water level forecasting

The most commonly used forecasting methods are multiple linear regression and time series analysis, which often serve as benchmarks for more complex machine learning techniques. Simple models can sometimes outperform more intricate ones. However, when dealing with complex and unstable time series, capturing the intricacies of the data can be challenging. In such instances, machine learning methods are likely to outperform simpler models.

Over the past 25 years, the use of Artificial Neural Networks (ANN) for predicting water resource variables has become a well-established research field. ANN is one of the most widely used data-driven techniques for developing models to forecast water resource variables. They have been applied to predict various aspects of water resources, including reservoir inflows, flow forecasts, river discharge, groundwater levels, and other time series issues. A comprehensive review article has been published on this topic.

In 1999, Maier and Dandy reviewed 43 articles in the field of hydrology, focusing on publications from 1992 to 1998 that utilized artificial neural networks (ANN) for predicting and forecasting water resources. The first article appeared in 1992, with a noticeable increase in publications in subsequent years. Most projects concentrated on streamflow, including rainfall, river flow, and reservoir inflow, while others aimed to forecast precipitation, water levels, or water quality variables. Despite recognizing the potential of ANNs, the review highlighted significant challenges, particularly the inadequate description of modeling processes, which complicates performance comparisons among different models. Given that ANN models rely on available data, it is crucial to adhere to best practices during model development. Nearly a decade later, Holger et al. published another synthesis article that outlined the steps in ANN model development, providing taxonomies for each stage and covering 210 review articles.

Between 1999 and 2007, the majority of studies focused on predicting flows, with the most commonly used model architecture being the multilayer perceptron (MLP), a type of feedforward artificial neural network.

Often, several methods are used when forecasting. Mohammad et al.

A comparison was made between artificial neural network models, ARMA, ARIMA, and autoregressive models to forecast the monthly inflow of the Dez dam reservoir. Monthly output data from 1960 to 2007, covering 42 years, was utilized for training, while 5 years of data were reserved for testing. The evaluation of these models involved analyzing the mean squared error and mean bias, highlighting the superior performance of the dynamic artificial neural network model using the sigmoid activation function.
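The evaluation criteria used in this study and later in the thesis (mean squared error and mean bias, plus the related MAE and RMSE) can be written directly:

```python
import numpy as np

def mse(y, yhat):
    return float(np.mean((y - yhat) ** 2))       # mean squared error

def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))      # mean absolute error

def rmse(y, yhat):
    return float(np.sqrt(mse(y, yhat)))          # root mean squared error

def mean_bias(y, yhat):
    return float(np.mean(yhat - y))              # signed average error
```

Mean bias is signed, so over- and under-prediction cancel; it complements the always-positive MSE/MAE/RMSE rather than replacing them.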

Wang et al. conducted a study utilizing various artificial intelligence methods to compare forecasting techniques for monthly river flow data from the Manwan hydropower station. Their approach included autoregressive moving average models, artificial neural networks, adaptive neuro-fuzzy inference systems, genetic programming, and support vector machines. The study aimed to predict monthly flow time series using long-term observations, and the key findings revealed that AI methods significantly outperformed traditional ARMA techniques.

The Support Vector Machine (SVM) algorithm has been extensively utilized in hydrology, demonstrating superior predictive accuracy compared to Artificial Neural Networks (ANN) and Autoregressive Moving Average (ARMA) models. This study aimed to forecast long-term discharges based on 30 years of historical data. The findings revealed that SVM outperformed both ARMA and ANN, primarily due to its unique ability to manage the nonlinear characteristics of time series data. Additionally, SVM is less prone to overfitting when parameters are appropriately selected, addressing one of the main limitations of traditional time series models, which often struggle with accurately representing nonlinear dynamics.

In this new chapter, we will explore the technical details, including hardware and software resources, datasets, model training, and tips that have advanced our work thus far. This section focuses solely on what has been successfully implemented in our solutions, omitting any failed tests. From our hardware and software resources to model training and datasets, this serves as a comprehensive guide towards the conclusion of this document.

Water level forecasting using time series analysis has proven to be a valuable tool across various applications, including energy, meteorology, and hydrology. The primary aim of analyzing these phenomena is to uncover the governing mechanisms for a better understanding of their nature. Advances in water level forecasting have largely relied on supervised machine learning algorithms, which require the training of large labeled datasets.

Hardware and software resources

Computer configuration (processor, RAM, graphics card, OS): 12288 MB DDR3 RAM; Intel(R) HD Graphics 4400 graphics card.

Data

Our work focuses on the study of a WestoverT database (one of the stations) obtained from the clients.

All data from WestoverT has been selected for this experiment. Water levels are recorded at one-hour intervals by WestoverT. The raw data includes information such as date/time and water level, with the date/time representing the measurement interval. The dataset spans from 2013 to 2020, covering a total of 8 years.

680 observations. This database is divided into two parts: a training set (inference data) of the first 38,144 univariate observations, and a test set of the remaining observations.

Description of each data feature

• Date/Time: the date and time of the measurement
• Water Level: the water level

FIGURE 3.1: Visualization of the data before cleaning

The data we are working with is not normalized to a consistent range of values, an important characteristic that must be addressed before the training phase of our learning algorithm. Data preprocessing is therefore a crucial step in our analysis.

Data formatting

Normalization is a technique commonly used in data preparation for machine learning, also referred to as "normalization to zero mean and unit variance", first introduced by Goldin and Kanellakis (1995). This process transforms all elements of the input vector into an output vector with a mean close to 0 and a standard deviation near 1. The formula for this transformation is provided below.
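This is the standard z-normalization: writing $x_i$ for the original values, $\mu$ for the series mean, and $\sigma$ for its standard deviation, the transform is

```latex
z_i = \frac{x_i - \mu}{\sigma}
```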

where the mean of the time series is subtracted from the original values and the difference is divided by the standard deviation.

The purpose of normalization is to adjust the values of numerical columns in a dataset to a common scale without distorting the differences in value ranges or losing information. Normalization is also essential for certain algorithms to model the data accurately.
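As a minimal sketch (not the project's actual preprocessing script, and using a hypothetical series), z-normalization and its inverse can be implemented with NumPy:

```python
import numpy as np

def z_normalize(series):
    """Scale a series to zero mean and unit variance (z-normalization)."""
    mean = series.mean()
    std = series.std()
    return (series - mean) / std, mean, std

def denormalize(z, mean, std):
    """Invert z-normalization to recover values in the original units."""
    return z * std + mean

levels = np.array([1.2, 1.5, 1.1, 1.8, 1.4])  # hypothetical water levels (m)
z, mu, sigma = z_normalize(levels)
restored = denormalize(z, mu, sigma)
```

Keeping `mu` and `sigma` is what later allows the model's predictions to be mapped back to real water levels during the backtest.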

1 https://jmotif.github.io/sax-vsm_site/morea/algorithm/znorm.html

FIGURE 3.3: Visualization of the data after cleaning

Frameworks

The LSTM model was developed and trained using the Keras neural network API, an open-source deep learning library written in Python that uses TensorFlow as its backend. TensorFlow is itself an open-source framework for machine learning computations.

Keras's short learning curve and its straightforward implementation of deep learning models made it an excellent tool for this project.

LSTM structure

Input

The problem formulation involves predicting the water level for a day based on previous readings, which requires transforming the task into a supervised learning problem. A key hyperparameter for the model is the number of time lags, which determines how many past time steps are used for feature engineering. The LSTM network expects the input data (X) to be structured in a specific array format: [samples, timesteps, features]. Here, 'features' is the number of variables to be predicted, 'timesteps' is the length of the input window, and 'samples' is the number of occurrences in the input sequence.

Note that for each dataset, we extract several sub-problems of size T using sliding windows over the collected data. The window size is 24.

The WestoverT dataset has been divided into windows of 24 timesteps to simplify output calculations and to match the input format expected by our Keras framework.
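The shaping step above can be sketched as follows (using a stand-in series; the real pipeline may differ): past observations are turned into windows of 24 timesteps and reshaped to the [samples, timesteps, features] format Keras expects.

```python
import numpy as np

def make_windows(series, timesteps=24):
    """Slide a window over a 1-D series to build (X, y) pairs:
    X has shape [samples, timesteps, features]; y holds the next value."""
    X, y = [], []
    for i in range(len(series) - timesteps):
        X.append(series[i:i + timesteps])
        y.append(series[i + timesteps])
    X = np.array(X).reshape(-1, timesteps, 1)  # 1 feature: water level
    return X, np.array(y)

series = np.arange(100.0)  # stand-in for hourly water levels
X, y = make_windows(series)
```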

Layers

When constructing an LSTM model, it is essential to consider the number of hidden layers, the number of LSTM cells in each layer, and the dropout rate. There is no definitive method for selecting the number of hidden layers or cells, as successful models have varied widely; for instance, one model featured 4 layers with 1000 cells (Sutskever et al., 2014), while another had 3 layers with just 2 cells (Gers et al., 2000). The optimal configuration depends on the specific application of the LSTM model, with typical layer counts ranging from 1 to 5, and each layer usually containing the same number of cells.

A dense layer, also known as a fully connected layer, connects every neuron in one layer to every neuron in the subsequent layer (Keras, 2018b). Successful models have demonstrated the effectiveness of dense layers by stacking hidden layers followed by multiple dense layers (Goodfellow et al., 2013).

We chose to design our LSTM model with four layers: two hidden layers and two dense layers. The output of the first hidden layer feeds the second hidden layer, which connects to a dense layer, followed by another dense layer. Dropout is applied after each hidden layer to mitigate the risk of overfitting (Reimers and Gurevych, 2017).
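A minimal sketch of this architecture in Keras follows; the cell counts (50 and 25) and the 0.2 dropout are illustrative placeholders, not the tuned values from the empirical tests.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

timesteps, features = 24, 1  # 24-step input window, univariate water level

model = Sequential([
    # First hidden LSTM layer returns full sequences for the next LSTM layer.
    LSTM(50, return_sequences=True, input_shape=(timesteps, features)),
    Dropout(0.2),  # dropout after each hidden layer, as described above
    # Second hidden LSTM layer returns only its final state.
    LSTM(50),
    Dropout(0.2),
    # Two dense layers; the last one emits the predicted water level.
    Dense(25, activation="relu"),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```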

Optimal hyperparameters

When building the LSTM model, it is crucial to configure and adjust the hyperparameters properly to achieve accurate predictions during backtesting. Inspired by Reimers and Gurevych (2017), we use empirical testing to identify optimal hyperparameters that enhance accuracy while minimizing the risk of overfitting, compared to using the model and data without any empirical testing.

In the empirical testing phase, we build our LSTM model using default hyperparameters that align closely with our case, as identified in various reference articles. We then evaluate each hyperparameter individually to determine its optimal value.

The best value for a hyperparameter is found by evaluating the LSTM model and testing it against the test data.

We calculate the Mean Squared Error (MSE) between the model's predictions of the water level and the actual water level for the day. After evaluating all candidate values for a given hyperparameter, we plot the hyperparameter value on the x-axis and the MSE on the y-axis. This allows us to identify the hyperparameter value that yields the lowest MSE, indicating the most optimal configuration.
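The selection loop can be sketched as follows; `evaluate_mse` is a hypothetical stand-in for training the LSTM with the candidate value and scoring it on the test set.

```python
def evaluate_mse(dropout_rate):
    """Hypothetical stand-in: train the model with this dropout rate and
    return its test MSE. Here, a toy curve with its minimum at 0.2."""
    return (dropout_rate - 0.2) ** 2 + 0.01

candidates = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
scores = {rate: evaluate_mse(rate) for rate in candidates}
best_rate = min(scores, key=scores.get)  # candidate with the lowest MSE
```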

Dropout

Dropout is a crucial technique that mitigates overfitting by randomly selecting cells in a layer, according to a chosen probability, and setting their output to zero (Srivastava et al., 2014). An empirical test was conducted to determine the optimal dropout rate, which was then applied to all hidden layers. We built and trained our LSTM model with dropout set to various values, keeping a constant step between consecutive rates. As shown in Figure 7, the optimal dropout value in our case is 0.2, as it produced the lowest mean squared error (MSE); the dropout is therefore set at 20%. During this empirical test, the number of epochs was fixed at 20.

Number of epochs

An epoch is a complete pass of the training data through the neural network (Kaastra and Boyd, 1996). During this process, the training data is divided into batches, with the batch size set to 32. The first 32 samples (0-31) are extracted and used to train the network, followed by the next 32 samples (32-63), and so on. The epoch continues until all samples have been processed, marking the completion of one full training cycle (Reimers and Gurevych, 2017).
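The batching described above amounts to slicing the training set in steps of 32 (a sketch of the idea; Keras performs this internally when a batch size of 32 is passed to `fit`):

```python
def make_batches(samples, batch_size=32):
    """Split a sequence of training samples into consecutive batches."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

batches = make_batches(list(range(100)))  # 100 toy samples
```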

Number of LSTM cells in each layer

Each hidden layer contains a specific number of LSTM cells, and our goal is to determine the optimal number of cells per layer. Research indicates that one successful model used 250 cells in each hidden layer (Graves et al., 2013).

LSTM loss function

The loss function measures the distance between the LSTM model's output and the desired output during training, guiding the learning process. The desired output is defined by user-specified validation data, which we set at 10% of the training data. This approach helps prevent overfitting by allowing training to be halted: after each epoch, the training output is compared to the validation data, and if the training loss decreases while the validation loss increases, it may indicate overfitting. We selected Mean Squared Error (MSE) as the loss function, as it is widely used for time series forecasting (Makridakis and Hibon, 1991).

LSTM evaluation

Optimizer

In building the LSTM model, we used the Adam optimizer for its strong performance and rapid convergence compared to alternative optimizers, as recommended by Reimers and Gurevych (2017). We set the decay rate to 0.3 when using Adam.

Error computation

We assess the Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) when evaluating the training and testing scores of the LSTM model, where the score is the value of the chosen loss function, in this case MSE. In addition, the Mean Absolute Percentage Error (MAPE) is computed during backtesting to verify that the model forecasts with high accuracy.
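These three metrics can be computed directly with NumPy (a sketch; the small arrays stand in for the model's predictions and the ground truth):

```python
import numpy as np

def mse(actual, predicted):
    """Mean squared error between ground truth and predictions."""
    return np.mean((actual - predicted) ** 2)

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the water level."""
    return np.sqrt(mse(actual, predicted))

def mape(actual, predicted):
    """Mean absolute percentage error; assumes actual values are nonzero."""
    return np.mean(np.abs((actual - predicted) / actual)) * 100

actual = np.array([1.0, 2.0, 4.0])
predicted = np.array([1.0, 2.0, 5.0])
```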

Actual prediction

To evaluate the accuracy of our LSTM model, we make predictions based on historical data to forecast future outcomes. This involves feeding the most recent observation, acting as the final data point of the training set, into the LSTM model. We then generate a prediction for the next water level. We save this predicted water level, referred to as P0, and use it to predict the subsequent water level by feeding P0 back into the LSTM model, producing the prediction P1.

This iterative process continues step by step until we have predicted the desired number of days of data. Once the water level for a given day is forecast, we plot the predicted water level movement over a five-day period alongside the actual water levels for that time. This comparison helps determine whether our model overfitted the data during testing: if the prediction plot diverges significantly from the actual test data, the model has overfitted.
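The recursive loop can be sketched as follows; `predict_next` is a hypothetical stand-in for the trained LSTM (in the real system it would call `model.predict` on the normalized input window):

```python
from collections import deque

def predict_next(window):
    """Hypothetical one-step predictor; the real one is the trained LSTM.
    Here: a toy model that predicts the mean of the window."""
    return sum(window) / len(window)

def recursive_forecast(history, steps, timesteps=24):
    """Feed each prediction back into the input window to forecast ahead."""
    window = deque(history[-timesteps:], maxlen=timesteps)
    predictions = []
    for _ in range(steps):
        p = predict_next(list(window))
        predictions.append(p)
        window.append(p)  # the new prediction becomes the latest input
    return predictions

history = [1.0] * 30  # stand-in for the normalized training series
forecast = recursive_forecast(history, steps=5)
```

Note that errors compound in this scheme: each prediction is built on earlier predictions, which is why the plot against actual levels is a useful overfitting check.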

Architecture

The attention mechanism was originally applied to text translation problems, in particular the translation of English sentences into French (Bahdanau et al., 2014).

That work demonstrated that LSTM networks can effectively address sequential data problems by processing input sequences through LSTM cells to generate a context vector, which is then used for predictions. A significant issue with this method, however, is that the context vector becomes a bottleneck: information from the beginning of the source sequence is diluted by the time it reaches the context vector. This challenge is particularly pronounced in long sequences, as the model struggles to retain the sequential memory of events and therefore to predict outcomes accurately from extended sequences.

Given the size of each recording in the database, as illustrated in figure ??, each LSTM cell processes an input of 136. To understand which part of a recording is likely to follow another, it is essential to use information from various points in the sequence rather than focusing solely on the most recent data. This means paying particular attention to the network's hidden state at the parts of the sequence deemed critical for the series. By retaining only these significant sections, we can build a context vector that highlights the discriminative elements of the series through the weights of its components. To address this challenge, we propose an attention model.

We follow the attention mechanism proposed by Bahdanau et al. (2014).

FIGURE 3.4: Architecture of the solution

Algorithm

Network input

The LSTM block in the proposed architecture (Figure 3.4) processes the input as a univariate time series with a timestep equal to the length of the series (sample). Each observation is fed in the format [series, timesteps, feature].

Hyperparameters and training parameters

Selecting hyperparameters is crucial for the performance of machine learning models, as they can significantly impact training time, memory usage, and model generalization on held-out samples. Standard methods for hyperparameter selection include manual and automated approaches; automatic selection typically involves a grid search or random search over many combinations. In this project, given the large number of hyperparameters and the time-intensive nature of training the networks, automated tuning was impractical. We therefore used a stepwise manual selection of hyperparameters, which also gave a better understanding of how these parameters influence network performance. The procedure was straightforward, starting from basic parameter sets drawn from the relevant literature.

Normalization: batch normalization
Learning rate: α = 0.003
Optimizer: Adam (β1 = 0.5, β2 = 0.999)
Loss: mean squared error

In comparing various techniques for time series prediction, we explored methods from different domains. It is essential to analyze both the quality and the quantity of the results to strengthen the accuracy of our conclusions.

Performance measures

Evaluating a model is often much more delicate. Many performance measures are available, but we used the following: MAE, MSE, RMSE.

Root Mean Square Error (RMSE)

For regression tasks, Amazon ML employs the Root Mean Square Error (RMSE) metric, an industry standard. RMSE measures the distance between the predicted numerical targets and the actual numerical responses (ground truth). A smaller RMSE value indicates a more accurate model, with a perfect model achieving an RMSE of 0. For evaluation data containing N records, RMSE is computed over all N prediction errors.
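The standard RMSE definition over $N$ records, writing $y_i$ for the actual value and $\hat{y}_i$ for the prediction, is:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}
```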

Denormalized prediction on the test data

We denormalize the predicted water levels obtained during the backtest with our LSTM, GRU, and Bi-Directional LSTM models. This step is necessary because, as mentioned earlier, all input values were normalized for the LSTM model. Figures 4.2, 4.3, and 4.4 show the water level predictions (red line) made by the LSTM, GRU, and Bi-LSTM models against the actual water level (green line).

The value of such a visualization is in comparing how the water level behaves across the predictions of the different models.

FIGURE 4.2: LSTM model water level prediction

FIGURE 4.3: GRU model water level prediction

FIGURE 4.4: Bi-directional LSTM model water level prediction

FIGURE 4.5: Comparison of water level predictions across models

In this visualization format we obtain the preceding results (Figure 4.5), which at first glance appear very close to the levels observed over the data period.

Test score

Figure 4.6 shows the mean MAE, MSE, and RMSE values obtained after the test evaluation.

FIGURE 4.6: Comparison of test MSE, MAE, and RMSE for the LSTM, GRU, and Bi-directional models

General conclusion

This chapter presents conclusions based on the work as a whole and on the results of Chapter 4, together with the limitations of the framework and possible future work.

This work aims to provide a solution that incorporates an LSTM network to address the challenge of predicting water levels in time series data.

After conducting a comprehensive review of regression algorithms, ranging from traditional methods to the latest neural network techniques, we performed a comparative analysis of the most effective approaches in the current literature. Based on this analysis, we selected a reasoned strategy that we believe is suitable for addressing our specific problem.

The limitations of our work are mainly tied to the lack of data available to conduct our study.

Perspectives

Our work is part of a feasibility project aimed at assessing the possibilities of applying it in the real world.

As a perspective, we will continue working on the problem of temporal sequences (sequences of events), still using the LSTM network.

The challenge of establishing the structure of a neural predictor, including the number of layers, neurons per layer, activation function shape, input count, and delays, as well as controlling parameters such as the learning rate and inertia, significantly impacts the effectiveness of neural networks. The selection of these parameters must be based on an understanding of the time series' behavior and on the user's expertise. Consequently, predictive performance is heavily influenced by these choices, often resulting in suboptimal prediction outcomes. This work aims to develop an optimal neural predictor capable of forecasting future values of a time series with minimal performance indicators, such as MSE and R².

Despite satisfactory results across various learning algorithms in terms of prediction error, this method, like many others in the literature, has a significant drawback: the number of data points used during the learning phase affects both the optimization speed and the quality of predictions. Increasing the amount of data in the learning phase enhances prediction quality but slows down prediction.

The results indicate that the LSTM neural network model, utilizing the sigmoid activation function and an architecture of (4-9-1), demonstrates optimal performance in predicting water levels.

Alfred, R. (2016). The rise of machine learning for big data analytics. In 2016 2nd International Conference on Science in Information Technology (ICSITech), pages 1–1.

Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Bing Hu, Y. C. and Keogh, E. (2013). Time series classification under more realistic assumptions. In Proceedings of the 2013 SIAM International Conference on Data Mining, pages 578–586.

César, E. and Richard, B. (2006). Les séries temporelles. Mars.

Charpentier, A. (2006). Cours de séries temporelles : théorie et applications. Université Paris Dauphine.

Eamonn Keogh, Kaushik Chakrabarti, M. P. and Mehrotra, S. (2001). Locally adaptive dimensionality reduction for indexing large time series databases. 30(2):151–162.

Goldin, D. Q. and Kanellakis, P. C. (1995). On similarity queries for time-series data: constraint specification and implementation. In International Conference on Principles and Practice of Constraint Programming, pages 137–153. Springer.

Gooijer, J. G. D. and Hyndman, R. J. (2006). 25 years of time series forecasting. International Journal of Forecasting, 22(3):443–473.

Gunopulos, D. and Das, G. (2001). Time series similarity measures and time series indexing. ACM SIGMOD Record, 30:624.

Guralnik, V. and Srivastava, J. (2019). Event detection from time series data. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 33–42.
