Introduction
Global population is projected to reach about 8 billion by 2023, according to the World Population Prospects, and this growth drives rapid urbanization as people move to cities in search of better living conditions, jobs, services and entertainment. This urban expansion brings social, economic and environmental costs, including air and water pollution, toxic waste disposal, and declines in public health and safety, while also intensifying traffic congestion that increases transport needs, makes travel less efficient, and lowers productivity. Analyses highlight three core solutions to the congestion challenge: developing and using transport alternatives, upgrading and expanding infrastructure, and smarter traffic management. Widespread adoption of alternative transport modes requires substantial government investment, and geographic constraints can limit infrastructure expansion. As a result, traffic management has become a key focus, with the rise of intelligent and smart systems and the development of intelligent transportation systems (ITS) to manage traffic in multiple ways. Within ITS, traffic forecasting—measuring, modeling and predicting real-time traffic conditions—provides crucial insights for congestion mitigation, accurate arrival estimates and optimal routing, benefiting both everyday travel and the planning of new roads.
The need for traffic forecasting is even more urgent in Vietnam. In particular, Ho Chi Minh City faces annual traffic-jam losses estimated at up to $6 billion, highlighting the drag on Vietnam's economy. According to Statista, Vietnam's urbanization rate rose from 31.08% in 2011 to 38.05% in 2021, an increase of nearly 7 percentage points over a decade, while Japan's urbanization remained high, around 91%, during the same period. This rapid urbanization has fueled a booming vehicle market, with nearly 3 million motorbikes purchased each year. Although authorities have implemented improvements, most efforts focus on expanding transport alternatives and infrastructure, which helps ease congestion but also makes traffic flow management a pressing challenge—and it underscores the growing need for accurate traffic forecasting in Vietnam.
Related work
Traffic prediction is a time series forecasting task based on a chronologically ordered sequence of data points. Traditional methods such as ARIMA, SARIMA, VAR, and ARCH rely on assumed linear relationships between features and the target, with parameters estimated by ordinary least squares to uncover underlying patterns and dependencies. These OLS-based approaches can struggle in high-dimensional settings, face multicollinearity when predictors are correlated, and may miss non-linear relationships or effective feature selection. To address some of these challenges, factor models such as dynamic factor models offer dimensionality reduction to handle high dimensionality and multicollinearity, but they still have difficulty capturing non-linear dependencies and identifying the most relevant features for accurate predictions.
Machine learning has the potential to overcome the limitations of traditional forecasting models by leveraging data-driven patterns rather than predefined equations. Unlike traditional approaches, most modern ML models do not assume explicit relationships between independent and dependent variables, allowing them to uncover nonlinear and complex interactions in the data. This shift toward flexible, data-driven modeling is supported by surveys conducted by researchers across industries, which show growing interest in applying ML techniques to improve forecast accuracy and adapt to changing conditions.
Classical machine learning models such as Support Vector Machines, Random Forests, and boosting techniques have long been applied to traffic flow prediction (Tedjopurnomo et al. [6], Yuan et al. [3], Yin et al. [2]). Yet, surveys indicate that deep learning has recently gained substantial attention in this domain. In fact, a study by Manibardo et al. [5] shows that over the three years 2018, 2019, and 2020 there was a substantial rise in deep learning–based traffic flow prediction research—216, 370, and 368 papers respectively—as indexed by the SCOPUS database.
Deep learning, a subset of machine learning, has drawn substantial attention in recent years for its ability to learn and extract complex patterns from data. A key advantage is its capacity to automatically discover hierarchical representations and feature representations directly from raw data, eliminating the need for manual feature engineering. This capability is realized through artificial neural networks—interconnected layers of neurons that mimic the structure of the human brain.
Deep learning models, especially neural networks, exhibit universal approximation capabilities, meaning they can approximate any function given enough parameters and training data. This versatility makes them highly flexible and capable of capturing intricate relationships within data, a strength particularly valuable for complex tasks such as traffic flow prediction.
Deep learning often assumes independent and identically distributed (IID) training data, but real-world datasets frequently exhibit dependencies and temporal correlations driven by factors such as traffic patterns, weather conditions, or surrounding environmental events. Capturing these dependencies is a central challenge for model accuracy and generalization. To address this, researchers are developing approaches that inject temporal information into models, including recurrent neural networks (RNNs) for sequence data and graph neural networks (GNNs) for relational structure, enabling the capture of both sequential and relational dependencies in the data.
Graph Neural Networks (GNNs) address the limitations of traditional deep learning models—CNNs, LSTMs, and Transformers—by effectively capturing interdependencies among data points. Their versatility spans multiple domains, with notable success in traffic flow prediction as illustrated by Zhou et al. [7]. Additional evidence from Jiang and Luo [4] highlights successful traffic forecasting applications of GNNs. In traffic networks, the flow at one location is shaped by the flows at neighboring, connected locations; by modeling traffic data as a graph, GNNs can exploit this structure to achieve more accurate predictions than other deep learning models.
Graph neural networks (GNNs) excel at capturing relational dependencies by propagating information across a graph. Through iterative aggregation and updates from neighboring nodes, GNNs develop rich representations that encode the underlying relationships in the data, making them particularly well-suited for structured data tasks such as traffic networks.
Graph neural networks (GNNs) improve traffic flow prediction by integrating historical data from a given location with flow patterns from neighboring locations, capturing spatial correlations and propagation effects across the traffic network. This approach delivers more accurate, context-aware forecasts that reflect the dynamic interplay among regions and the network-wide interactions driving congestion and flow.
As graph size grows, the computational cost of graph neural networks (GNNs) rises, creating memory and computation-time bottlenecks that hinder performance. While parallelization and distributed training have been explored to alleviate these constraints, the gains are limited and the overall cost remains high. This project proposes a resource-efficient solution that requires fewer computational resources than GNNs while still capturing the interdependencies among traffic flows at interconnected locations.
Additionally, our approach leverages state-of-the-art deep learning techniques that are suitable for traffic flow prediction.
Problem statement
This thesis focuses on predicting urban traffic by integrating Graph Neural Networks (GNNs) into my architectural design, with the central aim of forecasting bus speeds across the city. By applying Graph Neural Networks to the proposed architecture, the study models the spatiotemporal dependencies among road segments, bus routes, and traffic signals to produce accurate bus speed predictions under varying congestion and time-of-day conditions. The goal is to develop a robust framework for urban traffic prediction that leverages the relational structure of the transportation network to inform transit planning, improve bus reliability, and enhance overall traffic management.
Using satellite data of the same bus group, this study forecasts Ho Chi Minh City traffic in the immediate future, delivering a five-minute-ahead prediction. The dataset originates from 2012, so the goal is to provide insight into the architecture and implementation required to demonstrate the process rather than to assess current traffic conditions. The emphasis is on outlining a practical predictive pipeline—from data preprocessing and feature extraction to model selection and deployment—that showcases how satellite-derived signals can be leveraged for public transit insights and planning.
This dataset contains the coordinates of buses operating in Ho Chi Minh City over an eight-day period from November 9, 2012, 00:00:00 to November 17, 2012, 23:59:59. It is derived from GPS signals captured by devices installed on the buses and received by multiple satellites, providing reliable location tracking across the study area. The buses record their speed data at regular intervals, enabling analyses of urban transit dynamics and mobility patterns during the study window.
Due to delays in data transmission, the track time attribute (trktime) is inconsistent and can appear random, leading to multiple entries being recorded within the same minute and creating gaps in data collection that span minutes or hours. As a result, the dataset is not uniformly distributed and lacks a defined order or arrangement of attributes. The raw data is divided into 67 separate CSV files, each with a different number of rows, and the contents are not sorted or organized by any specific criterion, so records with a track time from a later date can appear before data from earlier dates.
This dataset spans seven days and comprises 21,378,346 rows collected from 919 buses, providing comprehensive GPS and telemetry data. It logs each record with deviceid, latitude, longitude, speed, satellite, lock, and trktime, enabling precise tracking and analysis of bus movements. Notably, the data covers both active operation and rest periods at stations, offering a complete view of bus activity and downtime for transit analytics.
Figure 5. A sample of the raw dataset (columns: deviceid, latitude, longitude, speed, satellite, lock, trktime).
Figure 6. A sample of the dataset after converting to columns.
Solution
To solve the problem, I adopted an implementation from a related study [23], which serves as the main reference for this thesis. The cited research provides a complete model layer that I will integrate into my implementation, allowing me to focus on preprocessing and evaluation as the core remaining tasks. I also leverage another implementation on the same topic to streamline development and align with established methodologies. The theoretical solution in my thesis is built around reusing these proven components to accelerate progress, while applying careful preprocessing and rigorous evaluation to validate the model's performance.
Preprocessing aims to obtain the adjacency matrix and speed arrays by systematically organizing the data. The process starts by grouping data by time intervals and ID to capture temporal and identity-based relationships. Next, the data map is partitioned into multiple parts, creating manageable segments. From these data partitions, the adjacency matrix is constructed, representing the connections among entities. Finally, the adjacency matrix is used to derive the speed arrays, completing the preprocessing workflow. This approach yields a structured foundation for downstream analysis and modeling.
In the model building step, the goal is to fit the data into an already built Graph Neural Network (GNN) model. This involves defining the Graph Neural Network (GNN) layer, then the Graph Neural Network plus Long Short-Term Memory (GNN-LSTM) layer to capture spatiotemporal dependencies, specifying the hyperparameters along with the model inputs and outputs, and finally compiling and fitting the model to the data.
Model evaluation involves benchmarking the model against a baseline and probing its behavior when key parameters are varied during both the processing and model-building stages. Specifically, we compare the model to its baseline to quantify gains or losses in performance, and we test it with different parameter inputs to understand sensitivity, stability, and robustness across configurations.
This solution is theoretical, so implementation may reveal challenges, such as incomplete bus data where the bus does not consistently send back information at every interval, complicating data processing. It also raises questions about how to handle the gaps where data is missing. Detailed remedies and methods for addressing these issues are described in the Methodology section (Section 4), while the effectiveness and results of the proposed approach are presented in the Evaluation section (Section 5).
Preprocessing
The dataset consists of these attributes:
• Device ID (deviceid in the dataset)
• Latitude (latitude in the dataset)
• Longitude (longitude in the dataset)
• Speed (speed in the dataset)
• Satellite (satellite in the dataset)
• Lock (lock in the dataset)
• Track time (trktime in the dataset)
As the satellite and lock attributes describe the satellite signal itself and do not affect the other attributes, the two are removed from the dataset.
As seen in Figures 5 and 6 of Section 3.1, bus data are collected in no particular order and must be normalized or grouped to extract meaningful information. Because the goal is to predict five minutes ahead, the data are grouped by vehicle ID and track time (trktime); within each 5-minute window, records sharing the same ID and time interval are averaged for longitude, latitude, and speed. There are 2303 five-minute intervals in total, so each bus can yield up to 2303 rows after grouping. The resulting dataset is a time-windowed, ID-based representation suitable for predictive modeling.
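To make this grouping step concrete, here is a minimal pandas sketch of the idea; the file name, column handling, and use of `floor` on the track time are illustrative assumptions rather than the thesis code.

```python
import pandas as pd

# Load one of the 67 raw CSV files (the file name here is illustrative).
df = pd.read_csv(
    "bus_raw_part_01.csv",
    usecols=["deviceid", "latitude", "longitude", "speed", "trktime"],
    parse_dates=["trktime"],
)

# Floor each track time to its 5-minute window, then average the readings
# that fall into the same (bus, window) pair.
df["interval"] = df["trktime"].dt.floor("5min")
grouped = (
    df.groupby(["deviceid", "interval"], as_index=False)[["latitude", "longitude", "speed"]]
    .mean()
    .sort_values(["deviceid", "interval"])
)
```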
Figure 7. A sample of the dataset with ID 109 after grouping. This bus has 2303 rows, meaning it has data for every 5-minute interval.
However, in some cases, data will be missing for some intervals, resulting in the bus not having the required number of rows:
Figure 8. A sample of the dataset with ID 209 after grouping. This bus is missing some rows due to the lack of data.
Sometimes the missing row could be in the middle of the period (e.g., missing data from 9 November 2012, 9:00:00 to 10:00:00), or the interval could be incomplete because the start or end timestamp is missing. To address these missing timestamps in the time-series data, I introduced a new attribute called next_time, which holds the next row's track time. This next_time reference helps preserve continuity across time intervals, improves data integrity for interval analyses, and makes it easier to handle gaps in timestamped records.
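Continuing the sketch above, the next_time attribute can be produced with a per-bus shift; this is a hypothetical illustration of the idea, not the thesis code.

```python
# Within each bus, next_time holds the following row's track time, so gaps
# larger than one 5-minute interval can be detected; the last row of each
# bus gets NaT, as in Figure 9.
grouped["next_time"] = grouped.groupby("deviceid")["interval"].shift(-1)

# Rows whose successor is more than 5 minutes away mark the start of a gap.
gap_starts = grouped[(grouped["next_time"] - grouped["interval"]) > pd.Timedelta("5min")]
```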
Figure 9 shows a sample from the dataset with ID 44, where the next_time attribute stores the next track time for the bus, enabling easy sequencing of arrivals For the final timestamp, 16 November 2012 23:55:00, the next_time value is NaT, indicating that there is no subsequent track time.
The next_time attribute is used to scan for intervals that are not five minutes apart and to fill in rows until there are no remaining gaps. Filled rows are assigned a speed value of -1, which differentiates them from rows that genuinely record a speed of 0 (a stopped bus) and marks them as rows that did not exist previously.
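One way to realize this gap filling, assuming the grouped frame from the earlier sketch, is to reindex every bus onto the complete set of 5-minute timestamps; this achieves the same result as scanning next_time and is only a sketch of the idea.

```python
# All 5-minute timestamps observed anywhere in the data.
full_index = pd.date_range(grouped["interval"].min(), grouped["interval"].max(), freq="5min")

def fill_bus(bus_df):
    device = bus_df["deviceid"].iloc[0]
    bus_df = bus_df.set_index("interval").reindex(full_index)
    bus_df["deviceid"] = device
    bus_df["speed"] = bus_df["speed"].fillna(-1)   # -1 marks a previously missing interval
    return bus_df.rename_axis("interval").reset_index()

filled = grouped.groupby("deviceid", group_keys=False).apply(fill_bus)
```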
Figure 10. A sample of the dataset with ID 44. The missing row(s) are filled in so that every interval has data, resulting in every bus having the same number of rows.
Now that all timestamps are filled, an obvious problem arises: the data will contain many zeros for speed. In the referenced study [23], the data collected within each time interval reflects the maximum speed observed, so almost all entries are actual values. This conflicts with my data, where the input speed is predominantly 0 or -1. If I convert -1 to 0, the artificial value -1 for unknown speed would be indistinguishable from a real 0 indicating the bus is stopping. To address this, I decided to apply a specific imputation approach for -1 speed values to insert a meaningful substitute.
Each bus follows a fixed route every day, so missing timestamps are most likely associated with the same route segments. When timestamps are missing, we can estimate their values using the bus speed observed at the same time on other days. If a bus has sparse data, and a valid speed value exists for that position on any day within the period, that value can be applied to all days. If there are multiple speed observations for the same time across different days, we can compute the average speed over the period and fill the others with that average. Note that we should only fill rows that were previously marked as -1; entries that were 0 should remain unchanged.
This approach is applied to every bus, producing multiple speed values at approximately the same location from different buses. The method for resolving these competing values is detailed in the Speed arrays section below. Its primary aim is to fill in the unknown gaps by imputing additional speed data, a step that may introduce some error into later predictions but generally enhances the dataset by adding more input values and repopulating the speed array with meaningful numbers rather than 0 or -1.
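A minimal sketch of this imputation, continuing from the filled frame above: the -1 placeholders are replaced by the same bus's average speed at the same time of day on other days (coordinates could be filled the same way, since each bus follows a fixed route). The helper names are hypothetical.

```python
# Average speed per (bus, time-of-day) over the days that have real readings.
filled["time_of_day"] = filled["interval"].dt.time
observed = filled[filled["speed"] >= 0]
day_profile = observed.groupby(["deviceid", "time_of_day"])["speed"].mean()

def impute(row):
    if row["speed"] != -1:
        return row["speed"]            # real readings (including 0) stay untouched
    key = (row["deviceid"], row["time_of_day"])
    return day_profile.get(key, -1)    # keep -1 if no other day has data here

filled["speed"] = filled.apply(impute, axis=1)
```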
An adjacency matrix is a compact representation used in graph theory and network analysis to describe the connections between nodes. It is a square matrix in which each row and column corresponds to a node, and the entry A_ij indicates whether there is an edge from node i to node j; typically 1 means an edge exists and 0 means no edge. This format simplifies the analysis and manipulation of graphs, enabling efficient computation of key properties such as connectivity, shortest paths, and clustering coefficients. For example, in adjacency matrix A, a connection from i to j is indicated by A_ij = 1.
Below are simple examples of adjacency matrices:
Figure 12 Visualization of a simple undirected graph and its adjacency matrix
Figure 13 Visualization of a simple directed graph and its adjacency matrix
In this thesis, an adjacency matrix capable of fully representing the road network traversed by buses is required. To faithfully model the bus routes, the matrix must treat distinct entities—such as stops, intersections, and road segments—as separate nodes, so that each node uniquely identifies a network element. Edges should reflect permissible transitions between these elements, capturing directionality and connectivity to reveal how the bus system operates. By building a graph that encodes topology, flow, and constraints of the network, the adjacency matrix offers a compact, analyzable representation for network analysis, route optimization, and performance evaluation. A key challenge is choosing the appropriate level of granularity: too coarse a representation risks missing critical transfer points, while too fine a representation can create unnecessary complexity. Therefore, the methodology should specify criteria for node definition, edge weighting, and how to handle multi-modal transfers to ensure the matrix accurately mirrors real-world operations.
To handle continuous coordinate values, I treat the region as a map and discretize it by partitioning the coordinates into a grid, for example 500 by 500 cells, so each cell covers a specific latitude and longitude range (such as latitude 0 to 0.1 and longitude 0 to 0.1). This grid-based discretization confines the problem to a finite set of entities, making computations tractable. However, many grid cells end up unused because they fall outside the actual bus routes, resulting in zeros in both the adjacency matrix and the speed arrays. This situation is a classic sparse matrix problem, where the majority of data represents empty space rather than useful information.
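As an illustration of the discretization, assuming the filled frame from the earlier sketches, each coordinate pair can be mapped to a grid cell index; only the grid-based idea comes from the text, the variable names and exact formula are assumptions.

```python
import numpy as np

N_PARTS = 500  # grid divisions per axis; one of the parameters varied later in the evaluation

lat_min, lat_max = filled["latitude"].min(), filled["latitude"].max()
lon_min, lon_max = filled["longitude"].min(), filled["longitude"].max()

# Map each coordinate pair to a discrete grid cell; rows without coordinates
# (the filled placeholders) simply get a missing cell value.
lat_idx = np.floor((filled["latitude"] - lat_min) / (lat_max - lat_min) * (N_PARTS - 1))
lon_idx = np.floor((filled["longitude"] - lon_min) / (lon_max - lon_min) * (N_PARTS - 1))
filled["cell"] = (lat_idx * N_PARTS + lon_idx).astype("Int64")
```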
Figure 14. Example of a map with one bus route. The amount of unused space in this map is much larger than the route of the bus—the 'useful' space.
To solve the sparse matrix issue, we remove the unvisited coordinate ranges that create gaps. As we loop through the coordinates the bus traverses, we assign an identity to each new coordinate range and use these identities as the nodes of the adjacency matrix. For the edges, we maintain an edge list of two-element pairs [source, destination], reflecting movements between coordinate ranges. Because the graph is directed, the source and destination are not interchangeable. We set the corresponding adjacency matrix entry to 1 whenever there is at least one recorded movement from a source range to a destination range.
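A sketch of this construction, continuing from the grid cells above: only visited cells become nodes, and a directed edge is recorded for each observed movement between consecutive intervals. Variable names are illustrative.

```python
# Keep only grid cells a bus actually visits, give each a compact node id,
# and add a directed edge whenever a bus moves from one cell to another
# between consecutive 5-minute intervals.
visited = filled.dropna(subset=["cell"]).sort_values(["deviceid", "interval"])
node_ids = {cell: i for i, cell in enumerate(visited["cell"].unique())}
n_nodes = len(node_ids)

adjacency = np.zeros((n_nodes, n_nodes), dtype=np.int8)
edge_list = []
for _, bus in visited.groupby("deviceid"):
    cells = bus["cell"].map(node_ids).to_numpy()
    for src, dst in zip(cells[:-1], cells[1:]):
        if src != dst and adjacency[src, dst] == 0:
            adjacency[src, dst] = 1
            edge_list.append([src, dst])
```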
With the adjacency matrix in hand, I populate the speed arrays that feed the model. The speed array is a two-dimensional matrix that stores the speed of every node at five-minute intervals. To capture temporal dynamics, the first dimension represents time, so each row corresponds to a successive 5-minute snapshot, and the second dimension indexes the nodes. Consequently, there is one row per 5-minute interval and one column per node, providing a complete time-series input that the model can learn from.
The dataset contains 2304 rows, each representing a 5-minute interval over eight days. The second dimension captures speed values, with length equal to the number of nodes in the adjacency matrix. The speed of a node is computed by averaging the speeds of all buses operating within the same geographic area for that interval, aggregating multiple readings when they occur at the same time. This approach helps mitigate occasional missing values and noise from individual buses, yielding a robust speed feature for each node in downstream analysis.
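The speed array can then be assembled as follows; this is a sketch under the assumptions of the previous snippets, with cells that no bus visits in a given interval left at 0 and any remaining -1 placeholders handled separately (see the evaluation section).

```python
# Build the speed array: one row per 5-minute interval, one column per node,
# averaging all buses that were inside the same cell during that interval.
time_codes, timestamps = pd.factorize(filled["interval"], sort=True)
filled["t"] = time_codes

speed_sum = np.zeros((len(timestamps), n_nodes), dtype=np.float32)
counts = np.zeros_like(speed_sum)

for _, row in filled.dropna(subset=["cell"]).iterrows():
    t, n = int(row["t"]), node_ids[row["cell"]]
    speed_sum[t, n] += row["speed"]
    counts[t, n] += 1

speed_array = np.divide(speed_sum, counts, out=np.zeros_like(speed_sum), where=counts > 0)
```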
Model building
Following data preprocessing, I obtained the adjacency matrix and the speed arrays, both of which are used as inputs for the model. The thesis employs a GNN + LSTM model, a spatiotemporal architecture that combines graph neural networks and long short-term memory networks to capture both spatial relationships and temporal dynamics.
This section begins with an overview of the basic concepts of Graph Neural Networks (GNNs) and Long Short-Term Memory networks (LSTMs), then explains how their combination in a GNN+LSTM architecture leverages both graph-structured information and temporal dynamics, and finally outlines the parameters used for model setup, compilation, and fitting, including the learning rate, batch size, number of epochs, optimizer, loss function, activation functions, regularization, and data preprocessing.
Graph Neural Network (GNN)
Graph theory studies graphs, fundamental mathematical structures consisting of nodes (vertices) and edges that connect them. Nodes represent the objects or entities under study, while edges depict the relationships or interactions between these nodes. This framework enables powerful analysis of complex systems such as social networks, transportation networks, and biological networks, helping us model how entities interact and uncover important properties and patterns within the system. Graphs can be directed or undirected, depending on whether the connections have a direction; undirected graphs treat relationships as bidirectional, while directed graphs capture asymmetric relations.
An undirected graph is a graph in which edges have no direction, meaning the connections between nodes are bidirectional. If there is an edge between nodes A and B, node A is connected to node B and node B is connected to node A, reflecting a symmetric relationship. Undirected graphs are ideal for representing mutual or symmetric relationships, such as friendships in a social network or connections in a transportation network, where the relationship works the same in both directions.
A directed graph, also called a digraph, is a type of graph in which edges have a specific direction. In a directed graph, connections between nodes are one-way, with an arrow indicating the direction of each edge. If there is a directed edge from node A to node B, it means there is a one-way connection from A to B, enabling traversal from A to B even if the reverse path from B to A does not exist unless a separate edge is present. These directional edges make digraphs ideal for modeling asymmetric relationships such as workflows, web link structures, or citation networks.
In a directed graph, the relationship between two nodes is not necessarily bidirectional; an edge from A to B does not imply an edge from B to A. This asymmetry makes directed graphs well-suited for representing directional relationships, such as the flow of information in a communication network or the hierarchical structure of a web page's linking pattern.
In both directed and undirected graphs, the nodes represent entities or elements, while the edges represent the connections or relationships between those entities.
Figure 7. Examples of a directed graph (left) and an undirected graph (right).
b) Graph Neural Network:
Graph Neural Networks (GNNs) have emerged as a powerful framework for learning from graph-structured data. Unlike traditional neural networks that operate on grid-like data, GNNs are designed to handle the complexities of graph data, capturing the relationships and interactions among entities across diverse domains. They exploit the inherent connectivity of graphs, where nodes represent entities such as users, molecules, or web pages, and edges indicate the relationships or connections between them. By aggregating information from a node's neighbors, GNNs learn rich representations that reflect both local structure and global network patterns, enabling tasks like social network analysis, molecular property prediction, and website ranking.
Graph neural networks (GNNs) excel by propagating and aggregating information across a graph, updating node representations iteratively with data from neighboring nodes. This process captures both local and global structural patterns, enabling expressive node embeddings that encode rich graph-level semantics. In social networks, such representations reveal social connections and community structures, making tasks like node classification, link prediction, and personalized recommendations more effective.
Graph Neural Networks (GNNs) excel at capturing complex dependencies and interactions among nodes by propagating information across the graph, updating node, edge, and global-context features while preserving permutation invariance. They employ a graph-in, graph-out architecture: input graphs with initial node, edge, and global features are progressively transformed into enriched embeddings without changing the graph's connectivity. As information propagates, each node aggregates knowledge from its neighbors, refining its representation based on both local neighborhood structures and the overall global context. This aggregation is implemented through trainable message-passing functions, where each node updates its embedding by considering the features and relationships of its neighbors.
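To ground the idea, here is a deliberately simplified, hypothetical message-passing step using plain NumPy (mean aggregation plus a learned linear transform); real GNN layers add normalization, nonlinearities, and edge features on top of this.

```python
import numpy as np

def gnn_layer(node_feats, adjacency, weight):
    """One simplified message-passing step: mean-aggregate the features of each
    node's neighbours, concatenate them with the node's own features, and apply
    a learned linear transform (the trainable part of the layer)."""
    deg = adjacency.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1                                # isolated nodes keep only their own features
    aggregated = (adjacency @ node_feats) / deg      # mean over each node's neighbours
    combined = np.concatenate([node_feats, aggregated], axis=1)
    return combined @ weight                         # shape: (n_nodes, out_features)
```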
Graph Neural Networks (GNNs) have demonstrated remarkable performance across diverse domains, including social network analysis, recommendation systems, bioinformatics, and chemistry. In social network analysis, GNNs uncover hidden community structures, identify influential nodes, and predict missing links. In recommendation systems, GNNs leverage user–item interaction graphs to deliver personalized recommendations. In bioinformatics, GNNs predict protein structures and classify molecular compounds based on their graph representations. Furthermore, GNNs have been applied in chemistry to model molecular properties, predict chemical reactions, and design new drugs.
A basic GNN structure would require 3 steps:
GNN input requirements consist of node features, the graph's edges, and the edge weights. Node features are built from each node's attributes; for a person node, attributes such as name, age, and occupation can be converted into a fixed-dimension feature vector. These feature vectors are stored in a node array with consistent dimensions, which serves as the primary input to the Graph Neural Network, alongside the edge connections and their weights.
Edges are stored in an array where each item represents a single edge. Each edge item is a two-item array [source, destination] that specifies the source node and the destination node. For undirected graphs, where edges have no inherent direction, adding an edge adds both [source, destination] and [destination, source] entries to represent the bidirectional connection.
If a graph contains edge weights, they are stored in a separate weight array that aligns with the edge indices, enabling each weight to correspond to its respective edge. This edge weight array is then included as part of the GNN input.
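A toy example of these three inputs (the values are made up purely for illustration):

```python
import numpy as np

# A small undirected graph with three nodes and two edges (0-1 and 1-2).
node_features = np.array([[0.3], [0.7], [0.1]], dtype=np.float32)  # one feature vector per node

# Each undirected edge appears twice, once per direction.
edges = np.array([[0, 1], [1, 0],
                  [1, 2], [2, 1]], dtype=np.int64)

# One weight per entry in the edge list, aligned by index with `edges`.
edge_weights = np.ones(len(edges), dtype=np.float32)
```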
In Graph Neural Networks (GNNs), initializing all inputs—node features, edges, and edge weights—is typically treated as the first or initialization layer. This foundational layer sets the stage for subsequent layers to perform computations and update node representations according to the graph's structure and connectivity.
Providing these inputs to a Graph Neural Network (GNN) enables efficient processing of graph-structured data. The model then performs message passing between nodes and learns meaningful representations that capture the underlying patterns and relationships within the graph.
This message-passing step is repeated as many times as there are update layers in the structure. In each update layer, all graph information is updated in parallel, with every update's arguments drawn from the previous layer.
Figure 8 [10] illustrates how node features in graph neural networks are updated at each layer through the aggregation of neighboring features. The boxes shown in the figure represent the hidden neural networks responsible for these layer-wise update computations.
Long Short-Term Memory (LSTM)
Recurrent Neural Networks (RNNs) are a specialized type of artificial neural network designed to efficiently process sequential data. Unlike traditional feedforward networks, RNNs use feedback connections that let them retain information from previous time steps and capture temporal dependencies in the data, enabling context-aware processing across sequences.
Recurrent neural networks (RNNs) are especially well-suited for sequential data tasks such as natural language processing, speech recognition, machine translation, handwriting recognition, and time series analysis. In these domains, preserving the order and context of data is crucial for accurate understanding and predictions. By maintaining the history of inputs and the flow of information over time, RNNs excel at modeling and interpreting sequences.
Recurrent neural networks (RNNs) maintain an internal memory, the hidden state or cell state, which encodes information from past inputs. This memory lets the RNN capture dependencies and patterns across a sequence, enabling context-aware processing. As the RNN processes each input, it updates the hidden state based on the current input and the previous hidden state, producing a concise summary of the sequence history. That updated hidden state then influences the network's prediction or output at each time step, allowing the model to reflect temporal dynamics in its results.
Consider a deep network arranged in three layers as in Figure 10(a), where each layer has its own weights (w1, w2, w3) and biases (b1, b2, b3) that differ in dimension, making the layers independent and unable to memorize previous outputs. In a recurrent neural network (RNN), the same weights and biases are shared across all layers, so the output from one layer becomes the input to the next, rendering the layers compatible. If these layers use identical weights and biases, one can view the architecture as looping a fixed number of times before producing the final output, resulting in an RNN with three hidden layers effectively collapsed into a single loop that runs three times, as in Figure 10(b). The RNN formulation is usually expressed in terms of the hidden state (for example, h_t = f(h_{t-1}, x_t)), which abstracts away the explicit mechanics of the state update—a topic we will explore in more depth in the following section.
(a) An example deep neural network where every layer, with different inputs and outputs, runs once
(b) The same network in RNN format, where a single layer loops n times
Figure 10. A neural network in standard and RNN format.
b) Long Short-Term Memory
Long Short-Term Memory (LSTM) is one of the best-performing algorithms for time-series forecasting. It overcomes a key limitation of traditional recurrent neural networks by preserving information across longer sequences, effectively addressing the vanishing gradient problem that hinders learning of long-term dependencies. Unlike standard RNNs, LSTM uses gating mechanisms—input, forget, and output gates—that regulate the flow of information and enable selective memory retention. This architecture improves forecasting accuracy in complex temporal data, making LSTM a preferred choice for modeling time series where long-range patterns matter.
Standard RNNs are gap-dependent: if the gap between two states is too large (e.g., between an early state and a much later one), the network struggles to carry the relevant information across. Traditional recurrent neural networks (RNNs) therefore often fail to connect information across long sequences, limiting their ability to learn long-term dependencies. Long short-term memory networks (LSTMs) are a specialized type of RNN designed to capture these long-range dependencies. A key advantage of LSTMs is their ability to bridge information across long gaps, enabling them to handle longer input sequences, and as the data size grows, LSTM models tend to perform better [17].
Figure 11 [22]. An example of a standard LSTM module/cell with the symbol notation.
An LSTM cell comprises four key parts: the forget gate, the input gate, the cell state update, and the output. There are two main activation functions used: the sigmoid (σ) and the tanh functions. The sigmoid maps inputs to the range 0 to 1 and acts as a gate that decides how much information to pass through, effectively filtering important information from the unimportant. The tanh function scales inputs to the range -1 to 1 and is used for generating candidate values and biasing the input contributions within the input gate, with this bias being trainable, similar to biases in GNNs. Together, these activations and gates control the flow of information, determine how the cell state is updated, and shape the LSTM's output.
The forget gate determines which information needs attention and which does not by looking at the two inputs (x_t, h_{t-1}) and outputting a series of numbers between 0 and 1 whose length matches the length of the cell state C_{t-1}. Each number determines the 'keep' rate of the corresponding piece of information, with 1 meaning 'completely keep' and 0 meaning 'completely forget'.
The input gate determines what new information to store, also by looking at the two inputs and outputting a series of numbers between 0 and 1, similar to the forget gate. In this case, 1 means the corresponding new information is completely added and 0 means it is treated as not important and ignored.
Figure 11(c) LSTM cell update state
The LSTM cell state update finalizes memory processing after the forget gate discards unnecessary data and the input gate adds new information. The updated state combines the remaining portion of the previous cell state with the newly added values, yielding a refined cell state that carries forward the relevant information.
Two factors determine the cell's output: the inputs and the current cell state. This mechanism works like another gate, but here the inputs are used to filter the data stored in the current cell state. As a result, the current state's information is selectively passed through the output gate to form the output, the hidden state h_t, which is used at this step and in downstream computations. Whenever data from this state is accessed, a copy of the output (the h_t) is made available for subsequent processing.
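For reference, the four parts described above are commonly written as the following standard equations (notation as in Figure 11; ⊙ denotes element-wise multiplication):

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{forget gate} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{input gate} \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) && \text{candidate values} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{cell state update} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{hidden state (output)}
\end{aligned}
```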
GNN + LSTM Model
Spatiotemporal prediction uses models that combine graph neural networks (GNNs) and recurrent networks like LSTM to capture both spatial and temporal dependencies. In spatiotemporal modeling, part of the architecture captures spatial information while another part handles temporal dynamics, either by fusing two algorithms (for example, a Graph Convolutional Network followed by an LSTM) or by designing a single architecture that links multiple values across space and time. When using a GNN + LSTM approach, the model first extracts relational features from graph-structured data with the GNN, then feeds these features into an LSTM to analyze them across multiple timestamps and predict future states.
Next, I will describe conceptually the model initialization process, including the scale and dimensions of the inputs and outputs as well as the model parameters.
First, I divide the speed array, which is the model input, into train, validate and test data:
- Train data (50%) is the part of the data fed into the neural network so that the network can learn its parameters
- Validation data (30%) is the part of the data used during training to evaluate the model and provide feedback on how well it generalizes
- Test data (20%) is the part of the data used after the model is completed to check the metrics, such as accuracy, loss value, mean squared error…
The dataset contains 2304 total rows, partitioned into 1152 rows for training, 691 rows for validation, and 460 rows for testing. The second data dimension matches the speed array, representing the number of nodes defined in the setup.
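The split itself is a simple slicing of the speed array along the time axis; a minimal sketch:

```python
n_total = speed_array.shape[0]      # 2304 five-minute intervals
n_train = int(n_total * 0.5)        # 1152 rows for training
n_val = int(n_total * 0.3)          # 691 rows for validation

train_data = speed_array[:n_train]
val_data = speed_array[n_train:n_train + n_val]
test_data = speed_array[n_train + n_val:]   # the remaining ~20% for testing
```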
Creating a Graph Neural Network (GNN) layer requires integrating the adjacency matrix with the input and output feature sets, as well as selecting an effective aggregation, combination, and activation strategy. The layer's input feature count is set to 1, representing the single feature in use—the values of the speed array. Through this setup, the GNN layer aggregates neighboring information, fuses it with the node's own feature via the combination step, and applies the activation function to produce the resulting output features.
Within the configuration, the GNN layer outputs a feature vector whose length is configurable; in this setup it is set to 10 features. This output serves as input to the next layer, the LSTM. The aggregation and combination parameters regulate the respective operations, with aggregation set to mean (average) and combination set to concat (append). The activation type is None. A combined sketch of the GNN and LSTM layers is given after the LSTM parameter list below.
Initializing the LSTM layer requires the following parameters:
1. Batch size (set to 64): controls the number of inputs processed in one iteration
2. Input sequence length (set to 10): controls the length of the input, measured in timesteps
3. Input features (set to 1): controls the shape of the layer's input
4. Units (set to 64): controls the shape of the intermediate output space; these values are then used to predict the result(s) (the real output)
5. Forecast horizon (set to 1): controls how many timesteps we want to predict, which is also the final output of the layer
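Below is a minimal, hypothetical Keras sketch of how the GNN layer and the LSTM layer could be wired together with the parameters above; it is loosely inspired by the reference implementation [23] but simplified, and the class and variable names are my own.

```python
import tensorflow as tf
from tensorflow import keras

class GNNLSTM(keras.Model):
    """Simplified GNN + LSTM: mean-aggregate neighbour features through the
    adjacency matrix, concatenate them with each node's own feature ("concat"
    combination), project to out_feat dimensions with no activation, then run
    a shared LSTM over time for every node and predict forecast_horizon steps."""

    def __init__(self, adjacency, input_seq_len=10, out_feat=10,
                 lstm_units=64, forecast_horizon=1):
        super().__init__()
        self.n_nodes = int(adjacency.shape[0])
        self.input_seq_len = input_seq_len
        self.out_feat = out_feat
        self.forecast_horizon = forecast_horizon
        adj = tf.constant(adjacency, dtype=tf.float32)
        deg = tf.maximum(tf.reduce_sum(adj, axis=1, keepdims=True), 1.0)
        self.norm_adj = adj / deg                      # row-normalised adjacency (mean aggregation)
        self.project = keras.layers.Dense(out_feat, activation=None)
        self.lstm = keras.layers.LSTM(lstm_units)
        self.head = keras.layers.Dense(forecast_horizon)

    def call(self, x):                                  # x: (batch, time, nodes, features)
        agg = tf.einsum("ij,btjf->btif", self.norm_adj, x)          # aggregate neighbour speeds
        h = self.project(tf.concat([x, agg], axis=-1))              # combine ("concat") and project
        h = tf.transpose(h, [0, 2, 1, 3])                           # (batch, nodes, time, out_feat)
        h = tf.reshape(h, (-1, self.input_seq_len, self.out_feat))  # fold nodes into the batch
        h = self.head(self.lstm(h))                                 # (batch * nodes, horizon)
        return tf.reshape(h, (-1, self.n_nodes, self.forecast_horizon))
```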
Finally, once the model is assembled, training it requires the following (a short compile-and-fit sketch follows this list):
• Optimizer (set to RMSprop with learning rate 0.002): controls how the model's weights are updated during training
• Loss (set to mean squared error): controls the function used to calculate the loss at each iteration
• Epochs (set to 10): controls the number of full passes over the training data.
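A short compile-and-fit sketch with these settings; the windowed inputs and targets (sliding windows of 10 timesteps and the speed one step ahead) are assumed to have been prepared from the split arrays and are not shown here.

```python
model = GNNLSTM(adjacency, input_seq_len=10, out_feat=10,
                lstm_units=64, forecast_horizon=1)

model.compile(
    optimizer=keras.optimizers.RMSprop(learning_rate=0.002),
    loss=keras.losses.MeanSquaredError(),
)

# train_inputs / val_inputs: (samples, 10, n_nodes, 1) sliding windows;
# train_targets / val_targets: the node speeds 5 minutes ahead.
model.fit(train_inputs, train_targets,
          validation_data=(val_inputs, val_targets),
          batch_size=64, epochs=10)
```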
Test 1: Full data test
When run with full data, the Mean Square Error (MSE) value for the model is
Processing the model is time-consuming, so the comparison will rely on a small data sample drawn from the first 40 hours. Although the model has many hyperparameters that could be tuned to boost effectiveness, this thesis does not have the resources for exhaustive tuning. Instead, I will focus on a subset of parameters that are most likely to significantly impact the model's performance.
Test 2: Map parts
One key parameter is the number of map parts into which the map is divided. Theoretically, dividing the map into more parts can improve the accuracy of the nodes (points) and consequently yield better results. The results show the mean squared error (MSE) metrics for different values of map parts, highlighting how varying the map-part count influences performance.
These are the prediction results obtained after using a sample of data for preprocessing and model building. The closer the Mean Squared Error (MSE) is to zero, the more accurate the model. With 1000 map parts, the model is able to predict the values more accurately, at the cost of more computational power.
Test 3: Time intervals
Unfortunately, when I decrease the interval to less than 5 minutes, Google Colab cannot process the data and produces a memory overflow error. Therefore, I decided to test with a 10-minute interval and a 30-minute interval.
From the results, the 30-minute interval performs better than the 5-minute interval, while the 10-minute interval shows the weakest performance, due to how speeds are aggregated over longer windows. In the 30-minute interval, there are many more instances of buses reporting 0 speed, which lowers the average speed per timestamp and makes it appear lower than in the 5-minute interval. Consequently, the prediction error for the 30-minute interval is reduced, creating the misleading impression that 30-minute intervals outperform 5-minute intervals.
Test 4: Categorize missing values
During this final test, missing values after grouping are assigned a speed of -1 to distinguish them from entries with a confirmed 0 speed, i.e., not moving. When initializing the speed arrays, any -1 speed is inherited from the previous timestamp, meaning we assume the speed did not change during the missing period.
In speed array 2, the two -1 values on the top right would become 12, and the -1 value on the middle right would become 0.
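A sketch of this inheritance rule applied directly to the speed array (rows are 5-minute intervals, columns are nodes); a -1 in the very first row has no predecessor and is left as-is in this simplified version.

```python
# Wherever a node still holds the placeholder -1, inherit the value from the
# previous timestamp, assuming the speed did not change during the gap.
for t in range(1, speed_array.shape[0]):
    missing = speed_array[t] == -1
    speed_array[t, missing] = speed_array[t - 1, missing]
```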
We applied this algorithm to observe whether it improves on the original approach:
In this case, the model performs better without the algorithm. Since we are only observing a small dataset, the evaluation may change with a bigger dataset.
Conclusion
This project introduces a method to predict bus speeds five minutes into the future for Ho Chi Minh City, offering a practical approach to mitigating urban traffic congestion. The model combines Graph Neural Networks (GNN) and Long Short-Term Memory networks (LSTM) and demonstrates how hyperparameter tuning and additional layers for dimensionality control can enhance predictive accuracy and robustness. By integrating spatial relationships captured by the GNN with temporal dynamics modeled by the LSTM, the framework provides timely and actionable speed forecasts that support transit planning and congestion management in the city. The study also highlights experimentation and results that illustrate the effectiveness of the hybrid GNN-LSTM architecture under varying traffic conditions.
I hope that this thesis can be integrated and used with newer data to help address current problems such as:
• Traffic Management: The predicted bus speeds can be used to estimate the traffic conditions along the bus routes
• Real-Time Navigation: The predicted bus speeds can be incorporated into navigation applications or GPS systems
• Passenger Information Systems: With predicted bus speeds, passenger information systems at bus stops or through mobile applications can provide commuters with estimated arrival times
• Intelligent Transportation Systems: The predicted bus speeds can be integrated into intelligent transportation systems that aim to optimize traffic flow, reduce congestion, and improve overall transportation efficiency.