
Applying data augmentation method to improve anomaly detection with graph neural network


DOCUMENT INFORMATION

Basic information

Title: Applying data augmentation method to improve anomaly detection with graph neural network
Author: Do Thu Uyen
Supervisor: Assoc. Prof. Dr. Le Thanh Ha
University: Vietnam National University, Hanoi University of Engineering and Technology
Major: Computer Science
Document type: Thesis
Publication year: 2025
City: Hanoi
Format
Number of pages: 64
File size: 7.71 MB


Structure

  • 1.1. Motivation
  • 1.2. Contributions
  • 1.3. Outlines
  • Chapter 2 Background and Related Work
    • 2.1. Background
      • 2.1.1. Graph Neural Network Concepts and Notions
      • 2.1.2. Graph Spectral Filtering
      • 2.1.3. Graph Anomaly Detection with Spectral GNNs
      • 2.1.4. Mixup on Graphs
    • 2.2. Related Work
      • 2.2.1. Graph Neural Network
      • 2.2.2. Graph Anomaly Detection
      • 2.2.3. Data Augmentation
  • Chapter 3 Methodology
    • 3.1. Beta-Wavelet Filtering for Classifying Node Anomalies
    • 3.2. Mixup Augmentation with Latent Sampling
    • 3.3. Training Pipeline and Learning Objective
  • Chapter 4 Experiments
    • 4.1. Experimental setup
      • 4.1.1. Datasets
      • 4.1.2. Baselines
      • 4.1.3. Implementation Details
      • 4.1.4. Metric
    • 4.2. Results
      • 4.2.1. Main Results
      • 4.2.2. Ablation Study on the Under-reaching Issue
      • 4.2.3. Comparison between Different Mixup Augmentations
      • 4.2.4. Flexibility Analysis on Mixup Module
      • 4.2.5. Parameter Sensitivity Analysis
    • 1.1 A schematic illustration of fraud detection in real-world graphs [1]
    • 2.1 Overview of AMNet architecture [2]. The model applies spectral graph
    • 2.3 The convolution architecture in graph proposed by [4]
    • 2.4 An illustrative example of a multi-relational fraud graph in website anti-fraud detection. The graph shows how anomalies (e.g., fraudulent websites) may exhibit deceptive connectivity patterns, challenging
    • 2.5 Illustration of augmentation techniques for graph learning [6]
    • 3.1 Overview of our mixup framework for graph anomaly detection: (a) The Beta-Wavelet module; (b) Node sampling module; (c) Intra-class mixup
    • 4.3 The experiment results on Yelp-Chi and Tolokers dataset. The best, second, and third performances are highlighted with bold, underlined,
    • 4.4 The comparison results of the effects of noisy labels in edge-
    • 4.5 Comparison between our BWMixup with the baseline BWGNN when it
    • 4.6 Performance comparison between Vanilla Mixup, GraphMix, NodeMixup, our mixup module, denoted as Ours Mixup when
    • 4.7 Details of key parameter values for BWMixup in the experiments

Content


Motivation

Graph neural networks (GNNs) are gaining popularity for their effectiveness in learning from graph data, demonstrating remarkable performance across diverse fields including social networks, product recommendations, molecular property predictions, and image analysis.

Deep-learning architectures like the Graph Convolution Network (GCN) and Graph Attention Network (GAT) have gained prominence in graph representation learning, excelling in tasks such as node classification, link prediction, and graph classification. However, in graph anomaly detection (GAD), traditional Graph Neural Networks (GNNs) often face challenges in maintaining consistent performance due to the distinct characteristics of anomalous nodes, which are typically sparse and deviate significantly from normal patterns. This complexity complicates effective signal filtering and propagation through network layers. For instance, in real-world fraud detection, fraudulent nodes create both homophily and heterophily connections with benign nodes, making it difficult for GNNs to differentiate between fraudulent and normal behaviors. Addressing GAD is crucial for practical applications, including fraud detection in financial systems, identifying network intrusions, and uncovering unusual patterns in social networks.

There are two major challenges when applying traditional GNNs to graph anomaly detection (GAD): the high degree of heterophily in the graph, where connected nodes often belong to different classes or display distinct features, and the limited availability of labeled training data. Heterophily challenges the fundamental assumption of many GNNs that neighboring nodes share similar attributes or labels, while the scarcity of labels restricts the model's capacity to generalize effectively.

Message passing in traditional GNNs often results in the aggregation of noisy or irrelevant information, leading to a significant drop in performance. Architectures like GCN and GAT tend to smooth input graph signals, making it difficult to distinguish between normal and abnormal nodes. To address this challenge, alternative methods have been proposed, including specialized message aggregation, neighbor-sampling-based approaches, and more flexible filters. Notably, the Beta-Wavelet GNN (BWGNN) and its extensions utilize the graph spectrum to design band-pass filters that stabilize signal propagation in the presence of anomalous nodes. However, BWGNN still encounters difficulties related to the limited number of basis functions, which affects its effectiveness in GAD tasks.

Deep graph-based learning models often face the challenge of over-smoothing, which is exacerbated in Graph Anomaly Detection (GAD) tasks due to limited training data. In datasets like Amazon, the proportion of abnormal nodes typically constitutes less than 10% of the total nodes.

The T-Finance dataset presents similar challenges, particularly in semi-supervised learning scenarios where only a limited number of nodes are labeled for training. To address this issue, various learning frameworks have been proposed, including consistency-based frameworks.

The use of a node's consistency score enhances the training process, while graph-based data augmentation addresses training data shortages by adding new nodes or edges. Additionally, removing nodes and edges can decrease graph heterophily, facilitating easier signal propagation in Graph Convolutional Networks (GCNs). Inspired by data augmentation techniques in image and text processing, mixup augmentation has emerged as a method to create new training samples through the interpolation of example pairs.

In supervised learning, mixup improves robustness by blending input features and labels. For graph data, it creates synthetic samples to address class imbalance and feature sparsity. Various mixup augmentation methods, such as Vanilla mixup, NodeMixup, and Structural mixup, aim to enhance feature diversity and support graph-based learning. However, these approaches struggle with heterophily in GAD tasks and may introduce noisy edges, which can diminish anomaly detection sensitivity. This underscores the need for a more effective method that enhances signal propagation while preserving the unique characteristics of anomaly nodes.

We introduce the Beta-Wavelet Mixup (BWMixup) architecture to address these intertwined challenges. This approach combines the Beta-Wavelet filter with a specialized mixup operator, exploiting the wavelet transform's ability to manage locality in the spectral and spatial domains. However, the filter's layer constraints can limit signal propagation among neighboring nodes, making it less adaptable to global information and susceptible to under-reaching issues. To counteract this, BWMixup utilizes node-based augmentation by leveraging the similarity of latent-domain features to create new edges, thereby increasing graph connectivity and providing a richer context for anomaly detection. Additionally, our method incorporates a mixup strategy that not only introduces new edges but also blends the input signals of labeled nodes, resulting in training that is more robust against noisy anomaly signals. Evaluated across five real-world graphs for anomaly detection, BWMixup consistently shows significant performance improvements, underscoring its practical utility and broad applicability in tackling graph anomaly detection challenges.

Contributions

To summarize, our contributions in this thesis are:

We first investigate the limitations of the Beta-Wavelet GNN (BWGNN) architecture in graph anomaly detection (GAD), focusing on the under-reaching phenomenon. This issue arises when supervision signals from labeled nodes do not effectively reach distant, unlabeled nodes, which is particularly problematic in semi-supervised GAD settings with sparse anomalies and limited labels. Our analysis offers empirical evidence of the detrimental effects of this problem on model generalization and detection performance.

We propose BWMixup, a novel framework designed to tackle the under-reaching issue in GAD tasks. It combines Beta-Wavelet spectral filtering with a customized intra-class mixup augmentation strategy, effectively preserving the high-frequency characteristics of anomalous signals. This approach enhances the flow of information from labeled nodes to unlabeled areas of the graph, while also addressing the structural irregularities and class imbalances commonly found in GAD tasks, setting it apart from traditional augmentation methods.

We introduce a novel latent similarity-based sampling strategy for selecting node pairs for mixup, utilizing the latent representation space derived from the Beta-Wavelet filter. This method ensures that only nodes belonging to the same (pseudo) class are mixed, effectively preventing the introduction of noisy inter-class edges. Furthermore, the strategy takes into account node degree information, prioritizing nodes that are significantly impacted by the under-reaching issue, particularly those with low connectivity.

We conducted extensive experiments on five publicly available graph anomaly detection datasets, demonstrating that BWMixup consistently outperforms state-of-the-art baselines in AUC, AUPRC, and Recall@K metrics. Ablation studies confirm the effectiveness of the mixup components, while comparative analyses showcase the advantages of our method over existing graph mixup strategies.

Background and Related Work

Background

Graph Neural Network Concepts and Notions

In this section, we introduce the fundamental concepts related to static graph modeling, which form the basis for graph-based learning and spectral analysis.

Definition 2.1 (Static graph). We consider an undirected graph, denoted by \(G = \langle V, E, X \rangle\), where:

• \(V = \{v_1, v_2, \dots, v_n\}\) is a set of \(n\) nodes,

• \(E = \{e_{ij} \mid 1 \le i, j \le n\} \subseteq V \times V\) is a set of edges, and

• \(X \in \mathbb{R}^{n \times d}\) is the node attribute matrix, where each row \(x_i \in \mathbb{R}^d\) represents the feature vector of node \(v_i\).

Definition 2.2 (Adjacency matrix). The adjacency matrix \(A \in \mathbb{R}^{|V| \times |V|}\) encodes the connectivity between nodes in the graph. Each element \(A_{i,j}\) is defined as:

\[ A_{i,j} = \begin{cases} 1 & \text{if } (v_i, v_j) \in E, \\ 0 & \text{otherwise.} \end{cases} \]

Definition 2.3 (Node degree). The degree of a node \(v_i \in V\), denoted \(\deg(v_i)\), is the total number of edges connected to \(v_i\). It is computed as the sum of the corresponding row in the adjacency matrix:

\[ \deg(v_i) = \sum_{j=1}^{n} A_{i,j}. \]

Definition 2.4 (Degree matrix). The degree matrix \(D \in \mathbb{R}^{n \times n}\) for a graph \(G\) is a diagonal matrix where each diagonal element corresponds to the degree of a node. It is defined as:

\[ D_{i,i} = \deg(v_i), \qquad D_{i,j} = 0 \ \text{for}\ i \neq j. \]

Definition 2.5 (Laplacian matrix). Given a graph \(G = (V, E)\) with adjacency matrix \(A\) and degree matrix \(D\), the unnormalized Laplacian matrix \(L\) is defined as:

\[ L = D - A. \]

Definition 2.6 (Normalized Laplacian matrix). The normalized graph Laplacian matrix \(L\) is defined as:

\[ L = I - D^{-1/2} A D^{-1/2}, \]

where \(I \in \mathbb{R}^{n \times n}\) is the identity matrix and \(D\) is the diagonal degree matrix. This form of the Laplacian is widely used in spectral graph theory due to its favorable mathematical properties.
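To make Definitions 2.2-2.6 concrete, the following is a minimal NumPy sketch (illustrative, not taken from the thesis) that builds the adjacency, degree, unnormalized Laplacian, and normalized Laplacian matrices from an edge list; node indices are assumed to run from 0 to n-1.

```python
import numpy as np

def graph_matrices(edges, n):
    """Build A, D, L and the normalized Laplacian of an undirected graph."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0            # Definition 2.2: undirected adjacency
    deg = A.sum(axis=1)                    # Definition 2.3: degree as a row sum
    D = np.diag(deg)                       # Definition 2.4: diagonal degree matrix
    L = D - A                              # Definition 2.5: unnormalized Laplacian
    d_inv_sqrt = np.zeros(n)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5       # isolated nodes are left at zero
    D_inv_sqrt = np.diag(d_inv_sqrt)
    L_norm = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt   # Definition 2.6
    return A, D, L, L_norm
```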

Based on these definitions, standard GNNs propagate information by defining two functions: an aggregate function and an update function. The aggregate function gathers feature information from a node's neighbors:

\[ m_i^{(k)} = \mathrm{AGGREGATE}\big(\{\, h_j^{(k-1)} : j \in \mathcal{N}(i) \,\}\big). \]

The update function combines the aggregated message with the node's current state to produce its new representation:

\[ h_i^{(k)} = \mathrm{UPDATE}\big(h_i^{(k-1)}, m_i^{(k)}\big). \]

In Graph Convolutional Networks (GCNs), the feature vector of node \(i\) at layer \(k\) is denoted \(h^{(k)}_i\), while \(\mathcal{N}(i)\) indicates the set of its neighboring nodes. GCNs, introduced by [4], utilize an efficient aggregation method based on the normalized graph Laplacian:

\[ H^{(k+1)} = \sigma\big( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(k)} W^{(k)} \big), \tag{2.3} \]

where \(\tilde{A} = A + I\) is the adjacency matrix with self-loops, \(\tilde{D}\) is its degree matrix, \(W^{(k)}\) is a learnable weight matrix, and \(\sigma\) is a nonlinearity (e.g., ReLU). By applying this at each layer, each node effectively averages its own and its neighbors' features, then applies a linear transform.
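As an illustration of the propagation rule in Eq. (2.3), here is a small NumPy sketch of one GCN layer; the self-loop handling and exact normalization follow the standard formulation in [4] and may differ slightly from the thesis implementation.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: symmetric normalization, neighborhood averaging, linear transform, ReLU."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                      # add self-loops so each node keeps its own signal
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(deg ** -0.5)          # D^{-1/2}; with self-loops every degree is >= 1
    H_agg = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H
    return np.maximum(H_agg @ W, 0.0)          # sigma = ReLU
```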

Graph Attention Networks: [10] replaced the fixed averaging in GCN with self-attentional weights:

\[ \alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}\big(a^{\top}[\,W h_i \,\|\, W h_j\,]\big)\big)}{\sum_{k \in \mathcal{N}(i)} \exp\big(\mathrm{LeakyReLU}\big(a^{\top}[\,W h_i \,\|\, W h_k\,]\big)\big)}. \]

Here, \(a\) and \(W\) are learned parameters, and multi-head attention can be used to stabilize learning.

Both Graph Attention Networks (GAT) and Graph Convolutional Networks (GCN) function as low-pass filters on graphs, with their characteristics significantly influenced by the signal distribution within the graph. This distribution can be categorized into two types: homophily, where similar nodes connect, and heterophily, where dissimilar nodes are linked.

Definition 2.7 (Homophily and Heterophily). An important characteristic of graphs in learning tasks is the node similarity assumption. A graph is said to exhibit homophily when connected nodes tend to belong to the same class or share similar features. Formally, the edge homophily ratio is defined as:

\[ h(G) = \frac{1}{|E|} \sum_{(v_i, v_j) \in E} \mathbb{I}(y_i = y_j), \]

and the corresponding heterophily ratio replaces the condition with \(\mathbb{I}(y_i \neq y_j)\), where \(y_i\) and \(y_j\) are the class labels of nodes \(v_i\) and \(v_j\), and \(\mathbb{I}\) is the indicator function, which equals 1 if the condition holds and 0 otherwise.

The AMNet architecture utilizes spectral graph filtering with high-pass and low-pass filters to capture multi-frequency components of node features. It employs a node-level attention mechanism to adaptively fuse the embeddings \(Z_H\) and \(Z_L\). The resulting representation \(Z_i\) is then used for predicting anomalies at the node level.

A high homophily ratio, nearing 1, signifies a homophilous graph, whereas a low ratio, close to 0, indicates a heterophilous structure. Understanding the level of homophily is essential for choosing or designing suitable graph learning models, especially in applications such as anomaly detection, where heterophily is common.

In addition to the global homophily ratio, the local homophily and heterophily degrees of a node \(v_i\) are defined as:

\[ \mathrm{homo}(v_i) = \frac{1}{|\mathcal{N}(v_i)|} \sum_{j \in \mathcal{N}(v_i)} \mathbb{I}(y_{v_i} = y_{v_j}), \qquad \mathrm{hetero}(v_i) = \frac{1}{|\mathcal{N}(v_i)|} \sum_{j \in \mathcal{N}(v_i)} \mathbb{I}(y_{v_i} \neq y_{v_j}), \]

where \(\mathcal{N}(v_i)\) denotes the neighborhood of node \(v_i\).
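A small sketch (illustrative, not from the thesis) that computes the global edge homophily ratio of Definition 2.7 and the per-node heterophily degree described above, given an undirected edge list and integer class labels.

```python
import numpy as np

def edge_homophily(edges, y):
    """Fraction of edges whose two endpoints share the same class label."""
    same = [int(y[i] == y[j]) for i, j in edges]
    return sum(same) / max(len(same), 1)

def local_heterophily(edges, y, n):
    """Per-node fraction of neighbors carrying a different label."""
    diff, deg = np.zeros(n), np.zeros(n)
    for i, j in edges:
        for u, v in ((i, j), (j, i)):          # count both directions of an undirected edge
            deg[u] += 1
            diff[u] += int(y[u] != y[v])
    return np.divide(diff, deg, out=np.zeros(n), where=deg > 0)
```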

Graph Spectral Filtering

The graph Laplacian matrix \(L\) is a real symmetric and positive semi-definite matrix, which allows for an eigendecomposition of the form:

\[ L = U \Lambda U^{\top}, \]

where:

• \(\Lambda = \mathrm{diag}\{\lambda_1, \dots, \lambda_N\}\) is a diagonal matrix of eigenvalues \(\lambda_1 \le \lambda_2 \le \dots \le \lambda_N\), and

• \(U = [u_1, \dots, u_N] \in \mathbb{R}^{N \times N}\) is the matrix of orthonormal eigenvectors corresponding to the eigenvalues in \(\Lambda\).

In graph signal processing (GSP), the eigenvectors of the Laplacian matrix serve as the Fourier bases of the graph. For a graph signal \(X = [x_1, \dots, x_N]^{\top} \in \mathbb{R}^{N \times d}\), where \(x_i\) denotes the feature vector of node \(v_i\), the graph Fourier transform of \(X\) is defined as \(\hat{X} = U^{\top} X\), with inverse transform \(X = U \hat{X}\).

Eigenvalues in the matrix \(\Lambda\) represent frequencies on the graph, with smaller eigenvalues capturing low-frequency (smooth) components and larger eigenvalues corresponding to high-frequency (non-smooth or anomalous) components. The goal of spectral filtering is to determine a response function \(g(\cdot)\) on \(\Lambda\) that facilitates learning the graph representation \(Z\):

\[ Z = g(L) X = U\big[\, g(\Lambda) \odot (U^{\top} X) \,\big] = U\, g(\Lambda)\, U^{\top} X, \]

where \(\odot\) is the element-wise product. The Laplacian matrix \(L\) itself can be viewed as a high-pass filter of the input graph signal [28].
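The spectral filtering pipeline above (Fourier transform, filtering in the frequency domain, inverse transform) can be sketched directly with a dense eigendecomposition. This is purely illustrative; practical spectral GNNs avoid the explicit eigendecomposition by using polynomial approximations.

```python
import numpy as np

def spectral_filter(L, X, g):
    """Compute Z = U g(Lambda) U^T X for a symmetric Laplacian L and signal X."""
    lam, U = np.linalg.eigh(L)            # eigenvalues (ascending) and orthonormal eigenvectors
    X_hat = U.T @ X                       # graph Fourier transform of the signal
    Z_hat = g(lam)[:, None] * X_hat       # scale each frequency component by the response g
    return U @ Z_hat                      # inverse transform back to the vertex domain

# Example response: g(lam) = lam acts as a high-pass filter, like applying L itself.
# Z = spectral_filter(L_norm, X, lambda lam: lam)
```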

Graph Anomaly Detection with Spectral GNNs

Anomaly detection is a highly regarded area of research due to its widespread applications. Traditional methods often focus on individual data points, neglecting the interconnections that can provide valuable insights. A key strategy in graph anomaly detection (GAD) is to approach it as a semi-supervised node classification problem. In this context, the sets of abnormal and normal nodes are represented as \(V_a\) and \(V_n\), respectively, within a network defined as \(G = \langle V, E, X \rangle\), where \(V = V_a \cup V_n\). Anomalies are typically labeled as positive (1), while normal nodes are labeled as negative (0).

In this semi-supervised task, a GNN \(f\) utilizes the information from the labeled nodes \(D_L\) and their corresponding labels. The model is then trained to predict labels for the unlabeled nodes \(D_U\):

\[ f(G, D_L) \rightarrow \hat{y}_{test}, \tag{2.7} \]

with \(\hat{y}_{test} \in D_U\). Note that GAD is an imbalanced classification problem, i.e., \(|V_a| \ll |V_n|\). We also consider the sparse training setting \(|D_L| \ll |D_U|\) [7].

Standard Graph Neural Networks (GNNs), like Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), utilize polynomial expansions to extend the receptive field from 2D images to graph structures. However, these methods lack efficiency in effectively distinguishing local variations within the graph.

To encode node locality in GAD, band-pass filters derived from the Beta-Wavelet transform can capture both the local and global structural properties of a node. For an input signal \(x\) over the graph \(\mathcal{G}\), the wavelet transform is defined based on a family of wavelet bases \(\Psi_z = \{\psi_{z_0}, \psi_{z_1}, \dots, \psi_{z_C}\}\):

\[ \psi_{z_i}(x) = U\,\mathrm{diag}\big(g_i(\lambda)\big)\,U^{\top} x, \]

where \(g_i\) denotes the kernel function defined on the spectral domain of \(\lambda\), and \(\mathrm{diag}(g_i(\lambda))\) is an \(N \times N\) diagonal matrix whose diagonal entries are \(g_i(\lambda_j)\) for the eigenvalues \(\{\lambda_j\}_{j=1}^{N}\).

Wavelet transforms utilize a set of bases that act as band-pass filters across various scales while satisfying the admissibility condition. Each wavelet base \(\psi_{z_i}\) is associated with a signal diffused from a central node of the graph at a specific scale. These filters enable the model to concentrate on the distinct frequency bands linked to anomalies while maintaining local topological information, enhancing the locality of each node and reflecting its anomaly characteristics. The Beta-Wavelet GNN (BWGNN) proposed in [12] learns specialized filters tailored to GAD tasks and demonstrates significant improvements over other baseline models. However, it faces limitations in signal propagation from labeled to unlabeled nodes: each layer in BWGNN corresponds to a base that primarily captures the locality of labeled nodes, making it more susceptible to the under-reaching issue than standard GNNs.
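A rough sketch of one way to realize Beta-Wavelet band-pass filters as matrix polynomials of the normalized Laplacian (whose eigenvalues lie in [0, 2]), following the general idea of BWGNN [12]. The exact kernels and normalization used in the thesis may differ, so treat this as an assumption-laden illustration rather than the method itself.

```python
import numpy as np
from scipy.special import beta as beta_fn

def beta_wavelet_outputs(L_norm, X, order=2):
    """Apply a family of Beta-distribution band-pass filters to a graph signal X.

    For p + q = order, the kernel (L/2)^p (I - L/2)^q passes progressively higher
    frequency bands as p grows, giving one filtered "view" of the signal per base.
    """
    n = L_norm.shape[0]
    half = L_norm / 2.0
    views = []
    for p in range(order + 1):
        q = order - p
        W = np.linalg.matrix_power(half, p) @ np.linalg.matrix_power(np.eye(n) - half, q)
        W = W / beta_fn(p + 1, q + 1)          # Beta-function normalization of the kernel
        views.append(W @ X)
    return views                                # typically concatenated and fed to an MLP head
```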

Mixup on Graphs

In recent years, Mixup has become a significant data augmentation method that enhances generalization in supervised learning, especially when labeled data is scarce. Originally introduced for image classification, Mixup creates synthetic training examples through linear combinations of existing data points.

Mixup enhances image classification by creating synthetic samples through the interpolation of both image pixels and labels, as illustrated in Figure 2.2 (left). In node classification, Mixup requires interpolating node features along with their respective receptive-field subgraphs, since each node's representation is shaped by its structural context. For two samples \((x_i, y_i)\) and \((x_j, y_j)\), where \(x\) is the input feature and \(y\) the one-hot class label, the Mixup operation is defined as:

\[ \tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j, \tag{2.10} \]

where \(\lambda \in [0, 1]\) is a mixing coefficient typically drawn from a Beta distribution, \(\lambda \sim \mathrm{Beta}(\alpha, \alpha)\) with \(\alpha > 0\). This allows Mixup to enhance the training distribution by leveraging the prior knowledge that interpolating features results in corresponding interpolations of their associated labels.
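A minimal sketch of the vanilla Mixup operator in Eq. (2.10), assuming one-hot label vectors; this is a generic illustration rather than the thesis implementation.

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=1.0, rng=None):
    """Interpolate a pair of samples: features and one-hot labels share the same lambda."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)               # lambda ~ Beta(alpha, alpha), alpha > 0
    x_mix = lam * x_i + (1.0 - lam) * x_j
    y_mix = lam * y_i + (1.0 - lam) * y_j
    return x_mix, y_mix
```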

In the graph domain, particularly for node-level tasks such as node classification, applying Mixup requires care, since interpolating node features can disrupt meaningful neighborhood information. Recent research has introduced various adaptations of Mixup tailored to the graph domain, including Manifold Mixup, which performs Mixup within the hidden layers of a single neural network.

Given two input instances \((x_i, y_i)\) and \((x_j, y_j)\) and a randomly chosen layer \(l\), the representations at that layer are interpolated as follows:

\[ \tilde{h}^{(l)} = \lambda h_i^{(l)} + (1 - \lambda) h_j^{(l)}, \qquad h_i^{(l)} = f^{(l)}(x_i), \]

where \(h_i^{(l)}\) denotes the hidden state of input \(x_i\) at layer \(l\). The interpolated representation \(\tilde{h}^{(l)}\) is then passed through the remaining layers to generate predictions. The method is designed to smooth decision boundaries and enhance model robustness by promoting linear behavior within the learned feature manifold.
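The layer-level interpolation of Manifold Mixup can be sketched as follows; representing the network as a list of per-layer callables is an assumption made only for illustration.

```python
def manifold_mixup_forward(layers, x_i, x_j, mix_layer, lam):
    """Run two inputs up to a chosen layer, mix their hidden states, then continue forward."""
    h_i, h_j = x_i, x_j
    for layer in layers[:mix_layer]:
        h_i, h_j = layer(h_i), layer(h_j)      # h^(l) = f^(l)(x) for both inputs
    h = lam * h_i + (1.0 - lam) * h_j          # mixed hidden state at layer l
    for layer in layers[mix_layer:]:
        h = layer(h)                           # the mixed representation continues through the network
    return h
```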

In contrast, [3] introduces a general framework to adapt Mixup to graph learning.

It employs a dual-encoder architecture, where each input is processed independently through a shared GNN encoder \(f_\theta(\cdot)\). The outputs of the two branches are then interpolated in the latent space:

\[ \tilde{z} = \lambda f_\theta(x_i) + (1 - \lambda) f_\theta(x_j). \]

The mixed embedding \(\tilde{z}\) is then passed through a classifier to obtain predictions.

Recent structure-aware Mixup techniques integrate graph topology into the augmentation process. S-Mixup enhances generalization by creating virtual nodes through Mixup and strategically linking them to existing graph nodes based on edge-importance gradients, ensuring that new nodes inherit relevant local context. In contrast, NodeMixup addresses the limitations of shallow GNNs in semi-supervised settings by performing both intra-class and inter-class Mixup, which improves local smoothness and promotes discriminative learning. To ensure meaningful pair selection, Neighborhood Label Distribution (NLD) sampling is employed to identify node pairs based on their label distribution and structural similarity:

\[ \tilde{E} = E \cup \{(v_i, \tilde{v}), (v_j, \tilde{v})\}, \tag{2.14} \]

i.e., the mixed node \(\tilde{v}\) is connected back to the graph through edges to both \(v_i\) and \(v_j\). This simple yet effective design allows NodeMixup to propagate supervision signals more broadly without increasing network depth.

Figure 2.3: The convolution architecture in graph proposed by [4]

Methodology

Beta-Wavelet Filtering for Classifying Node Anomalies

Mixup Augmentation with Latent Sampling

To enhance the performance of BWGNN, we propose a mixup augmentation based on the output of the filter \(g_\theta\). This mixup operator samples similar nodes in the latent space and combines their features and structures by introducing edges between them. To minimize the risk of increasing heterophily through the addition of new edges, we sample pairs of nodes exclusively within the same class. Initially, we learn the hidden output \(Z = g_\theta(\lambda) X\), where \(X\) is the input feature matrix and \(Z = \{z_1, z_2, \dots, z_N\}\) are the latent node representations. The latent representations \(z_i\) capture complex structural features of the graph after the wavelet transform, effectively encoding local information about the nodes and their environments, which is essential for detecting anomalous behaviors. To further sharpen these representations in the latent domain, we apply a sharpening trick.

The temperature parameter \(\tau\) (with \(0 < \tau < 1\)) regulates the sharpness of the distribution. The predictions for both the labeled dataset \(D_L\) and the unlabeled dataset \(D_U\) are generated by the MLP module together with the softmax layer.

Given a fixed confidence threshold \(\tau\), a cut-off value \(T\) is defined on these sharpened predictions, and \(T\) is then used to compute the pseudo label \(\hat{y}_i\) for each \(i \in D_U\).
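The sharpening and pseudo-labeling steps can be sketched as below. The exact sharpening formula and cut-off rule are not reproduced from the thesis, so the code only illustrates the common temperature-sharpening and confidence-threshold pattern (the threshold value of 0.8 comes from the implementation details reported later).

```python
import numpy as np

def sharpen(probs, temperature=0.5):
    """Sharpen softmax outputs: raise to 1/T and renormalize (0 < T < 1 sharpens)."""
    p = probs ** (1.0 / temperature)
    return p / p.sum(axis=1, keepdims=True)

def pseudo_labels(probs, threshold=0.8):
    """Keep only confident predictions on unlabeled nodes; return labels and a validity mask."""
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    return labels, confidence >= threshold
```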

The mixup strategy for classifying node anomalies must account for graph heterophily effects, unlike other mixup strategies designed for general node classification tasks. To sample new edges, we utilize the latent output \(z_i\) and the pseudo label \(\hat{y}_i\) to compute similarity scores in the latent space, preserving key distinctions between nodes even in heterophilous environments. The BWGNN model leverages wavelet-based latent representations to avoid over-smoothing and maintain the discriminative power of node embeddings. Additionally, we construct a similarity matrix \(S\) based on the cosine similarity between \(z_i\) and \(z_j\), which helps determine the likelihood of nodes \(i\) and \(j\) sharing similar neighbor patterns.

For a labeled node \(i\), the sampling weight \(s_{ij}\) of an unlabeled node \(j\) is valid only when the pseudo-label \(\hat{y}_j = y_i\). This condition is crucial to avoid the creation of new heterophily edges, which could dilute the signal and diminish predictive performance.

To reduce the effects of over-smoothing, a further sharpening step based on the nodes' degrees is employed, so that low-degree nodes receive larger sampling weights.
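A sketch of the latent, intra-class sampling step: cosine similarities between labeled and unlabeled latent vectors are masked by (pseudo-)class agreement and re-weighted toward low-degree nodes. The exact degree-sharpening function is not given here; dividing by (1 + degree) is only a stand-in for the idea, and all names are illustrative.

```python
import numpy as np

def sampling_weights(z_l, z_u, y_l, y_hat_u, deg_u):
    """Similarity-based weights s_ij between labeled nodes i and unlabeled nodes j."""
    zl = z_l / (np.linalg.norm(z_l, axis=1, keepdims=True) + 1e-12)
    zu = z_u / (np.linalg.norm(z_u, axis=1, keepdims=True) + 1e-12)
    S = zl @ zu.T                                       # cosine similarity in the latent space
    same_class = y_l[:, None] == y_hat_u[None, :]       # keep intra-class pairs only
    S = np.where(same_class, np.clip(S, 0.0, None), 0.0)
    return S / (1.0 + deg_u[None, :])                   # favor low-degree (under-reached) nodes

def sample_partners(S, rng=None):
    """For each labeled node i, draw one unlabeled partner j with probability proportional to S[i]."""
    rng = rng or np.random.default_rng()
    pairs = []
    for i, row in enumerate(S):
        if row.sum() > 0:
            pairs.append((i, int(rng.choice(len(row), p=row / row.sum()))))
    return pairs
```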

The sampling process for mixup pairs of nodes begins with each \(i \in D_L\), matching it with a new \(j \in D_U\) according to the row \(s_i\) of \(S\). For each selected pair \((i, j)\), the mixup operator is applied to the input features:

\[ \hat{x}_i = \lambda x_i + (1 - \lambda) x_j, \tag{3.12} \]

where \(x_i\) and \(x_j\) are the original features and \(\lambda\) is the mixup weight, drawn from a Beta distribution. The set \(\hat{X} = \{\hat{x}_i \mid i \in D_L\} \cup \{x_j \mid j \in D_U\}\) forms the new input features after the mixup operator.

To reduce the under-reaching issue, we further augment the local structure of the labeled node \(i\) by mixing it with the local structure of the unlabeled node \(j\):

\[ \hat{E} = E \cup \{(i, u) \mid u \in N_j\}, \tag{3.13} \]

where \(\hat{E}\) is the new edge list and \(N_j\) is the neighbor set of node \(j\). With the pseudo-labels, the intra-class mixup module is applied to blend the local structures within each class, resulting in a new augmented graph \(\hat{G} = \langle V, X, \hat{E} \rangle\). This augmented graph is used to train the Beta-Wavelet module \(g_\theta\) with the pseudo labels \(\hat{y}\) derived from the unlabeled dataset \(D_U\).
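Putting Eqs. (3.12) and (3.13) together, the augmentation step can be sketched as follows; the names are illustrative, and the pair list and neighbor map are assumed to come from the sampling step above.

```python
import numpy as np

def build_augmented_graph(X, edges, pairs, neighbors, alpha=1.0, rng=None):
    """Return mixed features X_hat and the enlarged edge set E_hat."""
    rng = rng or np.random.default_rng()
    X_hat = X.copy()
    E_hat = set(map(tuple, edges))
    for i, j in pairs:
        lam = rng.beta(alpha, alpha)
        X_hat[i] = lam * X[i] + (1.0 - lam) * X[j]      # Eq. (3.12): mix the labeled node's features
        for u in neighbors[j]:
            E_hat.add((i, u))                           # Eq. (3.13): inherit the partner's neighborhood
    return X_hat, sorted(E_hat)
```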

Our method emphasizes the extraction of strong features for each class by interpolating between sampled node pairs within that class. In GAD, anomalies are frequently obscured by the dominance of normal nodes in their vicinity, resulting in a notable class imbalance. Recent studies have observed that inter-class edges between anomaly and normal nodes introduce noise and negatively impact the learning performance of GNNs: mixing nodes from different classes obscures the unique characteristics of anomalies and weakens their critical signals. To tackle this challenge, our method preserves class-specific information by employing only intra-class mixup operations. This approach enhances feature diversity within each class while maintaining the model's sensitivity to anomalies, ultimately improving detection accuracy and robustness in GAD tasks.

The BWMixup sampling method enhances the Neighborhood Label Distribution approach from NodeMixup by utilizing latent wavelet-transformed features and incorporating node degree information, addressing the instability of NodeMixup in GAD tasks. This method effectively captures complex relationships and reduces noise, making it particularly beneficial for high-heterophily graphs with inconsistent label distributions. By focusing on low-degree nodes, which are more vulnerable to under-reaching issues, the sampling process improves model robustness and performance. Unlike Manifold Mixup, which directly manipulates latent representations, our approach uses the latent space solely for sampling new edges, preserving essential structural information while minimizing the impact of noise. The use of cosine similarity in the latent space allows for more accurate node pairing based on feature similarity, ensuring that relevant nodes are selected for the mixup module. Additionally, latent-domain sampling adjusts node importance dynamically by considering node degrees, accounting for centrality and information-dissemination potential. These improvements lead to more diverse and representative node pairs, boosting model generalization and reducing overfitting.

Experiments

Experimental setup

Datasets

To evaluate GAD performance, we use five real-world graph datasets:

The Amazon dataset aims to identify users who are compensated for writing fraudulent reviews in the Musical Instruments category on Amazon.com. It consists of multi-relation graphs featuring 25-dimensional attributes and three types of relationships: users are linked if they have reviewed a common product (U-P-U), assigned the same star rating to any product within the same week (U-S-U), or demonstrated similar review text within the top 5% (U-V-U). For our analysis, we focus solely on the single-relation version of this dataset.

The Yelp-Chi dataset is designed to identify anomalous reviews that either unfairly promote or undermine businesses on Yelp.com. It includes filtered and recommended hotel and restaurant reviews, organized in a graph structure with three types of edges: R-U-R (reviews by the same user), R-S-R (reviews for the same product with identical star ratings), and R-T-R (reviews for the same product posted in the same month). For the GAD classification task, we do not utilize the multi-relation information present in this dataset.

The T-Finance dataset is specifically created to detect anomalous accounts in transaction networks, featuring a single relationship type. Each node in the dataset represents a unique anonymized account, described by 10-dimensional features such as registration duration, login behavior, and interaction frequency. The edges in the network signify transaction records between pairs of accounts, with abnormal nodes identified as those involved in fraudulent activities, money laundering, and online gambling.

The Questions dataset, sourced from the question-answering platform Yandex Q, aims to identify anomalous accounts by analyzing user interactions over a one-year period from September 2021 to August 2022. In this dataset, users are represented as nodes, with edges connecting them when one user answers another's question. Focusing on users interested in "medicine," the goal is to predict user activity on the platform by the end of the observation period. Node features are derived by averaging FastText embeddings from each user's profile description, along with a binary indicator for users without a description.

The Tolokers dataset is a network of crowd-sourcing platform workers sourced from Toloka, where an edge connects two tolokers if they have collaborated on the same task. Node features are based on the workers' profile information and their task performance statistics. Anomaly nodes are identified as tolokers who have been banned from participating in any projects.

Dataset statistics are provided in Table 4.1.

Baselines

We compare our BWMixup with different baselines, which can be categorized into three groups:

• Standard GNNs, including Graph Convolutional Network (GCN) [4], Graph Isomorphism Network (GIN) [61], Graph Sample and Aggregate (GraphSAGE) [62], Graph Attention Network (GAT) [10], and Boosted Graph Neural Network (BGNN) [63].

• GNNs tailored for anomaly detection. We assess four spatial GNNs: the Graph-based Anti-Spam model (GAS), Deep Cluster Infomax (DCI), Pick and Choose GNN (PC-GNN), and GAT with ego- and neighbor-embedding separation (GATSep).

• Spectral GNNs, including the Bernstein Approximation network (BernNet), Adaptive Multi-frequency GNN (AMNet), Graph Heterophily Reduction Network (GHRN), and Beta-Wavelet Graph Neural Network (BWGNN).

Implementation Details

We employ semi-supervised settings for model evaluation, following the settings in prior work. The training set consists of 100 labeled nodes: 20 positive labels (anomalous nodes) and 80 negative labels (normal nodes). The remaining nodes are evenly split between the validation set and the test set, and the \(D_U\) set in our BWMixup encompasses both the validation and test sets.

All baseline methods are initialized using the same parameters as in their official code. We train each model for 50 epochs with the Adam optimizer, saving the model that achieves the highest AUROC on the validation set. Common hyperparameters include a learning rate selected from {0.01, 0.005, 0.001} and hidden dimensions \(d_z\) from {8, 16, 32, 48, 64}. Individual hyperparameters for each model are fine-tuned within the ranges suggested by the authors. In our mixup module, the confidence threshold is set to \(\tau = 0.8\), while the weight of the augmentation loss \(\gamma\) is chosen from {0.1, 0.3, 0.6, 0.9, 1.2, 1.5}.
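For reference, the hyperparameter search described above can be summarized as a small configuration dictionary; the key names below are illustrative, and only the values are taken from the text.

```python
# Hypothetical configuration mirroring the search space described above.
bwmixup_search_space = {
    "epochs": 50,
    "optimizer": "Adam",
    "learning_rate": [0.01, 0.005, 0.001],
    "hidden_dim_dz": [8, 16, 32, 48, 64],
    "confidence_threshold_tau": 0.8,
    "aug_loss_weight_gamma": [0.1, 0.3, 0.6, 0.9, 1.2, 1.5],
}
```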

Table 4.1 presents essential information about our five evaluation datasets, detailing the number of nodes and edges, the input feature dimensions for each node, the proportion of anomaly nodes, the training ratio, and the types of node features.

Dataset #Nodes #Edges #Feat Anomaly Feature Type

Metric

Given that graph anomaly detection (GAD) presents a class-imbalanced classification challenge, we employ three standard metrics for a fair evaluation: the Area Under the Receiver Operating Characteristic Curve (AUROC), the Area Under the Precision-Recall Curve (AUPRC) calculated through average precision, and the Recall score within the top-K predictions (Rec@K).

We define K as the number of anomalies present in the test set and treat anomalies as the positive class across all metrics, so higher scores reflect better model performance. Each model is evaluated using five different seeds, and the results are reported as averages. All experiments, for both the baseline models and our proposed model, are conducted on an RTX 3090 GPU.
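The three metrics can be computed with scikit-learn plus a short Rec@K helper; this is a generic sketch of the evaluation protocol described above, not the thesis code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def gad_metrics(y_true, scores):
    """AUROC, AUPRC (average precision) and Rec@K, with K = number of anomalies in y_true."""
    y_true = np.asarray(y_true)
    auroc = roc_auc_score(y_true, scores)
    auprc = average_precision_score(y_true, scores)
    k = int(y_true.sum())                        # anomalies are the positive class
    top_k = np.argsort(scores)[::-1][:k]         # indices of the K highest anomaly scores
    rec_at_k = y_true[top_k].sum() / max(k, 1)
    return auroc, auprc, rec_at_k
```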

Results

The performance of our model, BWMixup, is superior across the various datasets, as shown in Tables 4.2 and 4.3. Table 4.2 reports results for datasets with a low percentage of anomaly nodes, while Table 4.3 focuses on those with a high percentage. BWMixup achieves the best AUC in all testing datasets and excels in the AUPRC metric, scoring highest in 3 out of 5 datasets. Although the GAS and BWGNN models outperform BWMixup in AUPRC on the Amazon and Tolokers datasets, respectively, our model leads in the Rec@K metric, achieving the best results in 4 of the 5 datasets. These findings underscore BWMixup's effectiveness in detection accuracy and its ability to address key limitations in GAD, particularly in low-anomaly scenarios where class imbalance is pronounced. The intra-class mixup generates synthetic nodes that maintain class boundaries and enhance sample representation, while the Beta-Wavelet filter facilitates the propagation of informative signals among structurally distant yet semantically similar nodes.

Our model outperforms the baseline BWGNN by 2%-4%, particularly on the Amazon and T-Finance datasets. This advantage is largely due to BWMixup's capability to tackle under-reaching and feature-sparsity issues. Unlike BWGNN, which is constrained by the number of basis functions and struggles to propagate signals to under-represented nodes, BWMixup employs mixup to create virtual nodes that enrich the local neighborhood. This allows the model to detect anomalies that might otherwise be overlooked under low-frequency propagation. Additionally, BWMixup consistently exhibits the lowest variance across seeds among all tested baselines, indicating that the data perturbation both addresses under-reaching and enhances prediction stability.

Standard GNN models, particularly GCN, often underperform in anomaly detection due to their low-pass filtering limitations. In contrast, the GAT model benefits from its attention mechanism, which enables it to prioritize relevant neighbors during information aggregation, an ability that is essential for identifying anomalies. However, GAT still struggles to capture global structural inconsistencies, especially in high-heterophily scenarios. BWMixup improves upon this by creating interpolated nodes that reflect intra-class similarity, enhancing the ability to differentiate anomalous nodes. Overall, our proposed method significantly outperforms standard GNNs across various metrics and datasets, effectively addressing the unique challenges of anomaly detection tasks.

GAD-specific methods generally outperform standard GNNs, with the exception of the Questions dataset. PC-GNN, a specialized GAD classifier, modifies the adjacency matrix to reduce noisy edges and maintain balanced neighborhood label frequencies. In contrast, BWMixup enriches the available information by creating new edges and augmenting graph features, thus avoiding the risk of information loss associated with edge deletion. These approaches serve as competitive baselines, and the experimental results show that the proposed method consistently surpasses them across all metrics and datasets. This suggests that the mixup strategy in the proposed model captures richer structural information, resulting in improved detection accuracy.

Spectral-based GNNs demonstrate superior performance over standard GNNs and GAD methods, particularly on the Amazon dataset, whose anomaly heterophily is 0.9254. This performance gap is attributed to the detrimental effects of class imbalance in the GAD problem, which increases heterophily and exacerbates the under-reaching issue. The high heterophily rate limits the model's ability to leverage valuable local information, forcing it to rely on distant nodes. To effectively tackle these challenges, BWMixup integrates Beta-Wavelet filtering and intra-class mixup, enhancing signal propagation and stability in high-heterophily environments.

Table 4.2: The experiment results on the Amazon, T-Finance, and Questions datasets. The best, second, and third performances are highlighted with bold, underlined, and italics.

| Model | Amazon AUC ↑ | Amazon AUPRC ↑ | Amazon Rec@K ↑ | T-Finance AUC ↑ | T-Finance AUPRC ↑ | T-Finance Rec@K ↑ | Questions AUC ↑ | Questions AUPRC ↑ | Questions Rec@K ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GCN | 81.91 ± 0.03 | 32.87 ± 0.27 | 38.73 ± 1.31 | 86.58 ± 1.23 | 58.85 ± 2.31 | 58.59 ± 2.15 | 60.47 ± 0.66 | 6.20 ± 0.20 | 9.92 ± 0.83 |
| GAT | 90.63 ± 0.82 | 78.16 ± 1.50 | 73.69 ± 2.19 | 84.57 ± 0.42 | 40.49 ± 2.00 | 36.00 ± 2.77 | 60.46 ± 0.33 | 5.16 ± 1.15 | 6.97 ± 1.44 |
| GIN | 83.60 ± 1.05 | 42.59 ± 1.20 | 52.33 ± 1.78 | 85.04 ± 1.20 | 51.23 ± 4.31 | 56.82 ± 2.01 | 61.09 ± 0.46 | 7.15 ± 0.92 | 9.32 ± 1.12 |
| GraphSAGE | 80.18 ± 1.45 | 36.02 ± 2.25 | 45.27 ± 1.09 | 67.82 ± 2.19 | 10.11 ± 0.85 | 14.23 ± 1.30 | 59.03 ± 1.20 | 7.30 ± 1.01 | 6.45 ± 1.01 |
| BGNN | 82.41 ± 0.60 | 32.78 ± 1.61 | 36.40 ± 1.93 | 91.86 ± 0.43 | 67.96 ± 2.34 | 70.01 ± 1.49 | 58.10 ± 1.82 | 5.38 ± 1.67 | 7.21 ± 1.11 |
| PCGNN | 91.80 ± 0.24 | 80.72 ± 0.81 | 77.14 ± 1.17 | 91.67 ± 0.42 | 59.21 ± 3.77 | 64.79 ± 2.39 | 57.26 ± 0.92 | 3.78 ± 0.21 | 6.22 ± 0.66 |
| GAS | 90.22 ± 0.28 | 83.57 ± 2.38 | 77.91 ± 0.88 | 85.87 ± 1.15 | 44.04 ± 4.59 | 44.32 ± 3.02 | 58.43 ± 0.39 | 6.19 ± 1.45 | 7.03 ± 0.97 |
| DCI | 86.59 ± 2.44 | 56.16 ± 3.94 | 59.34 ± 3.29 | 87.59 ± 1.76 | 55.12 ± 2.29 | 58.32 ± 2.37 | 57.11 ± 1.40 | 5.86 ± 1.50 | 6.38 ± 0.93 |
| GATSep | 91.49 ± 0.41 | 80.33 ± 0.75 | 74.88 ± 0.45 | 84.10 ± 4.55 | 34.87 ± 3.21 | 40.69 ± 3.77 | 59.93 ± 2.56 | 4.58 ± 1.02 | 7.14 ± 0.38 |
| BernNet | 90.11 ± 0.89 | 80.72 ± 1.23 | 77.01 ± 1.02 | 91.09 ± 1.17 | 54.16 ± 2.31 | 57.38 ± 3.22 | 62.93 ± 1.32 | 6.39 ± 1.40 | 9.33 ± 0.61 |
| AMNet | 92.02 ± 0.49 | 81.03 ± 0.96 | 76.66 ± 1.15 | 92.80 ± 1.18 | 58.25 ± 2.01 | 59.86 ± 2.12 | 60.17 ± 0.56 | 5.46 ± 1.05 | 9.58 ± 1.66 |
| GHRN | 91.47 ± 1.23 | 80.46 ± 0.58 | 70.74 ± 0.99 | 90.48 ± 0.91 | 56.13 ± 3.97 | 62.39 ± 2.49 | 57.62 ± 2.11 | 4.47 ± 0.12 | 7.48 ± 0.44 |
| BWGNN | 91.99 ± 0.31 | 81.43 ± 0.94 | 77.33 ± 0.97 | 91.56 ± 1.23 | 63.97 ± 2.55 | 64.49 ± 2.15 | 63.06 ± 0.76 | 7.82 ± 1.89 | 9.64 ± 1.57 |
| BWMixup | 94.36 ± 0.67 | 82.94 ± 0.32 | 78.75 ± 0.25 | 94.93 ± 0.40 | 72.07 ± 2.39 | 72.47 ± 1.12 | 64.42 ± 0.98 | 8.03 ± 1.60 | 10.41 ± 1.72 |

Table 4.3: The experiment results on the Yelp-Chi and Tolokers datasets. The best, second, and third performances are highlighted with bold, underlined, and italics.

| Model | Yelp-Chi AUC ↑ | Yelp-Chi AUPRC ↑ | Yelp-Chi Rec@K ↑ | Tolokers AUC ↑ | Tolokers AUPRC ↑ | Tolokers Rec@K ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| GCN | 55.95 ± 0.92 | 18.08 ± 0.60 | 21.02 ± 1.27 | 61.44 ± 1.46 | 29.89 ± 0.75 | 29.63 ± 0.82 |
| GAT | 66.13 ± 0.65 | 24.86 ± 0.56 | 27.97 ± 0.91 | 70.95 ± 0.36 | 34.77 ± 0.71 | 36.59 ± 1.03 |
| GIN | 59.87 ± 1.61 | 22.74 ± 0.85 | 24.86 ± 1.35 | 71.23 ± 1.01 | 34.32 ± 0.95 | 36.24 ± 0.83 |
| GraphSAGE | 63.50 ± 1.35 | 23.79 ± 0.92 | 26.38 ± 1.46 | 66.98 ± 1.20 | 33.18 ± 0.31 | 33.46 ± 0.96 |
| BGNN | 54.62 ± 0.82 | 18.17 ± 1.25 | 15.22 ± 2.15 | 61.09 ± 1.00 | 30.45 ± 0.70 | 31.67 ± 0.82 |
| PCGNN | 65.21 ± 0.74 | 24.66 ± 0.84 | 27.15 ± 0.92 | 66.16 ± 0.51 | 33.94 ± 0.62 | 32.32 ± 0.55 |
| GAS | 63.82 ± 1.02 | 23.40 ± 0.36 | 26.78 ± 0.12 | 62.84 ± 0.39 | 30.96 ± 0.54 | 31.08 ± 0.64 |
| DCI | 63.27 ± 0.91 | 24.96 ± 1.22 | 27.47 ± 0.26 | 68.90 ± 3.93 | 33.42 ± 2.06 | 36.81 ± 1.40 |
| GATSep | 63.06 ± 1.31 | 22.50 ± 1.47 | 26.09 ± 0.48 | 69.22 ± 1.21 | 33.80 ± 0.92 | 36.14 ± 1.12 |
| BernNet | 64.65 ± 1.12 | 23.85 ± 1.37 | 26.02 ± 0.96 | 66.96 ± 2.46 | 32.82 ± 1.67 | 32.70 ± 0.85 |
| AMNet | 65.02 ± 1.37 | 23.72 ± 0.83 | 27.09 ± 0.98 | 66.25 ± 2.46 | 32.05 ± 1.56 | 28.53 ± 0.42 |
| GHRN | 62.24 ± 0.88 | 21.34 ± 1.06 | 22.80 ± 0.49 | 67.90 ± 1.03 | 33.48 ± 1.31 | 33.95 ± 0.92 |
| BWGNN | 67.04 ± 1.04 | 25.14 ± 0.79 | 28.78 ± 1.22 | 70.15 ± 0.35 | 36.62 ± 0.93 | 35.38 ± 0.48 |
| BWMixup | 68.89 ± 0.59 | 25.87 ± 0.16 | 29.73 ± 0.62 | 71.40 ± 0.21 | 36.23 ± 0.94 | 36.25 ± 0.76 |

4.2.2 Ablation Study on the Under-reaching Issue

To investigate the impact of our proposed mixup module on the under-reaching problem, we analyze the prediction errors of BWMixup and BWGNN across various K-hop distances on the Yelp-Chi and T-Finance datasets. To isolate the effect of the edge augmentation, we introduce a variant of our mixup augmentation, referred to as BWMixup w/o Edges, which excludes the mixed edges. In this configuration, the graph \(\hat{G} = \langle V, E, \hat{X} \rangle\) contains only the mixed features of the labeled node set \(D_L\), while the edge set remains unchanged. The experimental results are presented in Fig. 4.1.

Our framework, which applies mixup on the latent representation, demonstrates consistently lower classification errors, especially as the hop number increases. The intra-class mixup module, combined with latent-based sampling, effectively mitigates the under-reaching issue in high-heterophily settings. In contrast, the BWGNN model experiences a significant rise in error rates with increased hops on the T-Finance dataset, indicating difficulties in accurately capturing patterns as the model explores deeper into the graph. The variant without mixed edges shows a more gradual error increase, suggesting that the node-mixing feature stabilizes the training process of the standard BWGNN. While this variant allows the model to learn from a broader range of data points, it does not enhance the Beta-Wavelet filter's ability to generalize to distant unlabeled nodes.

Figure 4.1 plots the classification error against the K-hop distance to \(D_L\) (with \(C = 2\)) for BWMixup, BWMixup without mixed edges (BWMixup w/o Edges), and BWGNN combined with additional mixed edges. The integration of the two mixup components enables BWMixup to effectively mitigate the under-reaching issue, especially at higher hops, resulting in enhanced overall performance.

The edge-construction step plays a crucial role in addressing the under-reaching issue. However, the newly augmented edges may introduce noise into the learned model due to the inaccuracy of the pseudo labels \(\hat{y}_U\) during training. To assess the impact of noisy labels in this step, we experimented with two alternatives for \(\hat{y}_U\): Ground-Truth (GT) Labels and Random. In the GT Labels configuration, true labels are used for edge sampling, while the Random configuration samples edges without considering node labels. As shown in Table 4.4, our method exhibits a slight performance drop compared to the GT Labels setting, attributable to noise from inaccurate pseudo labels. The Random configuration, in contrast, shows a performance decrease of 2% to 3% on both datasets. Since GT labels are generally unavailable in real-world scenarios, these results suggest that our mixup method can mitigate the impact of noisy pseudo labels to a reasonable degree.

4.2.3 Comparison between Different Mixup Augmentations

We evaluate our mixup operator against several well-known alternatives for node classification on graphs, including Vanilla mixup, GraphMix, and NodeMixup. In this comparison, BWGNN serves as the default learning framework for all three operators.

Table 4.4: The comparison results of the effects of noisy labels in the edge-construction step: Pseudo Labels (ours), Ground-Truth Labels (GT Labels), and Random.

| Metrics | AUC ↑ | AUPRC ↑ | AUC ↑ | AUPRC ↑ |
| --- | --- | --- | --- | --- |
| Random | 65.41 ± 2.04 | 24.33 ± 1.24 | 92.67 ± 0.31 | 70.37 ± 1.73 |

In the Vanilla Mixup configuration, each pair of nodes is sampled following our latent sampling procedure, and their features are mixed together without connecting them.

In GraphMix, the mixup operator is applied directly to the hidden latent representation through randomized sampling. In our NodeMixup configuration, we implement the NodeMixup operator, which samples new edges according to pseudo-label distributions; this approach incorporates both intra-class and inter-class edges based on the pseudo labels. To examine the balance between training speed and performance, we also introduce a faster variant of our model, referred to as BWMixup-S. BWMixup-S reduces the frequency of the augmentation operator to a 1-to-M update schedule instead of the 1-to-1 schedule in the original version. More specifically, in Algorithm 1, the construction of \(\hat{G}\) and \(L_{aug}\) is performed, and \(L_{aug}\) added to the total loss \(L\), only every \(M\) training epochs. In our experiments, we set the update frequency to \(M = 5\), which reduces the computing time of the augmentation loss to 20% of the original.

We also provide the baseline result of BWGNN for comparison purposes.

Table 4.5 demonstrates that our mixup augmentation method surpasses other augmentation techniques across two tested datasets, achieving an improvement of approximately 2%-4% in the AUC metric and 2%-10% in the AUPRC metric.

BWMixup-S exhibits the lowest variance among the compared methods, although its AUC and AUPRC scores are 1%-2% lower than those of the original version due to the trade-off with speed. Despite this, the results indicate that BWMixup-S can still enhance the base BWGNN architecture to some extent.

The Vanilla Mixup method performs well when integrated with the Beta-Wavelet base module, although it causes a minor decline relative to the baseline on the Amazon dataset. Similarly, the NodeMixup method shows competitive results on the T-Finance dataset but a slight performance drop on the other datasets. In contrast, the GraphMix method fails to improve the BWGNN baseline on either dataset.

