PhD Thesis: Developing Deep Neural Networks for Network Attack Detection

DOCUMENT INFORMATION

Basic information

Title: Developing Deep Neural Networks for Network Attack Detection
Author: Vu Thi Ly
Supervisors: Assoc. Prof. Dr. Nguyen Quang Uy, Prof. Dr. Eryk Dutkiewicz, Dr. Diep N. Nguyen, Dr. Dinh Thai Hoang, Dr. Cao Van Loi
Institution: Military Technical Academy
Specialization: Mathematical Foundations for Informatics
Document type: Thesis
Year: 2021
City: Hanoi
Pages: 128
File size: 1.45 MB


Structure

  • Chapter 1. BACKGROUNDS
    • 1.1. Introduction
    • 1.2. Experiment Datasets
      • 1.2.1. NSL-KDD
      • 1.2.2. UNSW-NB15
      • 1.2.3. CTU13s
      • 1.2.4. Bot-IoT Datasets (IoT Datasets)
    • 1.3. Deep Neural Networks
      • 1.3.1. AutoEncoders
      • 1.3.2. Denoising AutoEncoder
      • 1.3.3. Variational AutoEncoder
      • 1.3.4. Generative Adversarial Network
      • 1.3.5. Adversarial AutoEncoder
    • 1.4. Transfer Learning
      • 1.4.1. Definition
      • 1.4.2. Maximum mean discrepancy (MMD)
    • 1.5. Evaluation Metrics
      • 1.5.1. AUC Score
      • 1.5.2. Complexity of Models
    • 1.6. Review of Network Attack Detection Methods
      • 1.6.1. Knowledge-based Methods
      • 1.6.2. Statistical-based Methods
      • 1.6.3. Machine Learning-based Methods
    • 1.7. Conclusion
  • Chapter 2. LEARNING LATENT REPRESENTATION FOR NETWORK ATTACK DETECTION
    • 2.1. Introduction
    • 2.2. Proposed Representation Learning Models
      • 2.2.1. Multi-distribution Variational AutoEncoder
      • 2.2.2. Multi-distribution AutoEncoder
      • 2.2.3. Multi-distribution Denoising AutoEncoder
    • 2.3. Using Proposed Models for Network Attack Detection
      • 2.3.1. Training Process
      • 2.3.2. Predicting Process
    • 2.4. Experimental Settings
      • 2.4.1. Experimental Sets
      • 2.4.2. Hyper-parameter Settings
    • 2.5. Results and Analysis
      • 2.5.1. Ability to Detect Unknown Attacks
      • 2.5.2. Cross-datasets Evaluation
      • 2.5.3. Influence of Parameters
      • 2.5.4. Complexity of Proposed Models
      • 2.5.5. Assumptions and Limitations
    • 2.6. Conclusion
  • Chapter 3. DEEP GENERATIVE LEARNING MODELS FOR NETWORK ATTACK DETECTION
    • 3.1. Introduction
    • 3.2. Deep Generative Models for NAD
      • 3.2.1. Generating Synthesized Attacks using ACGAN-SVM
      • 3.2.2. Conditional Denoising Adversarial AutoEncoder
      • 3.2.3. Borderline Sampling with CDAAE-KNN
    • 3.3. Using Proposed Generative Models for Network Attack Detection
      • 3.3.1. Training Process
      • 3.3.2. Predicting Process
    • 3.4. Experimental Settings
      • 3.4.1. Hyper-parameter Setting
      • 3.4.2. Experimental sets
    • 3.5. Results and Discussions
      • 3.5.1. Performance Comparison
      • 3.5.2. Generative Models Analysis
      • 3.5.3. Complexity of Proposed Models
      • 3.5.4. Assumptions and Limitations
    • 3.6. Conclusion
  • Chapter 4. DEEP TRANSFER LEARNING FOR NETWORK ATTACK DETECTION
    • 4.1. Introduction
    • 4.2. Proposed Deep Transfer Learning Model
      • 4.2.1. System Structure
      • 4.2.2. Transfer Learning Model
    • 4.3. Training and Predicting Process using the MMD-AE Model
      • 4.3.1. Training Process
      • 4.3.2. Predicting Process
    • 4.4. Experimental Settings
      • 4.4.1. Hyper-parameters Setting
      • 4.4.2. Experimental Sets
    • 4.5. Results and Discussions
      • 4.5.1. Effectiveness of Transferring Information in MMD-AE
      • 4.5.2. Performance Comparison
      • 4.5.3. Processing Time and Complexity Analysis
    • 4.6. Conclusion
  List of Figures
    • 1.2 Structure of generative models (a) AE, (b) VAE, (c) GAN, and (d) AAE
    • 1.3 Traditional machine learning vs. transfer learning
    • 2.1 Visualization of our proposed ideas: Known and unknown …
    • 2.2 The probability distribution of the latent data (z0) of MAE at epoch 0, 40 and 80 in the training process
    • 2.3 Using non-saturating area of activation function to separate known and unknown attacks from normal data
    • 2.4 Illustration of an AE-based model (a) and using it for …
    • 2.5 Latent representation resulting from AE model (a,b) and …
    • 2.6 Influence of noise factor on the performance of MDAE: 0.01 results in the highest AUC, and lowest FAR and MDR
    • 2.7 AUC scores of (a) the SVM classifier and (b) the NCT …
    • 3.1 Structure of CDAAE
    • 4.1 Proposed system structure
    • 4.2 Architecture of MMD-AE
    • 4.3 MMD of latent representations of the source (IoT-1) and …

  List of Tables
    • 1.2 Number of training data samples of malware datasets
    • 1.3 The nine IoT datasets
    • 2.1 Hyper-parameters for AE-based models
    • 2.2 AUC scores produced from the four classifiers SVM, PCT, …
    • 2.3 AUC score of the NCT classifier on the IoT-2 dataset in …
    • 2.4 Complexity of AE-based models trained on the IoT-1 dataset
    • 3.1 Values of grid search for classifiers
    • 3.2 Hyper-parameters for CDAAE
    • 3.3 Result of SVM, DT, and RF on the network attack datasets
    • 3.4 Parzen window-based log-likelihood estimates of generative models
    • 3.5 Processing time of training and generating samples pro…
    • 4.1 Hyper-parameter setting for the DTL models

Content


BACKGROUNDS

Introduction

The Internet has become an essential part of our lives, but it also presents significant security threats that hinder its growth. Network attacks are a primary concern for online security, drawing considerable attention from researchers. Recent studies, such as those by Zou et al., have reviewed the security requirements of wireless networks and outlined the various attacks they face. Additionally, security threats in cloud computing have been analyzed, highlighting the need for effective attack detection methods to ensure the safety of information systems.

Security data plays a crucial role in identifying network traffic patterns that can signal potential security attacks, serving as a fundamental element in both the training and detection phases. Various methodologies are employed to analyze this data for effective attack detection, with Network Attack Detection (NAD) methods leveraging insights from network traffic datasets. The following section will outline several commonly used network traffic datasets referenced in this thesis.

Experiment Datasets

This section outlines the experimental datasets utilized to assess the effectiveness of the proposed models. The experiments are conducted on several reputable security datasets: two network datasets, NSL-KDD and UNSW-NB15, three malware datasets derived from the CTU-13 dataset, and the IoT attack datasets.

In this thesis, we utilize nine recent IoT attack datasets that encompass a variety of attacks, effectively demonstrating the efficacy of Deep Transfer Learning (DTL) techniques. These datasets are particularly relevant as they reflect network traffic from different IoT devices, aligning with the assumptions of a DTL model. To address the issue of imbalanced datasets, we also incorporate commonly used imbalanced datasets, including NSL-KDD, UNSW-NB15, and CTU-13.

Table 1.1: Number of training data samples of network attack datasets.

Table 1.2: Number of training data samples of malware datasets.

Classes   No.       Classes   No.       Classes   No.
Benign    518904    Benign    292485    Benign    37000
Malware   230       Malware   1420      Malware   37000

The NSL-KDD dataset addresses inherent issues found in the KDD'99 dataset and is utilized for network attack analysis. Each sample consists of 41 features and is categorized as either a type of attack or normal behavior. The training set encompasses 24 distinct attack types, while the testing set includes an additional 14 attack types. Simulated attack samples are classified into four categories: DoS, R2L, U2R, and Probing, with detailed information provided in Table 1.1.

The UNSW-NB15 dataset was developed using the IXIA PerfectStorm tool in the Cyber Range Lab at the Australian Centre for Cyber Security. It encompasses nine attack categories: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. Each data sample consists of 49 features derived from the Argus and Bro-IDS tools, employing twelve algorithms to analyze network packet characteristics. Detailed information about the datasets can be found in Table 1.1.

The CTU-13 dataset is a publicly available malware dataset captured at the Czech Technical University (CTU), Czech Republic, in 2011. It encompasses normal traffic alongside various real-world botnet scenarios, featuring thirteen distinct types of malware, each utilizing different protocols and actions. For our analysis, we selected three scenarios representing three specific malware types: Menti, NSIS.ay, and Virut. Detailed information about the datasets is provided in Table 1.2.

1.2.4 Bot-IoT Datasets (IoT Datasets)

We utilize nine IoT attack-related datasets introduced by Y. Meidan et al. for the evaluation of our proposed models. These datasets were collected from nine commercial IoT devices in a laboratory setting, focusing on two prominent IoT-based botnet families: Mirai and BASHLITE (Gafgyt). Each botnet family encompasses five distinct IoT attacks. Notably, three datasets, Ennio Doorbell (IoT-3), Provision PT 838 Security Camera (IoT-6), and Samsung SNH 1011 N Webcam (IoT-7), feature only one botnet family, each with five types of attacks. The remaining datasets include both Mirai and Gafgyt, comprising a total of ten types of DDoS attacks.

After applying one-hot encoding and eliminating identifying features such as 'saddr', 'sport', 'daddr', and 'dport', each data sample consists of 115 attributes. These attributes are organized into three categories: stream aggregation, time-frame, and statistical attributes. Detailed information about the datasets can be found in Table 1.3.

Table 1.3: The nine IoT datasets.

Dataset  Device Name           Training Attacks   Training size   Testing size
IoT-1    Danmini Doorbell      combo, ack         239488          778810
IoT-2    Ecobee Thermostat     combo, ack         59568           245406
IoT-3    Ennio Doorbell        combo, tcp         174100          181400
…        WHT Security Camera   combo, ack         189055          674001
…        WHT Security Camera   combo, ack         176349          674477

Deep Neural Networks

This section outlines the mathematical foundations of various deep neural network models that will inform the development of our proposed models in the subsequent chapters.

A deep neural network is an advanced artificial neural network characterized by multiple layers between the input and output layers, designed to approximate a function \( f^* \). It establishes a mapping \( y = f(x, \theta) \) and optimizes the parameters \( \theta \) for the best approximation. This framework is particularly effective for supervised learning, as it transforms an input vector into an output vector that simplifies subsequent machine learning tasks, leveraging large models and extensive labeled training datasets.

This section presents the structure of the AutoEncoder (AE) model and the proposed work that exploits the AE’s representation.

An AutoEncoder (AE) is a type of neural network designed to replicate its input at its output. It consists of two main components: the encoder and the decoder. Given an input dataset \( x = \{x_1, x_2, \ldots, x_n\} \), the encoder \( q_\phi \), with parameters \( \phi = (W, b) \) (weight matrix and bias vector), maps each input sample \( x_i \) to a latent representation \( z_i \), while the decoder \( p_\theta \), with parameters \( \theta = (W', b') \), reconstructs the input from \( z_i \), producing \( \hat{x}_i \):

\[ z_i = q_\phi(x_i) = a_e(W x_i + b), \quad (1.1) \]

\[ \hat{x}_i = p_\theta(z_i) = a_d(W' z_i + b'), \quad (1.2) \]

where \( a_e \) and \( a_d \) are the activation functions of the encoder and the decoder, respectively.

The loss function of an AE for a single sample \( x_i \) measures the difference between the input \( x_i \) and the output \( \hat{x}_i \). Over an entire dataset, it is typically the mean squared error (MSE) across all data samples:

\[ \ell_{AE}(\theta, \phi; x) = \frac{1}{n} \sum_{i=1}^{n} \|x_i - \hat{x}_i\|^2. \quad (1.3) \]
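To make the construction concrete, below is a minimal sketch of the AE of Eqs. (1.1)-(1.3) in TensorFlow/Keras, the framework used for the experiments in this thesis (Section 2.4). The layer sizes, activations, and the commented training call are illustrative assumptions, not the thesis's exact configuration.

```python
import tensorflow as tf

n_features = 115   # illustrative: the IoT datasets have 115 attributes
latent_dim = 11    # illustrative bottleneck size, roughly [1 + sqrt(n)]

# Encoder z = a_e(Wx + b), Eq. (1.1), and decoder x_hat = a_d(W'z + b'), Eq. (1.2).
inputs = tf.keras.Input(shape=(n_features,))
z = tf.keras.layers.Dense(latent_dim, activation="tanh", name="encoder")(inputs)
x_hat = tf.keras.layers.Dense(n_features, activation="sigmoid", name="decoder")(z)

ae = tf.keras.Model(inputs, x_hat)
# The MSE loss of Eq. (1.3) averages ||x_i - x_hat_i||^2 over all samples.
ae.compile(optimizer="adam", loss="mse")
# ae.fit(x_train, x_train, epochs=100, batch_size=100)
```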

The effectiveness of network attack detection (NAD) models built on AEs is influenced by the choice of activation functions, as each function captures specific characteristics of the input data. Recent research has focused on combining different activation functions within AE models to improve data representation. For instance, a study demonstrated that integrating the hyperbolic tangent (Tanh) and logistic (Sigmoid) functions can enhance the accuracy of latent representations in classification tasks. However, the Sigmoid function often suffers from the vanishing gradient problem, which degrades its performance when training an AE with many layers on a large dataset such as an IoT anomaly dataset.

We propose a method to enhance the effectiveness of AEs for the NAD problem. By combining the ReLU and Tanh activation functions, we aim to represent network traffic in a higher-level representation space. Our analysis of three popular activation functions, Sigmoid, Tanh, and ReLU, demonstrates that Tanh and ReLU are more suitable than Sigmoid for learning the characteristics of IoT anomaly data. The details of the proposed method are outlined below.

We design two AE models that have the same network structure, namely AE1 and AE2. Denote the encoder and decoder of AE1 as En1 and De1, and those of AE2 as En2 and De2. Let \( W_{En_1}, b_{En_1} \) and \( W_{En_2}, b_{En_2} \) be the weight matrices and bias vectors of the two encoders, and \( W_{De_1}, b_{De_1} \) and \( W_{De_2}, b_{De_2} \) those of the two decoders. (Note that when the output of an activation function reaches its saturated area, its gradient approaches zero, which affects the training of the encoders of AE1 and AE2.) The outputs of the two models are:

\[ z_1 = f(W_{En_1} x + b_{En_1}), \quad (1.4) \]

\[ \tilde{x}_1 = f(W_{De_1} z_1 + b_{De_1}), \quad (1.5) \]

\[ z_2 = g(W_{En_2} x + b_{En_2}), \quad (1.6) \]

\[ \tilde{x}_2 = g(W_{De_2} z_2 + b_{De_2}), \quad (1.7) \]

where the activation functions \( f \) and \( g \) are the Tanh and ReLU functions, respectively.

The AE1 and AE2 models use the activation functions \( f \) and \( g \) for all hidden layers, except the bottleneck and final layers, which employ the Sigmoid function. Each model is trained independently on batches of the training data. The loss functions of AE1 and AE2 compute the MSE across all data samples:

\[ \ell_{AE_1}(W, b; x) = \frac{1}{n} \sum_{i=1}^{n} \|x_i - \tilde{x}_1^{(i)}\|_2^2, \quad (1.8) \]

\[ \ell_{AE_2}(W, b; x) = \frac{1}{n} \sum_{i=1}^{n} \|x_i - \tilde{x}_2^{(i)}\|_2^2, \quad (1.9) \]

where \( n \) is the number of training samples and \( (W, b) \) is the learning parameter set of the corresponding AE model.

After training, we utilize the encoder components of the two AE models, En1 and En2, to create the latent representations \( z_1 \) and \( z_2 \). These combined representations serve as inputs for classification algorithms, replacing the original data \( x \). This approach leverages the advantages of both the Tanh and ReLU activation functions, leading to a significant enhancement in the accuracy of the classification algorithms, as the sketch below illustrates.
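The following sketch illustrates the two-AE scheme under stated assumptions: AE1 uses Tanh, AE2 uses ReLU, each is trained independently with the MSE losses of Eqs. (1.8)-(1.9), and the concatenated latent codes feed a linear SVM. Function names and sizes are hypothetical.

```python
import numpy as np
import tensorflow as tf
from sklearn.svm import LinearSVC

def build_ae(n_features, latent_dim, act):
    """Build one AE whose hidden layers use the given activation."""
    inp = tf.keras.Input(shape=(n_features,))
    z = tf.keras.layers.Dense(latent_dim, activation=act)(inp)
    out = tf.keras.layers.Dense(n_features, activation="sigmoid")(z)
    ae = tf.keras.Model(inp, out)
    ae.compile(optimizer="adam", loss="mse")   # Eqs. (1.8)-(1.9)
    return ae, tf.keras.Model(inp, z)          # full model and encoder-only view

ae1, en1 = build_ae(115, 11, "tanh")   # AE1: Tanh, Eqs. (1.4)-(1.5)
ae2, en2 = build_ae(115, 11, "relu")   # AE2: ReLU, Eqs. (1.6)-(1.7)

def fit_and_classify(x_train, y_train):
    # Train each AE independently to reconstruct its own input.
    ae1.fit(x_train, x_train, epochs=50, batch_size=100, verbose=0)
    ae2.fit(x_train, x_train, epochs=50, batch_size=100, verbose=0)
    # Concatenate z1 and z2 and use the combined code in place of x.
    z = np.hstack([en1.predict(x_train), en2.predict(x_train)])
    return LinearSVC().fit(z, y_train)
```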

Figure 1.1: AUC comparison for AE models using different activation functions (sigmoid, tanh, relu, sigmoid-tanh, tanh-relu) on the IoT-4 dataset.

The AUC scores during the training process are visualized in Fig. 1.1, which compares the AUC scores of a Support Vector Machine (SVM) across five AE-based models on the IoT-4 dataset. The results indicate that SVM struggles to classify the representation from the Sigmoid-based model, reflected in an AUC score of approximately 0.5. In contrast, the Tanh-based model achieves an AUC score close to 0.8. Furthermore, the combined Sigmoid-Tanh-based model does not surpass the performance of the Tanh-based model, primarily due to the inefficacy of the Sigmoid-based model. Therefore, employing the Sigmoid function in the AE model for IoT anomaly detection proves less effective than reported in [22].

The AUC score of the ReLU-based model exceeds 0.9 during training, indicating strong performance. Additionally, combining the ReLU and Tanh activation functions significantly boosts performance after several training epochs. This improvement is attributed to the Tanh function of AE1 mitigating the dying-ReLU problem of AE2, and the ReLU function of AE2 handling the vanishing-gradient problem of the Tanh function.

Figure 1.2: Structure of generative models (a) AE, (b) VAE, (c) GAN, and (d) AAE.

A Denoising AutoEncoder (DAE) is a specialized type of AE designed to reconstruct the original input from a noisy version, allowing it to capture the true distribution of the input rather than merely learning the identity. The most common way to introduce noise into the input data is additive isotropic Gaussian noise, defined as a conditional distribution \( C(\tilde{x} \mid x) \) over a corrupted sample given the original data sample:

\[ \tilde{x} = x + x_{noise}, \quad x_{noise} \sim \mathcal{N}(0, \sigma_{noise}^2), \quad (1.10) \]

where the noise component \( x_{noise} \) is drawn from a normal distribution with mean 0 and standard deviation \( \sigma_{noise} \). This establishes the denoising criterion for Gaussian corruption.

Let \( \tilde{x}_i \) represent the corrupted version of the input data \( x_i \), drawn from the conditional distribution \( C(\tilde{x} \mid x) \). Note that the corruption process is applied stochastically to the original input whenever a data point \( x_i \) is evaluated. Consequently, the loss function of the DAE can be formulated from the loss function of the AE:

\[ \ell_{DAE}(\theta, \phi; x) = \frac{1}{n} \sum_{i=1}^{n} \|x_i - p_\theta(q_\phi(\tilde{x}_i))\|^2, \quad (1.11) \]

where \( q_\phi \) and \( p_\theta \) are the encoder and decoder parts of the DAE, respectively, and \( n \) is the number of data samples in a dataset.
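A minimal sketch of the Gaussian corruption of Eq. (1.10) and a DAE trained with the loss of Eq. (1.11); the noise level and architecture are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

def corrupt(x, sigma_noise=0.1):
    """Additive isotropic Gaussian corruption C(x_tilde | x), Eq. (1.10)."""
    return x + np.random.normal(0.0, sigma_noise, size=x.shape)

# The DAE has the same architecture as a plain AE; only the training pairs
# differ: it maps the corrupted x_tilde back to the clean x, Eq. (1.11).
inp = tf.keras.Input(shape=(115,))
z = tf.keras.layers.Dense(11, activation="relu")(inp)
out = tf.keras.layers.Dense(115, activation="sigmoid")(z)
dae = tf.keras.Model(inp, out)
dae.compile(optimizer="adam", loss="mse")
# dae.fit(corrupt(x_train), x_train, epochs=100, batch_size=100)
```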

Transfer Learning

Transfer Learning (TL) involves transferring knowledge from a source domain to a target domain, where the two domains differ but share related data distributions. This section defines TL and discusses the distance metric used in this thesis to measure the difference between the two data distributions.

Transfer Learning (TL) enhances generalization in new learning tasks by leveraging knowledge gained from previous tasks. Unlike traditional machine learning, where datasets and training processes are isolated, TL utilizes features and weights from models previously trained in a source domain to inform the training of models in a target domain. This approach is particularly beneficial in scenarios with limited data or a lack of labeled information in the target domain.

We consider the TL method with an input space \( X \) and its label space \( Y \). Two domain distributions are given: a source domain \( \mathcal{D}_S \) and a target domain \( \mathcal{D}_T \), with the corresponding samples \( D_S = (X_S, Y_S) = \{(x_S^i, y_S^i)\}_{i=1}^{n_S} \) and \( D_T = (X_T) = \{x_T^i\}_{i=1}^{n_T} \), where \( n_S \) and \( n_T \) are the numbers of samples in the source and target domains, respectively. The objective is to build a classification model using labeled data from the source domain and apply it to the unlabeled target domain.

MMD, like the KL divergence, estimates the difference between two distributions, but it offers greater flexibility by allowing nonparametric distance estimation. Additionally, MMD eliminates the need to compute the intermediate density of the distributions. The formal definition of MMD is given in Eq. 1.17:

\[ \mathrm{MMD}(X_S, X_T) = \Big| \frac{1}{n_S} \sum_{i=1}^{n_S} \xi(x_S^i) - \frac{1}{n_T} \sum_{i=1}^{n_T} \xi(x_T^i) \Big|, \quad (1.17) \]

where \( n_S \) and \( n_T \) are the numbers of samples of the source and target domains, respectively, and \( \xi \) denotes the representation of the original data \( x_S^i \) or \( x_T^i \).
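A small numeric sketch of the empirical estimate in Eq. (1.17): the distance between the mean feature embeddings of the source and target samples. The identity feature map and the L2 norm used here are assumptions; the extracted text does not pin them down.

```python
import numpy as np

def mmd(xs, xt, xi=lambda x: x):
    """Empirical MMD in the spirit of Eq. (1.17): distance between the
    mean feature embeddings xi(.) of source and target samples."""
    mean_s = np.mean(xi(xs), axis=0)  # (1/n_S) * sum_i xi(x_S^i)
    mean_t = np.mean(xi(xt), axis=0)  # (1/n_T) * sum_i xi(x_T^i)
    return np.linalg.norm(mean_s - mean_t)

# Example with two random 115-dimensional samples of different sizes.
src = np.random.rand(200, 115)
tgt = np.random.rand(300, 115) + 0.5
print(mmd(src, tgt))
```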

Evaluation Metrics

In this section, we present two evaluation metrics that will be used to evaluate the performance of our proposed models: the AUC score and the model's complexity.

The AUC score serves as the key performance metric for evaluating the effectiveness of our proposed models. In addition to the AUC score, we employ other metrics to assess different aspects of the models in each experimental scenario. For instance, we use the Parzen window-based log-likelihood of generative models to evaluate the quality of the generated samples.

AUC, or Area Under the Receiver Operating Characteristic Curve, is determined by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) across different threshold settings. The TPR indicates the accuracy of positive predictions, while the FPR represents the rate of false positives among actual negative cases.

The two rates are computed as \( TPR = \frac{TP}{TP + FN} \) and \( FPR = \frac{FP}{FP + TN} \), where \( TP \) and \( FP \) are the numbers of correctly and incorrectly predicted samples of the positive class, while \( TN \) and \( FN \) are those of the negative class. These metrics are advantageous due to their intuitive nature and ease of implementation. However, they do not differentiate between classes, which can be inadequate for assessing classifiers, particularly on imbalanced datasets.

A perfect classifier achieves an FPR of 0% and a TPR of 100%, placing it in the top left corner of the ROC plot, while the worst-case classifier sits in the bottom right corner with an FPR of 100% and a TPR of 0%. The area under the ROC curve (AUC) quantifies the average performance of a classification model across various thresholds: a random classifier scores an AUC of 0.5 and a perfect classifier scores 1.0. Consequently, most classifiers have AUC scores between these two values.
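For example, scikit-learn (used for the experiments in this thesis) computes the AUC directly from labels and classifier scores; the values below are illustrative.

```python
from sklearn.metrics import roc_auc_score

# y_true: 1 for attack, 0 for benign; y_score: the classifier's scores.
y_true  = [0, 0, 0, 1, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.2, 0.8, 0.7, 0.9, 0.3, 0.6]
print(roc_auc_score(y_true, y_score))  # 1.0: every attack outranks every benign sample
```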

Model complexity is primarily determined by the number of trainable parameters; a model with more trainable parameters is considered more complex. The calculation of these parameters varies among different types of neural networks. For fully connected layers, the number of trainable parameters is \( (n + 1) \times m \), where \( n \) is the number of input units and \( m \) is the number of output units; the additional term accounts for the bias. This calculation is essential for understanding the model size of a neural network.
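A quick check of the \( (n + 1) \times m \) formula with Keras, using illustrative sizes (n = 115 inputs, as in the IoT datasets, and m = 100 units):

```python
import tensorflow as tf

# One fully connected layer with n = 115 inputs and m = 100 output units:
# trainable parameters = (n + 1) * m = (115 + 1) * 100 = 11600 (the +1 is the bias).
model = tf.keras.Sequential([tf.keras.layers.Dense(100, input_shape=(115,))])
print(model.count_params())  # 11600
```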

In this study, we highlight the advantages of neural network architectures with fewer parameters, which include more efficient distributed training, reduced overhead when exporting models, and better suitability for embedded deployment. We compare the model sizes and complexity of deep neural network-based models by examining their numbers of trainable parameters. Additionally, we report the inference time of each proposed model to facilitate comparison. All experiments were conducted on the same computing platform: Ubuntu 16.04 (64-bit) with an Intel(R) Core(TM) i5-5200U CPU (two cores) and 4 GB of RAM.

Review of Network Attack Detection Methods

A Network Attack Detection (NAD) system analyzes security data to identify network attacks by selecting suitable features to represent network traffic. It employs various techniques to detect malicious traffic, which can be classified into three main categories: knowledge-based methods, statistical-based methods, and machine learning-based methods.

This method relies on knowledge of specific network attacks through pre-defined attack signatures. When incoming network traffic matches these signatures, it is flagged as a potential attack; this is one of the earliest techniques for addressing the network attack detection (NAD) problem. The attack signatures are defined by specific strings, and the incoming traffic is analyzed for matches. While this straightforward approach effectively detects known attacks, the process can become computationally intensive with a large number of string rules.

Knowledge-based methods for NAD often utilize language descriptions or expert systems. For instance, a finite state machine can be employed to manage execution flow and monitor historical data, representing normal network behaviors, with any deviations flagged as potential attacks. Additionally, a defined language syntax of rules can characterize attacks, developed through collaboration between knowledge engineers and domain experts. While these methods enable rapid and accurate detection of common network attacks, they fall short in identifying unknown attacks, which pose a greater threat to information security.

In statistics-based methods, a distribution of normal traffic is established to identify typical behaviors, allowing the NAD system to detect low-probability behaviors, or anomalies, as potential attacks. These methods utilize statistical metrics, including the mean, median, and standard deviation of network packets, to set a threshold. When an incoming network behavior exceeds this threshold, it is classified as a network attack.

Ye et al. [41] developed a normal profile for both univariate and multivariate metrics of computer system behavior, with the NAD system monitoring anomalies in each individual metric. Additionally, Viinikka et al. [42] focused on aggregating time-series features to identify anomalies. Qingtao et al. [43] introduced a method for detecting anomalous samples through abrupt changes in time-series data. In contrast, Bhuyan et al. [44] noted that statistical-based methods, although straightforward, tend to lack accuracy.

Knowledge-based and statistics-based methods are efficient and time-saving; however, they often require extensive prior knowledge of network attacks to effectively identify them.

Due to their increasing size and complexity, current network environments are ill-equipped to handle the rapid evolution of network attacks. This prevents us from gaining sufficient prior knowledge to protect systems from potential harm. The challenge of detecting zero-day attacks, i.e., new and previously unknown threats, further complicates network security.

The machine learning-based approach has significantly advanced network attack detection (NAD), effectively resolving problems associated with knowledge-based and statistics-based methods. By leveraging machine learning models, these methods extract insights from vast amounts of network traffic data, enabling the identification of new characteristics and the prediction of network behaviors, distinguishing between normal activities and various types of attacks. This area of research is gaining substantial attention, particularly because machine learning-based NAD systems can detect novel or unknown attacks. Machine learning methods in this context can be categorized into three main types: unsupervised learning, semi-supervised learning, and supervised learning.

Many NAD systems focus on building models without the need for attack data, utilizing unsupervised and semi-supervised samples. For instance, [45] introduced the K-means clustering algorithm to effectively reduce network packet payload size, enhancing classification accuracy. Additionally, Hongchun et al. [46] developed a framework for detecting attacks in wireless sensor networks, employing the Mean Shift clustering algorithm to identify anomalous patterns that deviate from normal behavior; they further utilized Support Vector Machines (SVM) to optimize the separation between normal and anomalous features. Meanwhile, Nomm et al. [13] proposed a semi-supervised method for identifying IoT attacks, which involves re-sampling datasets before applying Local Outlier Factor and One-Class SVM for malicious sample detection. A significant drawback of this approach is that the sampling technique may alter the original data distribution, potentially diminishing the effectiveness of the detection methods.

Unsupervised and semi-supervised learning offer the significant advantage of identifying unknown attacks without prior knowledge or labeled data. However, their effectiveness in detecting known attacks is limited, as these models are trained without specific attack information. Additionally, the necessity of manually setting threshold values to distinguish between normal and anomalous data samples hinders the model's ability to learn autonomously from the data. The following sub-sections discuss previous work related to the three approaches outlined in the thesis.

1.6.3.1 Machine Learning Methods for Network Attack Detection

Support Vector Machines (SVMs) are a widely used machine learning technique for network attack detection (NAD) problems, where input vectors are non-linearly mapped to a higher-dimensional feature space, resulting in a linear decision surface. Numerous studies have demonstrated the effectiveness of SVM in addressing NAD issues. Specifically, research has shown that various SVM models, including Linear SVM, Quadratic SVM, Fine Gaussian SVM, and Medium Gaussian SVM, achieve high accuracy in NAD applications. Additionally, SVM can be utilized for feature selection, helping to identify critical features for attack detection systems.

Tree-based machine learning approaches have also been extensively analyzed for network attack detection (NAD). Nadiammai et al. demonstrated that Random Forest (RF) outperforms other algorithms, such as Decision Stump, ID3, and J48, when identifying network attacks on the NSL-KDD dataset. RF, which uses bagging ensembles of random trees, has been highlighted in surveys by Paulo et al. and implemented for feature selection and attack detection, confirming its popularity in NAD systems. Additionally, Bahsi et al. employed decision tree and K-nearest neighbor algorithms to identify the Mirai and Gafgyt botnet families, while Chawathe explored various machine learning algorithms, including J48 and RF, for detecting IoT anomalies.

Machine learning techniques excel at identifying known attacks by leveraging both normal data and previously recognized threats during training. However, they struggle to detect novel attack types. Additionally, the performance of these models can be compromised by imbalanced datasets.

Machine learning-based methods for network attack detection (NAD) are effective but face three significant challenges: the heterogeneity and complexity of network traffic, the imbalance of network traffic datasets, and the difficulty of collecting labeled data across various network types. This thesis aims to address these challenges in the NAD model by leveraging machine learning techniques.

1.6.3.2 Machine Learning Methods for Handling Imbalanced Data

Techniques for addressing the imbalance problem in datasets can be categorized into cost-sensitive learning and data re-sampling. Cost-sensitive learning adjusts the algorithm by assigning greater misclassification costs to the minority class than to the majority class. For instance, Zhang et al. utilized an intelligent sampling technique to create smaller balanced subsets before applying cost-sensitive SVM learning. Similarly, Li et al. introduced a method that assigns higher weights to the minority class when training an ensemble machine learning algorithm, such as AdaBoost, to effectively manage imbalanced data.
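As an illustration of cost-sensitive learning (not a method from the thesis), scikit-learn classifiers accept per-class misclassification weights; the 50:1 ratio below is hypothetical.

```python
from sklearn.svm import SVC

# Penalise misclassifying the minority (attack) class 50x more heavily
# than the majority class; the 50:1 ratio is purely illustrative.
clf = SVC(class_weight={0: 1, 1: 50})
# clf.fit(x_train, y_train)

# Or let scikit-learn weight classes inversely to their frequencies:
clf_balanced = SVC(class_weight="balanced")
```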

Conclusion

This chapter outlines the essential background related to the thesis, defining the network attack detection (NAD) problem and the common security datasets utilized in NAD methods. It discusses various approaches to the problem, including statistical, knowledge-based, and machine learning-based methods. Among these, machine learning-based methods are highlighted as the most effective and widely adopted solutions for the NAD problem.

In addressing the NAD problem, it is crucial to effectively detect both known and unknown attacks on network systems. Our approach involves developing deep neural network models that represent network traffic in a feature space conducive to accurate classification of normal and attack samples. Additionally, we will tackle the imbalanced data issue that often hampers the accuracy of machine learning models. To further enhance detection capabilities, we will apply a deep transfer learning (DTL) technique to address the lack of label information in new network traffic domains, leveraging label data from related domains to improve NAD results.

LEARNING LATENT REPRESENTATION FOR NETWORK ATTACK DETECTION

Introduction

The rapid advancement of network devices and services has significantly enhanced various sectors, including healthcare, transportation, and manufacturing. However, the data exchanged by these devices often contains sensitive user information, making them prime targets for hackers. Additionally, the increasing diversity and number of network devices contribute to a surge in emerging network attacks. Detecting these attacks in the Internet of Things (IoT) environment poses a considerable challenge, particularly due to the fast-paced evolution of network threats.

Machine learning has demonstrated significant potential for the network attack detection (NAD) problem. NAD methods can be classified into three categories based on data label availability: supervised, semi-supervised, and unsupervised learning. Supervised learning relies on both labeled normal and abnormal data to create predictive models, but it struggles with unknown attacks not present in the training data. In contrast, semi-supervised learning uses only labeled normal data to build generative models of normal behavior, while unsupervised learning operates without labeled data, assuming that attack samples are much fewer than normal ones. Both semi-supervised and unsupervised methods are more resilient to unknown attacks, making them popular in NAD applications. However, they may not perform as well as supervised methods in detecting known attacks. This chapter introduces a novel approach to address these challenges.

Our proposed approach effectively distinguishes known and unknown abnormal samples from normal samples within the latent representation space, as illustrated in Figure 2.1. This method demonstrates strong performance against both known and unknown attacks.

Unknown attacks pose significant risks to network systems, as they can evade advanced security measures and cause severe damage. The complexity and heterogeneity of network attack data add to the challenges faced by Network Attack Detection (NAD) systems. Recently, there has been growing interest in deep learning-based NAD approaches, particularly in leveraging AutoEncoders (AEs) for effective detection. This chapter introduces a novel learning method that combines the strengths of supervised learning for identifying known IoT attacks with unsupervised capabilities for detecting unknown threats.

In the new representation space, normal data and known network attacks are distinctly separated into two regions: the normal region (green circle points) and the anomalous region (red plus points). We propose that unknown attacks will cluster near the anomalous region (yellow triangle points) due to shared characteristics with known attacks, facilitating their detection. To achieve this feature representation, we introduce two innovative regularized AutoEncoders (AEs), referred to as multi-distribution AEs.

These AutoEncoders (AEs), specifically the Multi-distribution Denoising AutoEncoder (MDAE), are designed to learn and construct the desired feature representations in their bottleneck layers, known as the latent feature space. This representation enhances the performance of supervised learning-based NAD methods, including the linear Support Vector Machine (SVM), Perceptron (PCT), Nearest Centroid (NCT), and Linear Regression (LR). To highlight the effectiveness of the representation methods, we deliberately use these simple linear classifiers, minimizing the classifiers' influence on the experimental results.

The major contributions of this chapter are as follows:

• Introduce a new latent feature representation to enhance the abil- ity to detect unknown network attacks of supervised learning-based NAD methods.

• Propose three novel regularized AEs designed to learn a new latent representation. A unique regularizer term is incorporated into the loss function of these AEs to effectively distinguish between normal and abnormal samples within the latent space. The resulting latent representation serves as input for classifiers aimed at identifying abnormal samples.

• Conduct comprehensive experiments on nine recent IoT botnet datasets to assess our models. The findings indicate that our representation learning models significantly enhance the performance of simple classifiers compared to learning from the original features or from latent representations generated by other AEs.

• Provide a comprehensive analysis of the characteristics of the latent representation in detecting unknown attacks, including cross-dataset testing and an evaluation of robustness across different hyper-parameter values. This investigation highlights the practical applicability of the proposed models.

This chapter is structured as follows: Section 2.2 provides an overview of the proposed models, Section 2.4 outlines the experimental settings, Section 2.5 offers a discussion and analysis of the results derived from these models, and Section 2.6 concludes the chapter and proposes directions for future research.

Proposed Representation Learning Models

This chapter introduces a novel latent representation designed to enhance supervised learning-based network attack detection (NAD) methods in identifying both known and unknown attacks. Additionally, it presents three innovative regularized AutoEncoders (AEs) that learn to create this new latent representation of the data.

In our approach, we utilize AE-based models to create a latent representation that distinctly separates normal samples from known attack samples into two tightly defined regions. This separation allows unknown attacks, which may share attributes with known attacks, to be positioned closer to the anomaly region. We introduce new regularized terms in the loss functions of AEs, incorporating data labels to compress normal and known attack data into two narrowly separated regions of the latent space. The resulting latent representation serves as input for binary classifiers, such as Support Vector Machines (SVM) and Linear Regression (LR), which ultimately produce a score assessing the abnormality of the input data sample.

This chapter presents new regularizers for the classical AE and the Denoising AE (DAE), resulting in three models: the Multi-distribution Variational AE (MVAE), the Multi-distribution AE (MAE), and the Multi-distribution DAE (MDAE). Unlike the regularized AEs discussed in [21], which learn to represent only the normal class within a small region at the origin in a semi-supervised setting, our proposed models offer a more versatile approach to learning effective latent representations.

The Multi-distribution Variational AutoEncoder (MVAE) is an extension of the Variational AutoEncoder (VAE) designed to learn the probability distributions that represent the input data. The approach integrates label information into the VAE's loss function, enabling the data to be represented by two Gaussian distributions with distinct mean values. For a given data sample \( x_i \) and its corresponding label \( y_i \), the centroid of the distribution is determined by \( y_i \). The MVAE loss function for \( x_i \) is computed as follows:

\[ \ell_{MVAE}(\theta, \phi; x_i, y_i) = -\frac{1}{K} \sum_{k=1}^{K} \log p_\theta(x_i \mid z_{i,k}) + D_{KL}\big(q_\phi(z_i \mid x_i, y_i) \,\|\, p(z_i \mid y_i)\big), \quad (2.1) \]

where \( z_{i,k} = g_\phi(\epsilon_{i,k}, x_i) \), \( g \) is a deterministic function, and \( \epsilon_k \sim \mathcal{N}(0, 1) \); \( K \) is the number of samples used to reparameterize \( x_i \), and \( y_i \) is the label of the sample \( x_i \).

The loss function of the MVAE comprises two key components. The first component, known as the reconstruction error (RE), represents the expected negative log-likelihood of reconstructing the original data at the output layer. The second component incorporates label information into the posterior distribution \( q_{\phi}(z_i \mid x_i) \) and the prior distribution \( p(z_i) \) of the VAE. This results in the Kullback-Leibler (KL) divergence between the approximate distribution \( q_{\phi}(z_i \mid x_i, y_i) \) and the conditional distribution \( p(z_i \mid y_i) \). The purpose of integrating label information is to ensure that samples from each class are clustered within their respective Gaussian distributions, conditioned on the label \( y_i \). The conditional distribution \( p(z_i \mid y_i) \) is modeled as a normal distribution with mean \( \mu_{y_i} \) and standard deviation 1.0, i.e., \( p(z_i \mid y_i) = \mathcal{N}(\mu_{y_i}, 1) \). Meanwhile, the posterior distribution \( q_{\phi}(z_i \mid x_i, y_i) \) is a multivariate Gaussian with a diagonal covariance structure, \( q_{\phi}(z_i \mid x_i, y_i) = \mathcal{N}(\mu_i, (\sigma_i)^2) \), where \( \mu_i \) and \( \sigma_i \) are derived from the sample \( x_i \). Consequently, the Multi-KL term can be reformulated accordingly.

Let \( D \), \( \mu_i^j \), and \( \sigma_i^j \) denote the dimension of \( z_i \) and the \( j \)-th elements of \( \mu_i \) and \( \sigma_i \), respectively, and let \( \mu_{y_i}^j \) be the \( j \)-th element of \( \mu_{y_i} \). Using the closed form of the KL divergence between two Gaussians, the Multi-KL term can be rewritten as:

\[ D_{KL}\big(q_\phi(z_i \mid x_i, y_i) \,\|\, p(z_i \mid y_i)\big) = \frac{1}{2} \sum_{j=1}^{D} \Big( (\sigma_i^j)^2 + (\mu_i^j - \mu_{y_i}^j)^2 - 1 - \log (\sigma_i^j)^2 \Big). \quad (2.3) \]

Taking the Multi-KL term in Eq. 2.3, the loss function of MVAE in Eq. 2.1 is finally rewritten as:

\[ \ell_{MVAE} = RE + \lambda \cdot \text{Multi-KL}, \quad (2.4) \]

where \( \lambda \) is a parameter controlling the trade-off between the two terms in the MVAE loss function. It is estimated by the ratio of the RE and Multi-KL terms, which makes reducing both loss components more efficient.
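A sketch of the Multi-KL term of Eq. (2.3) and the combined MVAE loss of Eq. (2.4) in TensorFlow; the tensor shapes and the way the class centroids \( \mu_{y_i} \) are supplied are assumptions, not the thesis's exact implementation.

```python
import tensorflow as tf

def multi_kl(mu, log_var, mu_y):
    """Multi-KL term, Eq. (2.3): KL between N(mu, sigma^2) and the class
    prior N(mu_y, 1), summed over the D latent dimensions.
    mu, log_var: encoder outputs of shape (batch, D);
    mu_y: the centroid mu_{y_i} of each sample's class, same shape."""
    var = tf.exp(log_var)
    kl = 0.5 * tf.reduce_sum(var + tf.square(mu - mu_y) - 1.0 - log_var, axis=1)
    return tf.reduce_mean(kl)

def mvae_loss(x, x_hat, mu, log_var, mu_y, lam):
    """MVAE loss, Eq. (2.4): reconstruction error plus weighted Multi-KL."""
    re = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_hat), axis=1))
    return re + lam * multi_kl(mu, log_var, mu_y)
```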

This chapter explores the impact of small covariance values (\(10^{-3}\), \(10^{-2}\), and \(10^{-1}\)) on the Gaussian distributions of the two classes, used to reduce the "tails" of these distributions. Initially, during the MVAE training process, the Multi-KL term is significantly larger than the RE term, hindering the model's ability to accurately reconstruct the input data. As training progresses, the Multi-KL term decreases, yet both terms still exhibit considerable fluctuations.

In our experiments, we set the mean values of the normal and attack classes to 4 and 12, respectively, ensuring that the two distributions are sufficiently distinct. These values were calibrated to optimize the performance of the MVAE model. In other words, we pre-determine the distribution centroid \( \mu_{y_i} \) for each class \( y_i \), and the trade-off parameter \( \lambda \) is derived from the ratio of the two loss terms in the loss function of our proposed models. The hyper-parameter \( \mu_{y_i} \) takes one of two values, corresponding to the normal and attack classes.

Figure 2.2: The probability distribution of the latent data (\( z_0 \)) of MAE at epoch 0, 40, and 80 in the training process.

This section outlines the integration of a multi-distribution regularizer, denoted \( \Omega(z) \), into an AutoEncoder (AE) to develop a Multi-distribution AutoEncoder (MAE). The regularizer \( \Omega(z) \) promotes the formation of a latent feature space where data classes are clustered in close proximity. By incorporating class labels into \( \Omega(z) \), the MAE ensures that samples from each class are tightly grouped around a specified central value. The formulation of this new regularizer is given in Equation 2.5:

\[ \Omega(z) = \|z - \mu_{y_i}\|^2, \quad (2.5) \]

where \( z \) is the latent data at the bottleneck layer of MAE and \( \mu_{y_i} \) is the centroid of class \( y_i \). The regularizer \( \Omega(z) \) maps the input data into its corresponding region, defined by \( \mu_{y_i} \), within the latent representation. The latent feature space thus consists of multiple distributions, one per class. Consequently, this chapter refers to the new regularized AE as the Multi-distribution AE.

In the MAE loss function, a parameter \( \lambda \) is utilized to balance the reconstruction error (RE) and the \( \Omega(z) \) term, as outlined in Sub-section 2.2.1. Consequently, the MAE loss function can be expressed as follows:

\[ \ell_{MAE}(W, b; x, y) = \frac{1}{n} \sum_{i=1}^{n} \Big( \|x_i - \hat{x}_i\|^2 + \lambda\, \|z_i - \mu_{y_i}\|^2 \Big), \quad (2.6) \]

where \( x_i \), \( z_i \), and \( \hat{x}_i \) are the \( i \)-th input sample, its corresponding latent data, and its reconstruction, respectively; \( y_i \) is the label of the sample \( x_i \) and \( \mu_{y_i} \) is the centroid of class \( y_i \); and \( n \) is the total number of training samples. The first term in Eq. (2.6) is the reconstruction error (RE), which quantifies the discrepancy between the input data and its reconstruction. The second term is the regularizer, which compresses the input data into distinct regions within the latent space.
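A sketch of the MAE loss of Eq. (2.6) in TensorFlow; the per-sample centroid lookup in the trailing comment is an illustrative assumption.

```python
import tensorflow as tf

def mae_loss(x, x_hat, z, mu_y, lam):
    """MAE loss, Eq. (2.6): reconstruction error plus the regularizer
    Omega(z) = ||z - mu_{y_i}||^2 pulling each latent vector toward the
    centroid of its class."""
    re = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_hat), axis=1))
    omega = tf.reduce_mean(tf.reduce_sum(tf.square(z - mu_y), axis=1))
    return re + lam * omega

# Hypothetical centroid lookup for a 1-D latent space using the values
# 4 (normal) and 12 (attack) mentioned in Sub-section 2.2.1:
# mu_y = tf.gather(tf.constant([[4.0], [12.0]]), y_labels)
```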

To illustrate the probability distribution of the latent representation of MAE, we compute the histogram of one feature of the latent data, denoted \( z_0 \). Figure 2.2 displays the probability distribution of \( z_0 \) for both the normal class and known attacks during the training of MAE on the IoT-1 dataset. Over the training epochs, the latent data becomes confined to two distinct regions within the latent representation of MAE.

This section delves into the Multi-distribution Denoising AutoEncoder (MDAE), which is built on the Denoising AutoEncoder (DAE) framework proposed in [23]. For each data sample \( x_i \), a corrupted version \( \tilde{x}_i \) is generated as described in Equation 1.10. MDAE is designed to reconstruct the original input \( x_i \) from its corrupted counterpart \( \tilde{x}_i \), while also ensuring that the corresponding latent vector \( z_i \) remains close to the class centroid \( \mu_{y_i} \). The loss function of MDAE can be presented in Eq. 2.7:

\[ \ell_{MDAE}(\theta, \phi; x, y) = \frac{1}{n} \sum_{i=1}^{n} \Big( \|x_i - p_\theta(q_\phi(\tilde{x}_i))\|^2 + \lambda\, \|q_\phi(\tilde{x}_i) - \mu_{y_i}\|^2 \Big). \quad (2.7) \]

Figure 2.3: Using the non-saturating area of the activation function to separate known and unknown attacks from normal data. (a) Saturating and non-saturating areas of the ReLU activation function. (b) The output of ReLU: two separated regions for normal data and known attacks; unknown attacks are hypothesized to appear in regions toward known attacks.

Using Proposed Models for Network Attack Detection

This section outlines the training and prediction processes when using the proposed latent representation learning models for NAD. The methodologies discussed apply to all of our models, i.e., MVAE, MAE, and MDAE. Therefore, we refer to all of the proposed models collectively as AE-based models in this subsection.

2.3.1. Training Process

The latent vectors are not much larger than the input and the output of MAE and MDAE, resulting in an easy training process.

Algorithm 1: Training an AE-based model.

INPUT: x, y: training data samples and corresponding labels; the AE-based model with its hyper-parameters; the classifier with its hyper-parameters.
OUTPUT: trained AE-based model, trained classifier.

1. Feed x, y to the input of the AE-based model.
2. Train to minimize the loss function of the AE-based model as described in Section 2.2.
3. Feed x to the trained AE-based model to obtain the latent representation z of x.
4. Train the classifier on the inputs z, y.
return trained AE-based model, trained classifier.

Algorithm 1 presents the training process when using an AE-based model for NAD. First, the AE-based model is trained on the original network attack dataset (x, y). This training is executed by an optimization method (e.g., Adam) to minimize the loss function of the AE-based model; this step trains a latent representation learning model based on the AE, as illustrated in Fig. 2.4 (a). Second, the latent representation z of the original data is obtained by fitting the original data to the trained AE-based model. Third, a classifier is trained on the latent representation z in a supervised manner, as described in Fig. 2.4 (b). Finally, we obtain the training results: the trained AE-based model and the trained classifier.

Algorithm 2: Predicting process based on representation learning models.

INPUT: x_i: testing data samples in the target domain.
OUTPUT: predicted label y_i.

1. Feed x_i to the trained AE-based model to obtain the corresponding latent representation z_i.
2. Feed z_i to the trained classifier to obtain the output y_i.
return y_i.

Algorithm 2 describes the sample prediction process for NAD using our proposed models. First, the latent representation z_i of an original data sample x_i is obtained by fitting it to the trained AE-based model. Second, the trained classifier predicts the label y_i from the input z_i, as illustrated in Fig. 2.4 (b). The classifier therefore identifies a network traffic sample based on its latent representation instead of its original representation. Both algorithms are sketched in code below.
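A compact sketch of Algorithms 1 and 2 under stated assumptions: a Keras AE-based model exposing its encoder, and a scikit-learn linear classifier. For the label-dependent losses of MAE and MVAE, the labels would also enter the AE training step; this sketch shows only the generic flow.

```python
from sklearn.svm import LinearSVC

def train(ae_model, encoder, clf, x, y, epochs=100):
    """Algorithm 1: train the AE-based model, extract latents, fit classifier."""
    ae_model.fit(x, x, epochs=epochs, batch_size=100, verbose=0)  # steps 1-2
    z = encoder.predict(x)                                        # step 3
    clf.fit(z, y)                                                 # step 4
    return ae_model, clf

def predict(encoder, clf, x_i):
    """Algorithm 2: map samples to the latent space, then classify there."""
    z_i = encoder.predict(x_i)
    return clf.predict(z_i)

# clf = LinearSVC()  # e.g., the linear SVM used in the experiments
```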

Experimental Settings

The experiments in this chapter are based on the IoT datasets presented in Chapter 1. This section presents the experimental sets and the hyper-parameter settings used in this chapter.

Four linear classification algorithms, Support Vector Machine (SVM), Perceptron (PCT), Nearest Centroid (NCT), and Linear Regression (LR), are applied to the latent representations generated by MVAE, MAE, and MDAE. These classifiers are chosen for their speed and their ability to highlight the strengths of the representation models, making them suitable for IoT networks with limited computing resources. The experiments were carried out on nine IoT datasets, with all techniques implemented in Python using the TensorFlow and Scikit-learn frameworks.

This study evaluates the effectiveness of the four classifiers trained on the latent representations generated by MVAE, MAE, and MDAE. Their performance is compared against standalone classifiers, such as the widely used Random Forest (RF) for NAD, as well as classifiers utilizing the latent representations of AE, DAE, VAE, CNN, and DBN. Four experiments were conducted to analyze the characteristics of the latent representations produced by MVAE, MAE, and MDAE.

• Ability to detect unknown attacks: Assess the effectiveness of the four classifiers trained on the latent representations of the proposed models in detecting unknown attacks, compared to classifiers utilizing AE, DAE, VAE, CNN, and DBN, as well as those operating on the original input data with RF.

• Cross-datasets evaluation: Investigate the influence of the various attack types used for training models on the accuracy of the classifiers in detecting unknown attacks.

– Influence of the noise factor: Measure the influence of the noise level on the latent representation of MDAE.

– Influence of the hyper-parameters of classifiers: Investigate the effects of hyper-parameters on the accuracy of the classifiers working on different latent representations.

• Complexity of AE-based models: Assess the complexity of AE-based models based on the training time and the number of parameters.

We divided the IoT datasets into training and testing sets according to the scenarios outlined in Section 2.5. To facilitate model selection, we randomly chose 10% of the training data to form validation sets.

Table 2.1: Hyper-parameters for AE-based models.

Hyper-parameter                      Value
The number of hidden layers          5
The size of the bottleneck layer     [1 + √n] [21]

The configuration of the AE-based models (AE, MAE, MDAE, and MVAE) includes a balancing parameter for the reconstruction error (RE), set at 1000 for MVAE. Common hyper-parameters for these models are summarized in Table 2.1. The number of hidden layers is five, and the bottleneck layer size \( m \) is determined by \( m = [1 + \sqrt{n}] \), where \( n \) is the number of input features. Because the dimensionality of network traffic data is relatively low, the number of layers is smaller than in other deep learning applications. The batch size is set to 100, with a learning rate of \( 10^{-4} \). Weights are initialized using the method proposed by Glorot et al. to enhance convergence, and the Adam optimization algorithm is employed for training. In these AEs, the Identity and Sigmoid activation functions are used in the bottleneck and output layers, respectively, while the remaining layers employ the ReLU activation function.

We use the validation sets to evaluate our proposed models at every training epoch. Training employs early stopping: it halts if the average AUC score of the four classifiers (SVM, PCT, NCT, and LR) on the validation set decreases consecutively over 20 epochs. The hyper-parameters for these classifiers are set to their default values as in [108]. The DBN-based model consists of three layers, as detailed in [77], and is implemented according to [109], with the number of neurons in each layer mirroring that of the AE-based models used in our experiments.

Results and Analysis

This section describes in detail the main experiments and the investigation of the proposed latent representation models. More importantly, we try to explain the experimental results.

Table 2.2 presents the AUC scores of the four classifiers, SVM, PCT, NCT, and LR, evaluated on standalone models (STA) and on the representations of various deep learning models, including DBN, CNN, AE, VAE, and DAE, across the nine IoT datasets. The top three AUC scores for each classifier are highlighted, with the highest score marked in darker gray. Notably, RF is selected to compare the performance of STA against a non-linear classifier, and deep learning representations against linear classifiers.

        IoT-1  IoT-2  IoT-3  IoT-4  IoT-5  IoT-6  IoT-7  IoT-8  IoT-9

SVM
AE      0.845  0.899  0.548  0.959  0.977  0.766  0.976  0.820  0.997
VAE     0.500  0.500  0.500  0.500  0.500  0.500  0.500  0.500  0.500
DAE     0.849  0.990  0.569  0.968  0.980  0.803  0.982  0.818  0.996
MVAE    0.914  0.948  0.978  0.985  0.932  0.950  0.998  0.826  0.858
MAE     0.999  0.997  0.999  0.987  0.982  0.999  0.999  0.846  0.842
MDAE    0.999  0.998  0.999  0.992  0.982  0.999  0.999  0.892  0.902

PCT
AE      0.849  0.892  0.498  0.965  0.977  0.813  0.977  0.814  0.815
VAE     0.503  0.501  0.499  0.501  0.507  0.497  0.500  0.500  0.499
DAE     0.882  0.903  0.534  0.969  0.982  0.862  0.984  0.857  0.849
MVAE    0.954  0.947  0.972  0.986  0.923  0.923  0.997  0.823  0.849
MAE     0.996  0.996  0.999  0.998  0.989  0.999  0.999  0.833  0.991
MDAE    0.996  0.997  0.999  0.998  0.989  0.999  0.999  0.889  0.991

NCT
AE      0.985  0.767  0.498  0.834  0.835  0.997  0.945  0.746  0.767
VAE     0.501  0.506  0.511  0.487  0.499  0.505  0.500  0.488  0.479
DAE     0.989  0.770  0.580  0.882  0.863  0.997  0.966  0.806  0.788
MVAE    0.846  0.939  0.973  0.984  0.927  0.937  0.998  0.822  0.796
MAE     0.998  0.996  0.999  0.987  0.982  0.999  0.999  0.828  0.799
MDAE    0.996  0.998  0.998  0.992  0.985  0.999  0.999  0.887  0.889

LR
AE      0.850  0.894  0.498  0.958  0.987  0.743  0.996  0.795  0.998
VAE     0.500  0.499  0.500  0.500  0.500  0.500  0.500  0.500  0.500
DAE     0.871  0.902  0.587  0.966  0.982  0.801  0.996  0.810  0.988
MVAE    0.921  0.989  0.981  0.985  0.933  0.955  0.999  0.828  0.858
MAE     0.999  0.997  0.999  0.988  0.984  0.999  0.999  0.835  0.840
MDAE    0.996  0.998  0.998  0.992  0.985  0.999  0.999  0.887  0.889

2.5.1 Ability to Detect Unknown Attacks

This section presents the main experimental results of our chapter.

We assess the ability of our proposed models to help the four classifiers detect unknown attacks when trained on the latent representations. Each of the nine IoT datasets comprises five to ten specific types of botnet attacks. For each dataset, we randomly select two types of IoT attacks and 70% of the normal traffic for training, while the remaining IoT attacks and normal data are reserved for evaluation. As detailed in Table 1.3, we train exclusively on two types of DDoS attacks, ensuring that the evaluation includes unknown attacks not encountered during training. The performance of the four classifiers is compared against those utilizing the original input space and the latent feature spaces of AE, DAE, VAE, CNN, and DBN. Additionally, we contrast the results of all linear classifiers with a non-linear classifier, Random Forest (RF), trained on the original features. The primary experimental results, reported as AUC scores, are summarized in Table 2.2.

Table 2.2 reveals that the classifiers struggle to detect unseen IoT attacks when using the VAE representation, with AUC scores around 0.5. This limitation arises because the VAE model focuses on generating data samples from a normal distribution rather than on producing a representation suited to classification. Additionally, the performance of the four classifiers on the IoT-9 dataset varies considerably compared with the other datasets. Notably, while LR and SVM excel with the latent representations of AE and DAE on IoT-9, PCT and NCT do not perform as well. Conversely, LR and SVM underperform PCT and NCT when using the latent representations from our proposed models.

It can be seen from Table 2.2 that the latent representations resulting from MVAE, MAE, and MDAE help the four classifiers achieve higher classification accuracy than the other representations, which we attribute to their regularizers.

Classification accuracy, measured by AUC, improves significantly when the classifiers use the latent representations instead of the original data. For instance, the AUC scores of SVM, PCT, NCT, and LR on the IoT-1 dataset increase from 0.839, 0.768, 0.743, and 0.862 to 0.999, 0.996, 0.998, and 0.999, respectively, with the MAE representation. Similar improvements are observed with MDAE and MVAE. Our proposed models also enable the linear classifiers to achieve higher AUC scores than those using the latent representations of AE and DBN. Notably, PCT paired with the latent representations of MVAE, MAE, and MDAE improves accuracy on all IoT datasets, including IoT-9. Additionally, classifiers trained on the latent representations of MAE and MDAE yield more consistent results than those trained on MVAE's.

A comparison of the linear classifiers with a non-linear classifier, Random Forest (RF), shows that RF significantly outperforms the linear classifiers when trained on the original features, indicating that the datasets are not linearly separable. However, when the linear classifiers are trained on the latent representations from MVAE, MAE, and MDAE, their accuracy improves dramatically and often surpasses that of RF, except on IoT-8, where RF remains superior. This demonstrates that the proposed models effectively transform data that is not linearly separable in the original space into linearly separable data in the latent space.
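The effect can be reproduced on a toy problem; the sketch below (synthetic data, not thesis data) shows RF beating a linear model on a non-linearly-separable set, and a simple feature map that linearizes the data closing the gap, mimicking what the learned latent space does for the IoT traffic:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=2000, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, clf in [("LR ", LogisticRegression()),
                  ("RF ", RandomForestClassifier(random_state=0))]:
    clf.fit(X_tr, y_tr)
    print(name, roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

# Adding a squared-radius feature makes the classes linearly separable,
# playing the role of the learned latent representation here.
radius = lambda X: np.hstack([X, (X ** 2).sum(1, keepdims=True)])
lr = LogisticRegression().fit(radius(X_tr), y_tr)
print("LR+", roc_auc_score(y_te, lr.predict_proba(radius(X_te))[:, 1]))
```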

We conducted a further experiment to illustrate how the proposed models help conventional classifiers detect unknown attacks efficiently. In this experiment, we trained the AE and the Multi-distribution AutoEncoder (MAE) on normal data and TCP attack data, with the size of the hidden layer set to 2 for visualization. After training, we evaluated both models on a test set that includes normal samples, known TCP attacks, and unknown UDP attacks. Fig. 2.5 displays 1000 random samples from the training and testing data in the hidden space of AE and MAE; a sketch of how such a visualization can be produced follows.
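A minimal sketch of the visualization step, assuming a trained encoder object with a 2-neuron bottleneck exposing a Keras-style predict() method; the variables (mae_encoder, X_normal, X_tcp, X_udp) are hypothetical placeholders:

```python
import matplotlib.pyplot as plt

def plot_latent(encoder, groups, title):
    """Scatter-plot 2-D latent codes; `groups` maps a label name to an
    (n_samples, n_features) array of traffic records."""
    for name, X in groups.items():
        z = encoder.predict(X)                  # latent codes, shape (n, 2)
        plt.scatter(z[:, 0], z[:, 1], s=5, label=name)
    plt.title(title)
    plt.legend()
    plt.show()

# Hypothetical usage with a trained MAE encoder and held-out traffic:
# plot_latent(mae_encoder,
#             {"Normal": X_normal, "TCP flood attack": X_tcp,
#              "Unknown attack": X_udp},
#             "MAE representation of testing samples")
```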

Fig. 2.5 shows that the latent representations help distinguish normal, known-attack, and unknown-attack samples, which explains the high classifier performance reported in Table 2.2. The MAE compresses the normal and known attack samples into two compact areas in both the training and testing data, whereas AE spreads these samples more widely. Notably, the unknown attack samples in the MAE testing data are mapped close to the known attacks, making them easy to separate from the normal samples. Conversely, the unknown attacks in AE lie near the normal data and are therefore difficult to separate. This demonstrates that the MAE model successfully constrains normal and known attack data into compact regions of the hidden space, allowing both known and unknown attacks to be identified by simple classifiers operating on MAE's latent features.

The Gafgyt botnet family is a lightweight variant of the Internet Relay Chat (IRC) bot model, primarily executing traditional DDoS attacks such as SYN, UDP, and ACK flooding. In contrast, the Mirai botnet is considered more hazardous: it can exploit a wider range of device architectures and launch a diverse array of DDoS attacks over multiple protocols, including TCP, UDP, and HTTP.

Botnet families such as Gafgyt and Mirai can each launch multiple kinds of DDoS attacks. Each family generates its own pattern of network traffic from the infected devices (bots), leading to distinct feature values.

[Figure 2.5: Latent representation resulting from the AE model (a: training samples, b: testing samples) and the MAE model (c: training samples, d: testing samples). Legend: Normal, TCP flood attack, Unknown attack.]

2.5.2 Cross-datasets Evaluation

This experiment investigates the stability of the latent representations generated by MVAE, MAE, and MDAE when the models are trained on one botnet family and evaluated on another. We explore two scenarios: training on Gafgyt and testing on Mirai, and the reverse. In both scenarios, the testing attack family is unseen during training. The NCT classifier is employed to assess our models; the results for NCT trained on IoT-2 are presented in Table 2.3. The second row of the table reports models trained on Gafgyt and evaluated on Mirai, while the third row reports the reverse scenario. We exclude CNN and VAE from this experiment due to their ineffectiveness in detecting the attacks in Table 2.2.

Table 2.3: AUC score of the NCT classifier on the IoT-2 dataset in the cross-datasets experiment.

Train/Test botnets STA DBN AE MVAE MAE MDAE

The table indicates that the NCT classifier struggles to detect unknown botnet attacks when the training and testing data originate from different botnet families. Both the standalone NCT and the NCT using the AE and DBN representations perform poorly in this setting. This limitation arises because AE and DBN require a substantial amount of data to capture the useful information in the input effectively.

Since the training and testing attacks originate from distinct botnet families, the trained AE and DBN may fail to represent the unseen attacks effectively, leading to the suboptimal performance of the NCT classifier shown in the first three rows of Table 2.3. Conversely, the latent representations of MVAE, MAE, and MDAE are explicitly constrained to designated regions of the latent space, which makes them more robust to such discrepancies.

Due to space constraints, we present the NCT classifier results solely for the IoT-2 dataset, while the performance of other classifiers on the remaining datasets is consistent with the findings discussed in this section.

Moreover, the proposed representations substantially enhance NCT's ability to identify unknown IoT attacks in this cross-dataset setting: the AUC scores for predicting the Mirai botnet rise from 0.747 with the original data to 0.943, 0.974, and 0.988 with the MVAE, MAE, and MDAE representations, respectively. These findings underscore the effectiveness of our representation learning models in helping simple classifiers detect unknown IoT attacks.

2.5.3 Influence of Parameters

This section examines how key parameters affect the performance of the proposed models, focusing on the noise factor in MDAE and the hyper-parameters of the SVM and NCT classifiers.

2.5.3.1 Influence of the Noise Factor

This experiment examines the impact of the noise factor on the MDAE's performance, using the Gaussian noise function in Eq. 1.10. As illustrated in Fig. 2.6, a noise standard deviation of σ_noise = 0.01 yields the highest AUC scores while minimizing FAR and MDR across SVM, PCT, NCT, and LR on the IoT-1 dataset.

[Figure: (a) SVM. Legend: cosine, euclidean, manhattan, mahalanobis, chebyshev.]
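For reference, a minimal sketch of the corruption step, assuming Eq. 1.10 adds zero-mean Gaussian noise to the inputs before reconstruction (the standard denoising-AE setup); σ_noise is the factor studied above:

```python
import numpy as np

def corrupt(x, sigma_noise=0.01):
    """Return the noisy copy of x that the denoising AE must reconstruct."""
    rng = np.random.default_rng()
    return x + rng.normal(0.0, sigma_noise, size=x.shape)
```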

2.6 Conclusion

In this chapter, we have designed three novel AE-based models that learn a new latent representation to enhance the accuracy of NAD.

Our research introduces the first regularized versions of AEs designed for supervised learning of latent representations. In our models, normal data and known attacks are projected into two distinct, closely separated regions of the latent feature space. We achieved this by incorporating new regularization terms into three AE variants, yielding three regularized models: MVAE, MAE, and MDAE. These models are trained on normal data together with known IoT attacks, and the bottleneck layer of the trained AEs serves as a new feature space for linear classifiers. The sketch below makes the regularization idea concrete.
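A minimal PyTorch sketch (our own illustration, not the thesis code) in which the usual reconstruction loss is augmented with a term pulling each latent code toward one of two class centroids, mu0 for normal traffic and mu1 for known attacks; the centroid values and the weight lam are illustrative assumptions rather than the exact form of Eq. 2.1:

```python
import torch
import torch.nn as nn

class RegularizedAE(nn.Module):
    """AE whose bottleneck is regularized toward class-specific centroids."""
    def __init__(self, n_in, n_latent=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, 16), nn.ReLU(),
                                 nn.Linear(16, n_latent))
        self.dec = nn.Sequential(nn.Linear(n_latent, 16), nn.ReLU(),
                                 nn.Linear(16, n_in))

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

def mae_style_loss(x, y, model, mu0, mu1, lam=1.0):
    """Reconstruction loss plus a pull toward mu0 (normal) or mu1 (attack)."""
    z, x_hat = model(x)
    recon = ((x - x_hat) ** 2).mean()
    target = torch.where(y.unsqueeze(1) == 0, mu0, mu1)  # per-sample centroid
    return recon + lam * ((z - target) ** 2).mean()

# Illustrative centroids, e.g. two nearby points in a 2-D latent space:
# mu0 = torch.tensor([-1.0, 0.0]); mu1 = torch.tensor([1.0, 0.0])
```

With this penalty, the encoder is pushed to map each class around its own centroid, which is what makes the bottleneck features nearly linearly separable for the downstream classifiers.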

Our extensive experiments show that the proposed models transform normal and attack data that are not linearly separable in the original space into linearly separable, well-isolated representations in the latent feature space. Notably, unknown attacks cluster near known attacks in this space, making the classification task simpler than in the original feature space. Linear classifiers using our latent features significantly outperform those using the original features or the features generated by AE and DBN across the nine IoT attack datasets. Additionally, the new data representation yields consistent classifier performance across different training datasets and hyper-parameter settings.

Future work can extend this research in several directions. Firstly, the proposed models are limited to two-class classification problems; it would be worthwhile to explore their application in multi-class scenarios. Secondly, the distribution centroids in Eq. 2.1 are currently determined by trial and error, so an automated method for selecting suitable values for each dataset is desirable. Lastly, while the regularized AE models have been evaluated on several IoT datasets, their performance should be assessed on a broader spectrum of problems.

DEEP GENERATIVE LEARNING MODELS FOR NETWORK ATTACK DETECTION

In Chapter 2, we introduced a representation learning method that effectively distinguishes normal from abnormal traffic, enhancing the accuracy of machine learning in detecting network attacks, particularly new or unknown ones. However, this method relies on the availability of sufficient labeled data for both traffic types, which is often not feasible in real-world scenarios. Collecting attack traffic is typically much harder than collecting normal traffic, which leads to imbalanced datasets. Consequently, predictive models built with traditional machine learning algorithms may become biased and yield inaccurate results when applied to such skewed datasets.

This chapter introduces an approach to building robust Network Attack Detection (NAD) systems with deep neural networks. We present deep generative models designed to synthesize malicious samples for network systems. First, we employ a hybrid model combining the Auxiliary Classifier Generative Adversarial Network (ACGAN) and the Support Vector Machine (SVM) to generate malicious samples of targeted classes, using the SVM to select borderline cases. Second, we propose the Conditional Denoising Adversarial AutoEncoder (CDAAE), a more effective alternative to the ACGAN-based model for creating specific malicious samples. Finally, we introduce a hybrid model that integrates CDAAE with the K-nearest Neighbor algorithm (CDAAE-KNN) to produce samples that further improve the accuracy of NAD systems. The synthesized samples are then merged with the original training data to train the detection models, as sketched below.
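As a rough illustration of the sampling-plus-selection idea (not the thesis implementation), the sketch below draws latent codes, decodes them with a hypothetical conditional generator standing in for the CDAAE decoder, and keeps only synthetic samples whose k nearest real neighbours mix both classes, i.e. samples near the class border; the border thresholds are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def synthesize_borderline(decoder, attack_label, X_real, y_real,
                          n_samples=500, latent_dim=8, k=5):
    """decoder(z, label) is a hypothetical trained conditional generator;
    y_real holds 0/1 labels (normal/attack) for the real samples."""
    z = np.random.default_rng(0).normal(size=(n_samples, latent_dim))
    X_syn = decoder(z, attack_label)          # hypothetical conditional decode
    nn_idx = NearestNeighbors(n_neighbors=k).fit(X_real).kneighbors(
        X_syn, return_distance=False)
    # keep samples whose k nearest real neighbours mix both classes,
    # i.e. samples lying near the decision border
    frac_attack = y_real[nn_idx].mean(axis=1)
    border = (frac_attack > 0.2) & (frac_attack < 0.8)
    return X_syn[border]
```

The filtered samples would then be appended to the training set before fitting the detection classifier.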
