A DEEP LEARNING MODEL THAT DETECTS DOMAINS GENERATED BY ALGORITHMS IN BOTNETS
Nguyen Trung Hieu (*) , Cao Chinh Nghia
Faculty of Mathematics - Informatics and application of science and technology in crime
prevention, The People's Police Academy
Abstract: Domain Generation Algorithms (DGAs) are the group of algorithms that generate domain names for attack activities in botnets. In this paper, we present a Bi-LSTM deep learning model based on the Attention mechanism to detect DGA-generated domains. In our experiments, the model gives good results in detecting DGA-generated domains belonging to the Post and Monerodownloader families. In general, the F1 measure of the model in the multi-class classification problem reaches 90%; the macro average (macro avg) is 86% and the weighted average (weighted avg) is 91%.
Keywords: Bi-LSTM deep learning network; deep learning; malicious URL detection;
Attention mechanism in deep learning
Received 1 June 2022
Revised and accepted for publication 26 July 2022
(*) Email: hieunt.dcn@gmail.com
1 INTRODUCTION
Botnet Attacks
The development of the Internet has brought many benefits to users, but it has also become an environment for cybercriminals to operate.
A botnet attack is one of the most common attacks. Each member of the botnet is called a bot. A bot is malicious software created by attackers to control infected computers remotely through a command and control server (C&C server). The bot has a high degree of autonomy and is equipped with the ability to use communication channels to receive commands and update malicious code from the control system. Botnets are commonly used to distribute malware, send spam, steal sensitive information, conduct phishing, or launch large-scale cyberattacks such as distributed denial of service (DDoS) attacks [1].
The widespread distribution of bots and the communication between bots and control servers usually require the Internet. The bots need to know the IP address of the control server to access it and receive commands. To avoid detection, command and control servers do not register static domain names; instead, they continuously change addresses and use different domains at different intervals. Attackers use a Domain Generation Algorithm (DGA) to generate different domain names for attacks [2], aiming to mask these command and control servers.
Identifying the attack behind a malicious domain can effectively reveal the purpose of the attack and the tools and malware used, enabling preventive measures that greatly reduce the damage the attack causes.
Domain Generation Algorithm
The Domain Generation Algorithm (DGA) can use operators in combination with ever-changing variables to generate random domain names. The variables can be day, month, and year values, hours, minutes, seconds, or other keywords. These pseudo-random strings are concatenated with a top-level domain (.com, .vn, .net, ...) to generate the domain names. The algorithm of the Chinad malware, written in Python [3], shows that the input seed includes the letters a-z and the digits 0-9 combined with day, month, and year values. The results are combined with the TLDs ('.com', '.org', '.net', '.biz', '.info', '.ru', '.cn') to form the complete domain names.
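To make the mechanism concrete, the following is a minimal, illustrative Python sketch of a Chinad-style DGA. It is not the actual malware code: the MD5 mixing function, the counter in the seed, and the 16-character label length are assumptions; it only shows how a date-based seed and the a-z0-9 alphabet combine with a TLD list to produce pseudo-random domains.

```python
import hashlib
from datetime import date

# Illustrative sketch of a Chinad-style DGA (not the actual malware code).
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
TLDS = ['.com', '.org', '.net', '.biz', '.info', '.ru', '.cn']

def generate_domains(day: date, count: int = 1000, length: int = 16):
    domains = []
    for i in range(count):
        # The seed combines the day, month, and year values with a counter.
        seed = f"{day.day}-{day.month}-{day.year}-{i}".encode()
        digest = hashlib.md5(seed).digest()
        # Map each digest byte onto the 36-symbol alphabet.
        label = "".join(ALPHABET[b % len(ALPHABET)] for b in digest[:length])
        domains.append(label + TLDS[i % len(TLDS)])
    return domains

print(generate_domains(date(2022, 6, 1), count=3))
```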
Table 1 Some DGA samples
Conficker: gfedo.info, ydqtkptuwsa.org, bnnkqwzmy.biz
Bigviktor: support.showremote-conclusion.fans, turntruebreakfast.futbol, speakoriginalworld.one
Cryptolocker: nvjwoofansjbh.ru, qgrkvevybtvckik.org, eqmbcmgemghxbcj.co.uk
Bamital: cd8f66549913a78c5a8004c82bcf6b01.info, a024603b0defd57ebfef34befde16370.org, 5e6efdd674c134ddb2a7a2e3c603cc14.org
Chinad: qowhi81jvoid4j0m.biz, 29cqdf6obnq462yv.com, 5qip6brukxyf9lhk.ru
A DGA can generate a large number of domains in a short time, and bots can select a small portion of them to connect to the C&C server. Table 1 shows some examples of domain names generated with DGAs [4]. The Chinad malware can generate 1,000 domain names per day using the letters a-z and digits 0-9. Bigviktor combines 3 to 4 different words from 4 predefined word lists (dictionaries) and can generate 1,000 domains per month.
Figure 1 depicts the connection process between the C&C server and the DGA domains [5]. The attacker uses the same DGA and initial seeds for the C&C server and the bots, so both generate the same set of domains. The attacker only needs to select a domain name from the generated list and register it for the C&C server one hour before performing the attack. The bots on the victims' machines then send domain name resolution requests for the generated list of domains to the Domain Name System (DNS). The DNS system returns the IP address of the corresponding C&C server, and the bots begin to communicate with the server to receive commands. If the C&C server is not found at the current domain, the bots query the next set of domains generated by the DGA until an active domain name is found [6].
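For illustration, the bot-side lookup loop just described can be sketched in a few lines of Python; the function name and return convention are hypothetical, and real bots add scheduling, fallbacks, and protocol handshakes on top of this bare DNS probing.

```python
import socket

def find_c2_server(candidate_domains):
    """Try each DGA-generated domain in turn; the first one that
    resolves in DNS is taken as the active C&C server."""
    for domain in candidate_domains:
        try:
            ip = socket.gethostbyname(domain)  # DNS resolution request
            return domain, ip                  # active domain found
        except socket.gaierror:
            continue                           # NXDOMAIN: query the next domain
    return None                                # wait for the next DGA batch
```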
2 MAIN CONTRIBUTIONS OF THE ARTICLE
The main contributions of the paper include:
1 - Introducing a deep learning approach using a Bidirectional Long Short-Term Memory (Bi-LSTM) model based on the Attention mechanism to detect domains created by DGAs. Our model has also worked well on the related problem of detecting malicious URLs [7].
2 - Presenting experimental results that show a significant improvement compared to previous techniques using open datasets.
The remainder of the paper is organized as follows: Section 2 presents related studies. Our deep learning network architecture and solution are presented in Section 3. Section 4 presents our experimental process, including the steps to select the dataset and the results obtained. Finally, Section 5 concludes with comments on the results achieved as well as future directions for the work.
Figure 1 DGA-based botnet communication mechanism
2.1 Related studies
In recent years, much research on botnet detection has been published. Nguyen Van Can and colleagues [8] proposed a model to classify benign domains and DGA domains based on Neutrosophic Sets. Testing on three datasets from Alexa, Bambenek Consulting [9], and 360Lab [4] shows that the model achieves an accuracy of 81.09%.
R. Vinayakumar et al. [10] proposed a DGA detection method based on analyzing the statistical features of DNS queries. Feature vectors are extracted from domain names by a text representation method, and optimal features are computed from the numerical vectors using the deep learning architecture in Table 2. The results show that the model has a high accuracy of 97.8%.
Yanchen Qiao et al. [2] proposed a method to detect DGA domain names based on an LSTM using the Attention mechanism. Their model, executed on the dataset from Bambenek Consulting [9], achieves an accuracy of 95.14%, an overall precision of 95.05%, a recall of 95.14%, and an F1 score of 95.48%.
Duc Tran [11] built LSTM.MI, a model that combines a binary classifier and a multiclass classifier on an unbalanced dataset, in which the original LSTM model is given a cost-sensitive adaptation mechanism. Cost items are included in the backpropagation learning process to account for the importance of distinguishing between classes. They demonstrate that LSTM.MI provides at least a 7% improvement in accuracy and macro-averaged recall over the original LSTM and other modern cost-sensitive methods, while maintaining high accuracy on non-DGA labels (F1 score of 0.9849).
2.2 Proposed model
Our proposed model includes an input layer, an embedding layer, two Bi-LSTM layers, an attention layer, and an output layer. The architecture of the model is shown in Figure 2 [7].
Table 2 DBD deep architecture [10]
The detection module takes as input a dataset of $T$ domains with the structure $\{(u_1, y_1), \dots, (u_T, y_T)\}$, where $x_t$ is a pair $(u_t, y_t)$ in which $u_t$ (with $t = 1, \dots, T$) is a domain in the training list and $y_t$ is its associated label.
Each domain, in its raw form, is processed in two steps before training to form the input vector (a minimal preprocessing sketch follows the steps below):
- Step 1: Cut off the TLD part of the domain name, then tokenize the raw data, converting the characters of the remaining string into integer-encoded data using Keras's Tokenizer library;
- Step 2: Normalize the encoded data from Step 1 to the same length. In this way, we convert the original domain strings into input vectors $V = \{v_1, v_2, v_3, \dots, v_T\}$. Each vector has a fixed length; any vector that is too short is padded with the value 0 to reach the required length.
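A minimal sketch of these two steps with the Keras preprocessing utilities, assuming character-level tokenization and a padded length of 38 (the sequence length suggested by the layer shapes in Table 5); the example domains are hypothetical placeholders:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical domains with the TLD already cut off (Step 1).
domains = ["qowhi81jvoid4j0m", "google"]

# Step 1 (continued): map each character to an integer index.
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(domains)
encoded = tokenizer.texts_to_sequences(domains)

# Step 2: pad every sequence with zeros to the same fixed length.
X = pad_sequences(encoded, maxlen=38, padding="post")
print(X.shape)  # (2, 38)
```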
Next, we use a bidirectional LSTM (Bi-LSTM) network to model the URL sequences based on this vector representation. The Bi-LSTM architecture has two hidden layers from two separate LSTMs, which capture long-range dependencies in two different directions. Since the output of the embedding layer is $V = \{v_1, v_2, v_3, \dots, v_T\}$, the forward LSTM reads the input from $v_1$ to $v_T$ and the backward LSTM reads the input from $v_T$ to $v_1$, producing a pair of hidden states $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$. We obtain the output of the Bi-LSTM layer by concatenating the two hidden states according to the formula:

$$h_i = [\overrightarrow{h_i}, \overleftarrow{h_i}]^T \quad (1)$$
The model uses two Bi-LSTM layers and the experimental dataset is quite large. Therefore, a Batch Normalization layer is used to normalize the data in each batch toward a normal distribution, stabilizing the learning process and greatly reducing the number of epochs needed to train the network, thereby increasing training speed.
As described in this paper, the hidden states at all positions are weighted with different Attention weights. We apply the Attention mechanism to capture the relationship between $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$. This information is aggregated with the features from the output of the second Bi-LSTM layer, which helps the model focus only on the important features instead of confounding or less valuable information.
Figure 2 Bi-LSTM network architecture
Initially, the weights $u_t$ are calculated based on the correlation between the input and output according to the following formula:

$$u_t = \tanh(W h_t + b) \quad (2)$$
These weights are then normalized into the Attention weight vector $\alpha_t$ using the softmax function:

$$\alpha_t = \frac{\exp(u_t)}{\sum_{k} \exp(u_k)} \quad (3)$$
Then the vector $c_t$ is calculated based on the Attention weight vector and the hidden states $h_1 \dots h_T$ as follows:

$$c_t = \sum_{t} \alpha_t h_t \quad (4)$$

The larger the value of $c_t$, the more important the role the feature $x_t$ plays in detecting the DGA domain.
Finally, to predict a domain, the calculation results are passed through a Dense layer with one neuron using the sigmoid activation function, which returns a value between 0 and 1. The resulting $y$ helps determine whether a domain is benign or DGA-generated. Thus, the input domain name is normalized into vector form, and this vector passes through the Embedding, Bi-LSTM, Batch Normalization, Bi-LSTM, and Attention layers before the output is produced. In addition, the model uses the Adam optimization algorithm with Keras's default parameters. To prevent the model from overfitting the data, we also apply the Dropout technique to the Bi-LSTM layers. The mechanism of Dropout is that, during training, each time the weights are updated we randomly remove a number of nodes in a layer so that the model cannot depend on any single node of the previous layer and instead tends to spread the weights evenly.
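Putting the pieces together, the following is a minimal Keras sketch of the described architecture. The vocabulary size, embedding dimension, LSTM width (64 units per direction, giving the (None, 38, 128) shapes of Table 5), and dropout rate are assumptions, and the attention layer is a simple additive attention-with-context implementation rather than the authors' exact code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

class AttentionWithContext(layers.Layer):
    """Additive attention: score each timestep (cf. formula 2), softmax
    the scores (formula 3), return the weighted sum of hidden states
    (formula 4)."""
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(dim, dim),
                                 initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(dim,), initializer="zeros")
        self.u = self.add_weight(name="u", shape=(dim,),
                                 initializer="glorot_uniform")
        super().build(input_shape)

    def call(self, h):                                  # h: (batch, T, dim)
        u_t = tf.tanh(tf.tensordot(h, self.W, axes=1) + self.b)
        scores = tf.tensordot(u_t, self.u, axes=1)      # (batch, T)
        alpha = tf.nn.softmax(scores, axis=1)           # attention weights
        return tf.reduce_sum(h * tf.expand_dims(alpha, -1), axis=1)

def build_model(vocab_size=40, max_len=38, embed_dim=128, lstm_units=64):
    model = models.Sequential([
        layers.Embedding(vocab_size, embed_dim),
        layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True,
                                         dropout=0.2)),  # Dropout on Bi-LSTM
        layers.BatchNormalization(),
        layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True,
                                         dropout=0.2)),
        AttentionWithContext(),
        layers.Dense(1, activation="sigmoid"),           # benign vs. DGA
    ])
    model.build(input_shape=(None, max_len))
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```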
2.3 Experiment
In this paper, we conduct two experiments:
1 - Experimentally check the accuracy of the model in two-class classification: domains generated by a DGA versus normal domains;
2 - Experimentally check the accuracy of the model in multi-class classification: detecting different DGA algorithms in a given dataset.
2.4 Evaluation Dataset
In this paper, we use a dataset consisting of DGA domains collected from Bambenek Consulting [9] and normal domains obtained from Alexa. For the two different tests, we use two different datasets.
Table 2 Summary of the collected dataset (columns: Domain Type, Sample)
Dataset for test 1: consists of 30,000 DGA domains with label 1 and 30,000 normal domains with label 0. This dataset is randomly shuffled, then divided into a training set and a test set. It contains 46 different types of DGA domain names, with the counts given in Table 2. The character-length distribution parameters of each type of domain name are given in Table 3: the smallest sample length is 6, the maximum length is 25, and the average length is 14.2 for DGA domain names and 9.6 for normal domain names.
Table 4 Label assignment for DGA types
Dataset for test 2: With the goal of testing multi-class classification, the DGA domain types used include the families Post, Kraken, Monerodownloader, Murofet, Necurs, Shiotob/urlzone/bebloh, Qakbot, Ramnit, Ranbyus, and Tinba, labeled according to Table 4. The dataset for test 2 includes 25,000 normal domain names and 25,000 domain names belonging to the DGA families.
3 PERFORMANCE METRICS
The performance of the algorithms is evaluated using the confusion matrix, in which:
• True negatives (TN) - benign sites that are predicted to be benign;
• True positives (TP) - malicious sites that are predicted to be malicious;
• False negatives (FN) - malicious sites that are predicted to be benign;
• False positives (FP) - benign sites that are predicted to be malicious.
From there we have the following measures.

Accuracy:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (5)$$
The article also uses precision, recall, and the F-measure, which are given by the following formulas:
$$Precision = \frac{TP}{TP + FP} \quad (6)$$

$$Recall = \frac{TP}{TP + FN} \quad (7)$$

$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \quad (8)$$
A high Precision value means that the accuracy of the points found is high. A high Recall means a high TP rate, i.e., the rate of missing truly positive points is low. The higher the F1, the better the classifier. In addition, we also use the binary cross-entropy (BCE) loss function to calculate the difference between two quantities: $\hat{y}$, the predicted label of a URL, and $y$, the correct label of that URL. The loss function forces the model to pay a penalty for each wrong prediction, with the penalty proportional to the severity of the error. The smaller the loss value, the better the model's predictions; conversely, if the predictions differ too much from reality, the loss value becomes larger.

$$BCE = -(y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})) \quad (9)$$
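These per-class and averaged measures can be computed, for example, with scikit-learn; the labels and probabilities below are hypothetical placeholders for the model's outputs:

```python
from sklearn.metrics import classification_report, log_loss

y_true = [0, 1, 1, 0, 1]                   # correct labels (1 = DGA)
y_prob = [0.1, 0.9, 0.4, 0.2, 0.8]         # sigmoid outputs of the model
y_pred = [int(p >= 0.5) for p in y_prob]   # threshold at 0.5

# Precision, recall, and F1 per class, plus the macro avg and
# weighted avg rows reported in the experiments.
print(classification_report(y_true, y_pred, digits=2))

# Binary cross-entropy (formula 9) averaged over the samples.
print("BCE:", log_loss(y_true, y_prob))
```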
Table 5 Parameters of the model in experiment no. 1

Layer | Output shape
bidirectional | (None, 38, 128)
batch_normalization | (None, 38, 128)
bidirectional_1 | (None, 38, 128)
attention_with_context | (None, 38, 128)
3.1 Experimental results
The model is built on the basic configuration of the Kaggle platform with a Keras kernel and TensorFlow backend. It uses ModelCheckpoint to save the training process and EarlyStopping to stop training immediately once the best value is found.
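A sketch of the training call with these two callbacks; the file name, patience, batch size, and epoch count are assumptions, and model, X_train, and y_train come from the earlier sketches:

```python
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

callbacks = [
    # Save the weights each time the validation loss improves.
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
    # Stop as soon as the validation loss stops improving.
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
]
model.fit(X_train, y_train, validation_split=0.2,
          epochs=50, batch_size=128, callbacks=callbacks)
```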
3.1.1 Experiment number 1
The parameters of the model in the first experiment are shown in Table 5.
Table 7 Parameters of the model in experiment no. 2
For the binary classification problem between DGA domains and normal domains, the model gives the results in Table 6, with an accuracy of up to 99%. With this result, we suspect that part of the difference comes from the distribution of domain lengths. We will run other tests to further check the stability of the model.
3.1.2 Experiment number 2
Table 8 Results of experiment 2
Table 6 Results of experiment 1
In this experiment, we test the multi-class detection ability of the model with three measures: precision, recall, and F1. The parameters used in the model are presented in Table 7. For multi-class classification, we use an output layer of size 11, corresponding to the 11 labels to be classified (sketched below). The experimental results are presented in Table 8. For the normal domain (labeled 2), the Precision is 98% and the F1 is 99%. Our model gives the best results when classifying DGA domains belonging to the Post family (label 0) and the Monerodownloader family (label 3). In contrast, the model gives the worst results on the Qakbot family (label 6), misclassifying benign sites as malicious with a Precision of 52%. For the Murofet family (label 4) and the Tinba family (label 10), the model misclassifies DGA domain names as benign, with a Recall of 59%. In general, the F1 measure of the model in the multi-class classification problem reaches 90%. The macro average (macro avg) is 86% and the weighted average (weighted avg) is 91%.
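As a sketch of the output-layer change just described, assuming the Sequential build_model() sketch given earlier, the binary sigmoid head can be swapped for an 11-way softmax head; the loss choice assumes integer labels:

```python
from tensorflow.keras import layers

model = build_model()                              # binary sketch from above
model.pop()                                        # remove Dense(1, sigmoid)
model.add(layers.Dense(11, activation="softmax"))  # one unit per label
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```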
4 COMPARISON WITH OTHER DGA DETECTION METHODS
The evaluation was performed on a dataset from the same source [9] as the studies being compared. The comparison with the study of Chanwoong Hwang and colleagues, shown in Table 9, indicates that our model has a higher detection capability.
Table 10 compares the ability to detect the DGA domains labeled 4, 5, 6, 7, 8, 9, and 10 against the studies of Yanchen Qiao and Duc Tran. Yanchen Qiao [2] uses an LSTM with the Attention mechanism. Duc Tran's model [11] is a cost-sensitive version of the original LSTM, in which cost items are class-dependent, taking into account the importance of classification between classes. Our model exhibits good detection across four DGA families: Necurs, Qakbot, Ramnit, and Ranbyus, and performs worse on the Shiotob and Tinba families.
Table 9 Comparison
Chanwoong Hwang | Proposed model