A DEEP LEARNING MODEL THAT DETECTS DOMAINS GENERATED BY ALGORITHMS IN BOTNETS
Nguyen Trung Hieu (*) , Cao Chinh Nghia
Faculty of Mathematics - Informatics and application of science and technology in crime
prevention, The People's Police Academy
Abstract: Domain Generation Algorithms (DGAs) are the group of algorithms that generate domain names for attack activities in botnets. In this paper, we present a Bi-LSTM deep learning model based on the Attention mechanism to detect DGA-generated domains. In our experiments, the model gives good results in detecting DGA-generated domains belonging to the Post and Monerodownloader families. In general, the F1 measure of the model in the multi-class classification problem reaches 90%; the macro average (macro avg) is 86% and the weighted average (weighted avg) is 91%.
Keywords: Bi-LSTM deep learning network; deep learning; malicious URL detection;
Attention mechanism in deep learning
Received 1 June 2022
Revised and accepted for publication 26 July 2022
(*) Email: hieunt.dcn@gmail.com
1 INTRODUCTION
Botnet Attacks
The development of the Internet has brought many benefits to users, but it has also become an environment for cybercriminals to operate.
A botnet attack is one of the most common attacks. Each member of the botnet is called a bot. A bot is malicious software created by attackers to control infected computers remotely through a command and control server (C&C server). The bot has a high degree of autonomy and is equipped with the ability to use communication channels to receive commands and update malicious code from the control system. Botnets are commonly used to distribute malware, send spam, steal sensitive information, conduct phishing, or launch large-scale cyberattacks such as distributed denial of service (DDoS) attacks [1].
The widespread distribution of bots and the communication between bots and control servers usually require the Internet. The bots need to know the IP address of the control server to access it and receive commands. To avoid detection, command and control servers do not register static domain names; instead, they continuously change addresses and use different domains at different intervals. Attackers use a Domain Generation Algorithm (DGA) to generate different domain names for attacks [2], aiming to mask these command and control servers.
Identifying the attack behind a malicious domain can effectively reveal the purpose of the attack and the tools and malware used, enabling preventive measures that greatly reduce the damage the attack causes.
Domain Generation Algorithm
The Domain Generation Algorithm (DGA) can use operators in combination with ever-changing variables to generate random domain names. The variables can be day, month, and year values, hours, minutes, seconds, or other keywords. These pseudo-random strings are concatenated with a top-level domain (.com, .vn, .net, ...) to generate the domain names. The algorithm of the Chinad malware, written in Python [3], shows that the input seed includes the letters a-z and the digits 0-9 combined with day, month, and year values. The results are combined with the TLDs ('.com', '.org', '.net', '.biz', '.info', '.ru', '.cn') to form the complete domain names.
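To make the mechanism concrete, the following is a minimal, illustrative Python sketch of a Chinad-style DGA. It is not the actual malware code: the MD5 mixing function, the counter in the seed, and the 16-character label length are assumptions; it only shows how a date-based seed and the a-z0-9 alphabet combine with a TLD list to produce pseudo-random domains.

```python
import hashlib
from datetime import date

# Illustrative sketch of a Chinad-style DGA (not the actual malware code).
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
TLDS = ['.com', '.org', '.net', '.biz', '.info', '.ru', '.cn']

def generate_domains(day: date, count: int = 1000, length: int = 16):
    domains = []
    for i in range(count):
        # The seed combines the day, month, and year values with a counter.
        seed = f"{day.day}-{day.month}-{day.year}-{i}".encode()
        digest = hashlib.md5(seed).digest()
        # Map each digest byte onto the 36-symbol alphabet.
        label = "".join(ALPHABET[b % len(ALPHABET)] for b in digest[:length])
        domains.append(label + TLDS[i % len(TLDS)])
    return domains

print(generate_domains(date(2022, 6, 1), count=3))
```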
Table 1 Some DGA samples
Conficker: gfedo.info, ydqtkptuwsa.org, bnnkqwzmy.biz
Bigviktor: support.showremote-conclusion.fans, turntruebreakfast.futbol, speakoriginalworld.one
Cryptolocker: nvjwoofansjbh.ru, qgrkvevybtvckik.org, eqmbcmgemghxbcj.co.uk
Bamital: cd8f66549913a78c5a8004c82bcf6b01.info, a024603b0defd57ebfef34befde16370.org, 5e6efdd674c134ddb2a7a2e3c603cc14.org
Chinad: qowhi81jvoid4j0m.biz, 29cqdf6obnq462yv.com, 5qip6brukxyf9lhk.ru
A DGA can generate a large number of domains in a short time, and bots can select a small portion of them to connect to the C&C server. Table 1 shows some examples of domain names generated with DGAs [4]. The Chinad malware can generate 1,000 domain names per day using the letters a-z and digits 0-9. Bigviktor combines 3 to 4 different words from 4 predefined word lists (dictionaries) and can generate 1,000 domains per month.
Figure 1 depicts the connection process between the C&C server and the DGA domains [5]. The attacker uses the same DGA and initial seeds for the C&C server and the bots, so both generate the same set of domains. The attacker only needs to select a domain name from the generated list and register it for the C&C server one hour before performing the attack. The bots on the victims' machines then send domain name resolution requests for the generated list of domains to the Domain Name System (DNS). The DNS system returns the IP address of the corresponding C&C server, and the bots begin to communicate with the server to receive commands. If the C&C server is not found at the current domain, the bots query the next set of domains generated by the DGA until an active domain name is found [6].
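For illustration, the bot-side lookup loop just described can be sketched in a few lines of Python; the function name and return convention are hypothetical, and real bots add scheduling, fallbacks, and protocol handshakes on top of this bare DNS probing.

```python
import socket

def find_c2_server(candidate_domains):
    """Try each DGA-generated domain in turn; the first one that
    resolves in DNS is taken as the active C&C server."""
    for domain in candidate_domains:
        try:
            ip = socket.gethostbyname(domain)  # DNS resolution request
            return domain, ip                  # active domain found
        except socket.gaierror:
            continue                           # NXDOMAIN: query the next domain
    return None                                # wait for the next DGA batch
```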
2 MAIN CONTRIBUTIONS OF THE ARTICLE
The main contributions of the paper include:
1 - Introducing a deep learning approach using a Bidirectional Long Short-Term Memory (Bi-LSTM) model based on the Attention mechanism to detect domains created by DGAs. Our model has also worked well on the related problem of detecting malicious URLs [7].
2 - Presenting experimental results that show a significant improvement compared to previous techniques using open datasets.
The remainder of the paper is organized as follows: Section 2 presents related studies. Our deep learning network architecture and solution are presented in Section 3. Section 4 presents our experimental process, including the steps to select the dataset and the results obtained. Finally, Section 5 concludes with comments on the results achieved as well as future directions for the work.
Figure 1 DGA-based botnet communication mechanism
2.1 Related studies
In recent years, much research on botnet detection has been published. Nguyen Van Can and colleagues [8] proposed a model to classify benign domains and DGA domains based on Neutrosophic Sets. Testing on three datasets from Alexa, Bambenek Consulting [9], and 360Lab [4] shows that the model achieves an accuracy of 81.09%.
R. Vinayakumar et al. [10] proposed a DGA detection method based on analyzing the statistical features of DNS queries. Feature vectors are extracted from domain names by a text representation method, and optimal features are computed from the numerical vectors using the deep learning architecture in Table 2. The results show that the model has a high accuracy of 97.8%.
Yanchen Qiao et al. [2] proposed a method to detect DGA domain names based on an LSTM using the Attention mechanism. Their model, executed on the dataset from Bambenek Consulting [9], achieves an accuracy of 95.14%, an overall precision of 95.05%, a recall of 95.14%, and an F1 score of 95.48%.
Duc Tran [11] built LSTM.MI, a model that combines a binary classifier and a multiclass classifier on an unbalanced dataset, in which the original LSTM model is given a cost-sensitive adaptation mechanism. Cost items are included in the backpropagation learning process to account for the importance of distinguishing between classes. They demonstrate that LSTM.MI provides at least a 7% improvement in accuracy and macro-averaged recall over the original LSTM and other modern cost-sensitive methods, while maintaining high accuracy on non-DGA labels (F1 score of 0.9849).
2.2 Proposed model
Our proposed model includes an input layer, an embedding layer, two Bi-LSTM layers, an attention layer, and an output layer. The architecture of the model is shown in Figure 2 [7].
Table 2 DBD deep architecture [10]
The detection module takes as input a dataset of $T$ domains with the structure $\{(u_1, y_1), \dots, (u_T, y_T)\}$, where $x_t$ is a pair $(u_t, y_t)$ in which $u_t$ (with $t = 1, \dots, T$) is a domain in the training list and $y_t$ is its associated label.
Each domain, in its raw form, is processed in two steps before training to form the input vector (a minimal preprocessing sketch follows the steps below):
- Step 1: Cut off the TLD part of the domain name, then tokenize the raw data, converting the characters of the remaining string into integer-encoded data using Keras's Tokenizer library;
- Step 2: Normalize the encoded data from Step 1 to the same length. In this way, we convert the original domain strings into input vectors $V = \{v_1, v_2, v_3, \dots, v_T\}$. Each vector has a fixed length; any vector that is too short is padded with the value 0 to reach the required length.
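A minimal sketch of these two steps with the Keras preprocessing utilities, assuming character-level tokenization and a padded length of 38 (the sequence length suggested by the layer shapes in Table 5); the example domains are hypothetical placeholders:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical domains with the TLD already cut off (Step 1).
domains = ["qowhi81jvoid4j0m", "google"]

# Step 1 (continued): map each character to an integer index.
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(domains)
encoded = tokenizer.texts_to_sequences(domains)

# Step 2: pad every sequence with zeros to the same fixed length.
X = pad_sequences(encoded, maxlen=38, padding="post")
print(X.shape)  # (2, 38)
```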
Next, we use a bidirectional LSTM (Bi-LSTM) network to model the URL sequences based on this vector representation. The Bi-LSTM architecture has two hidden layers from two separate LSTMs, which capture long-range dependencies in two different directions. Since the output of the embedding layer is $V = \{v_1, v_2, v_3, \dots, v_T\}$, the forward LSTM reads the input from $v_1$ to $v_T$ and the backward LSTM reads the input from $v_T$ to $v_1$, producing a pair of hidden states $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$. We obtain the output of the Bi-LSTM layer by concatenating the two hidden states according to the formula:

$$h_i = [\overrightarrow{h_i}, \overleftarrow{h_i}]^T \quad (1)$$
The model uses two Bi-LSTM layers and the experimental dataset is quite large. Therefore, a Batch Normalization layer is used to normalize the data in each batch toward a normal distribution, stabilizing the learning process and greatly reducing the number of epochs needed to train the network, thereby increasing training speed.
As described in this paper, the hidden states at all positions are weighted with different Attention weights. We apply the Attention mechanism to capture the relationship between $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$. This information is aggregated with the features from the output of the second Bi-LSTM layer, which helps the model focus only on the important features instead of confounding or less valuable information.
Figure 2 Bi-LSTM network architecture
Initially, the weights $u_t$ are calculated based on the correlation between the input and output according to the following formula:

$$u_t = \tanh(W h_t + b) \quad (2)$$
These weights are then normalized into the Attention weight vector $\alpha_t$ using the softmax function:

$$\alpha_t = \frac{\exp(u_t)}{\sum_{k} \exp(u_k)} \quad (3)$$
Then the vector $c_t$ is calculated based on the Attention weight vector and the hidden states $h_1 \dots h_T$ as follows:

$$c_t = \sum_{t} \alpha_t h_t \quad (4)$$

The larger the value of $c_t$, the more important the role the feature $x_t$ plays in detecting the DGA domain.
Finally, to predict a domain, the calculation results are passed through a Dense layer with one neuron using the sigmoid activation function, which returns a value between 0 and 1. The resulting $y$ helps determine whether a domain is benign or DGA-generated. Thus, the input domain name is normalized into vector form, and this vector passes through the Embedding, Bi-LSTM, Batch Normalization, Bi-LSTM, and Attention layers before the output is produced. In addition, the model uses the Adam optimization algorithm with Keras's default parameters. To prevent the model from overfitting the data, we also apply the Dropout technique to the Bi-LSTM layers. The mechanism of Dropout is that, during training, each time the weights are updated we randomly remove a number of nodes in a layer so that the model cannot depend on any single node of the previous layer and instead tends to spread the weights evenly.
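Putting the pieces together, the following is a minimal Keras sketch of the described architecture. The vocabulary size, embedding dimension, LSTM width (64 units per direction, giving the (None, 38, 128) shapes of Table 5), and dropout rate are assumptions, and the attention layer is a simple additive attention-with-context implementation rather than the authors' exact code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

class AttentionWithContext(layers.Layer):
    """Additive attention: score each timestep (cf. formula 2), softmax
    the scores (formula 3), return the weighted sum of hidden states
    (formula 4)."""
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(dim, dim),
                                 initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(dim,), initializer="zeros")
        self.u = self.add_weight(name="u", shape=(dim,),
                                 initializer="glorot_uniform")
        super().build(input_shape)

    def call(self, h):                                  # h: (batch, T, dim)
        u_t = tf.tanh(tf.tensordot(h, self.W, axes=1) + self.b)
        scores = tf.tensordot(u_t, self.u, axes=1)      # (batch, T)
        alpha = tf.nn.softmax(scores, axis=1)           # attention weights
        return tf.reduce_sum(h * tf.expand_dims(alpha, -1), axis=1)

def build_model(vocab_size=40, max_len=38, embed_dim=128, lstm_units=64):
    model = models.Sequential([
        layers.Embedding(vocab_size, embed_dim),
        layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True,
                                         dropout=0.2)),  # Dropout on Bi-LSTM
        layers.BatchNormalization(),
        layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True,
                                         dropout=0.2)),
        AttentionWithContext(),
        layers.Dense(1, activation="sigmoid"),           # benign vs. DGA
    ])
    model.build(input_shape=(None, max_len))
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```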
2.3 Experiment
In this paper, we conduct two experiments:
1 - Experimentally check the accuracy of the model in two-class classification: domains generated by a DGA versus normal domains;
2 - Experimentally check the accuracy of the model in multi-class classification: detecting different DGA algorithms in a given dataset.
2.4 Evaluation Dataset
In this paper, we use a dataset consisting of DGA domains collected from Bambenek Consulting [9] and normal domains obtained from Alexa. For the two different tests, we use two different datasets.
Table 2 Summary of the collected dataset (columns: Domain Type, Sample)
Dataset for test 1: consists of 30,000 DGA domains with label 1 and 30,000 normal domains with label 0. This dataset is randomly shuffled, then divided into a training set and a test set. It contains 46 different types of DGA domain names, with the counts given in Table 2. The character-length distribution parameters of each type of domain name are given in Table 3: the smallest sample length is 6, the maximum length is 25, and the average length is 14.2 for DGA domain names and 9.6 for normal domain names.
Table 4 Label assignment for DGA types
Dataset for test 2: With the goal of testing multi-class classification, the DGA domain types used include the families Post, Kraken, Monerodownloader, Murofet, Necurs, Shiotob/urlzone/bebloh, Qakbot, Ramnit, Ranbyus, and Tinba, labeled according to Table 4. The dataset for test 2 includes 25,000 normal domain names and 25,000 domain names belonging to the DGA families.
3 PERFORMANCE METRICS
The performance of the algorithms is evaluated using the confusion matrix, in which:
• True negatives (TN) - benign sites that are predicted to be benign;
• True positives (TP) - malicious sites that are predicted to be malicious;
• False negatives (FN) - malicious sites that are predicted to be benign;
• False positives (FP) - benign sites that are predicted to be malicious.
From there we have the following measures.

Accuracy:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (5)$$
The article also uses precision, recall, and the F-measure, which are given by the following formulas:
$$Precision = \frac{TP}{TP + FP} \quad (6)$$

$$Recall = \frac{TP}{TP + FN} \quad (7)$$

$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \quad (8)$$
A high Precision value means that the accuracy of the points found is high. A high Recall means a high TP rate, i.e., the rate of missing truly positive points is low. The higher the F1, the better the classifier. In addition, we also use the binary cross-entropy (BCE) loss function to calculate the difference between two quantities: $\hat{y}$, the predicted label of a URL, and $y$, the correct label of that URL. The loss function forces the model to pay a penalty for each wrong prediction, with the penalty proportional to the severity of the error. The smaller the loss value, the better the model's predictions; conversely, if the predictions differ too much from reality, the loss value becomes larger.

$$BCE = -(y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})) \quad (9)$$
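These per-class and averaged measures can be computed, for example, with scikit-learn; the labels and probabilities below are hypothetical placeholders for the model's outputs:

```python
from sklearn.metrics import classification_report, log_loss

y_true = [0, 1, 1, 0, 1]                   # correct labels (1 = DGA)
y_prob = [0.1, 0.9, 0.4, 0.2, 0.8]         # sigmoid outputs of the model
y_pred = [int(p >= 0.5) for p in y_prob]   # threshold at 0.5

# Precision, recall, and F1 per class, plus the macro avg and
# weighted avg rows reported in the experiments.
print(classification_report(y_true, y_pred, digits=2))

# Binary cross-entropy (formula 9) averaged over the samples.
print("BCE:", log_loss(y_true, y_prob))
```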
Table 5 Parameters of the model in experiment no. 1

Layer | Output shape
bidirectional | (None, 38, 128)
batch_normalization | (None, 38, 128)
bidirectional_1 | (None, 38, 128)
attention_with_context | (None, 38, 128)
3.1 Experimental results
The model is built on the basic configuration of the Kaggle platform with a Keras kernel and TensorFlow backend. It uses ModelCheckpoint to save the training process and EarlyStopping to stop training immediately once the best value is found.
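A sketch of the training call with these two callbacks; the file name, patience, batch size, and epoch count are assumptions, and model, X_train, and y_train come from the earlier sketches:

```python
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

callbacks = [
    # Save the weights each time the validation loss improves.
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
    # Stop as soon as the validation loss stops improving.
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
]
model.fit(X_train, y_train, validation_split=0.2,
          epochs=50, batch_size=128, callbacks=callbacks)
```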
3.1.1 Experiment number 1
The parameters of the model in the first experiment are shown in Table 5.
Table 7 Parameters of the model in experiment no. 2
For the binary classification problem between DGA domains and normal domains, the model gives the results in Table 6, with an accuracy of up to 99%. With this result, we suspect that part of the difference comes from the distribution of domain lengths. We will run other tests to further check the stability of the model.
3.1.2 Experiment number 2
Table 8 Results of experiment 2
Table 6 Results of experiment 1
In this experiment, we test the multi-class detection ability of the model with three measures: precision, recall, and F1. The parameters used in the model are presented in Table 7. For multi-class classification, we use an output layer of size 11, corresponding to the 11 labels to be classified (sketched below). The experimental results are presented in Table 8. For the normal domain (labeled 2), the Precision is 98% and the F1 is 99%. Our model gives the best results when classifying DGA domains belonging to the Post family (label 0) and the Monerodownloader family (label 3). In contrast, the model gives the worst results on the Qakbot family (label 6), misclassifying benign sites as malicious with a Precision of 52%. For the Murofet family (label 4) and the Tinba family (label 10), the model misclassifies DGA domain names as benign, with a Recall of 59%. In general, the F1 measure of the model in the multi-class classification problem reaches 90%. The macro average (macro avg) is 86% and the weighted average (weighted avg) is 91%.
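As a sketch of the output-layer change just described, assuming the Sequential build_model() sketch given earlier, the binary sigmoid head can be swapped for an 11-way softmax head; the loss choice assumes integer labels:

```python
from tensorflow.keras import layers

model = build_model()                              # binary sketch from above
model.pop()                                        # remove Dense(1, sigmoid)
model.add(layers.Dense(11, activation="softmax"))  # one unit per label
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```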
4 COMPARISON WITH OTHER DGA DETECTION METHODS
The evaluation was performed on a dataset from the same source [9] as the studies being compared. The comparison with the study of Chanwoong Hwang and colleagues, shown in Table 9, indicates that our model has a higher detection capability.
Table 10 compares the ability to detect the DGA domains labeled 4, 5, 6, 7, 8, 9, and 10 against the studies of Yanchen Qiao and Duc Tran. Yanchen Qiao [2] uses an LSTM with the Attention mechanism. Duc Tran's model [11] is a cost-sensitive version of the original LSTM, in which cost items are class-dependent, taking into account the importance of classification between classes. Our model exhibits good detection across four DGA families: Necurs, Qakbot, Ramnit, and Ranbyus, and performs worse on the Shiotob and Tinba families.
Table 9 Comparison
Chanwoong Hwang | Proposed model