HTTP-BASED BOTNET DETECTION USING
NETWORK TRAFFIC TRACES
A Dissertation Submitted to Southeast University For the Academic Degree of Doctor of Engineering
BY TRUONG DINH TU
Supervised by Prof CHENG Guang
School of Computer Science and Engineering
Southeast University
November 2015
Abstract
Botnets are generally recognized as one of the most serious threats on the Internet today, because they serve as platforms for the vast majority of large-scale and coordinated cyber-attacks, such as distributed denial of service, spamming, and information theft. Detecting botnets is therefore of great importance, and security researchers have studied this threat and proposed many effective botnet detection approaches.
However, botnet developers are constantly developing new techniques to improve their bots and evade detection by security researchers. In recent years, HTTP-based botnets have become more widespread and have caused enormous damage to many government organizations and industries. New-generation HTTP botnets tend to use techniques called DGA (Domain Generation Algorithm), domain-flux, or fast-flux to avoid detection: some botnets use the domain-flux technique to evade being blacklisted, while others use the fast-flux technique to hide the true location of their servers.
Therefore, the main research objective of this dissertation is to build solutions for detecting HTTP botnets whose operators use techniques such as DGA, domain-flux, or fast-flux to evade detection. To achieve this goal, the dissertation addresses three main problems: (1) detecting the presence of machines infected by domain-flux or DGA-based botnets inside an enterprise or monitored network; (2) detecting the C&C servers of botnets that use domain-flux or DGA-based evasion techniques; (3) detecting malicious Fast-Flux Service Networks (FFSNs). The main contents of these three research works are summarized as follows:
The first problem is how to identify the presence of machines infected by domain-flux or DGA-based botnets inside the enterprise or monitored network. To answer this question, multiple well-known domain-flux or DGA-based botnet samples are collected, such as the Kraken, Zeus, Conficker, Bobax, and Murofet botnets. We then execute these bot samples in a virtual machine environment to obtain network traffic traces. Through examining and analyzing a large number of network traffic traces, we discover that these botnets exhibit many similar periodic behaviors when querying domain names. In addition, the evidence from this study shows that machines infected by domain-flux or DGA-based bots often query a large number of non-existent domain names with similar periodic time-interval series while looking for their C&C server. Normal legitimate hosts have no reason to query a large number of different domain names with similar periodic time-interval series and thereby generate high volumes of NXDomain replies; this behavior occurs only on hosts infected by domain-flux or DGA-based bots. Based on these characteristics, we propose a method that analyzes the correlation between each pair of time-interval series of queries in order to cluster similar domain names. The experimental results show that domain names generated by the same botnet or DGA are grouped into the same clusters, and the hosts that queried the domains in these clusters are marked as compromised hosts running a given domain-flux or DGA-based botnet. This work is not comprehensive enough to detect all bot-infected machines; it is effective only for detecting machines infected by domain-flux or DGA-based bots inside the monitored network. The results of this research motivated us to consider a new method to detect botnet C&C servers, which is part of our next research work (Chapter 4).
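The core of this clustering step can be sketched as follows. This is only a minimal illustration of the idea, not the dissertation's actual algorithm: the autocorrelation estimator, the mean-absolute-difference distance, and the threshold h are simplified assumptions chosen for the example (Chapter 3 gives the real method).

```python
def autocorr_coeffs(intervals, max_lag=10):
    """Estimated autocorrelation coefficients of an inter-query interval series."""
    n = len(intervals)
    mean = sum(intervals) / n
    var = sum((x - mean) ** 2 for x in intervals)
    if var == 0:
        # A perfectly constant interval series is treated as maximally periodic.
        return [1.0] * max_lag
    return [
        sum((intervals[t] - mean) * (intervals[t + lag] - mean)
            for t in range(n - lag)) / var
        for lag in range(1, max_lag + 1)
    ]

def cluster_domains(series_by_domain, h=0.05, max_lag=10):
    """Greedy clustering: a domain joins a cluster if its autocorrelation
    vector differs from some member's by less than h (mean absolute diff)."""
    coeffs = {d: autocorr_coeffs(s, max_lag) for d, s in series_by_domain.items()}
    clusters = []
    for d, c in coeffs.items():
        for cl in clusters:
            if any(sum(abs(a - b) for a, b in zip(c, coeffs[e])) / max_lag < h
                   for e in cl):
                cl.append(d)
                break
        else:
            clusters.append([d])
    return clusters
```

Domains queried with near-identical periodic interval series (typical of bots polling for their C&C server) end up in the same cluster, while domains with irregular query timing do not.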
The second problem is how to detect the C&C servers of domain-flux or DGA-based botnets. Several previous approaches [1-4] have addressed this threat, and their strategies have produced useful results. Yadav et al. [1] presented a technique to detect the C&C domains of DGA-based botnets by looking at the distribution of unigrams and bigrams in all domain names. However, a unigram- and bigram-based technique may not suffice, especially for detecting domains generated by the Kraken, Bobax, or Murofet botnets, because the distributions of unigrams and bigrams in the domains of these botnets do not differ significantly from those of benign domains. To overcome this limitation, our work aims to improve on and extend the work of Yadav et al. [1]. We calculate the frequency of occurrence of n-grams (n = 3, 4, 5) in benign domain names and then assign a score to each gram. To distinguish domains generated by legitimate users from those generated by botnets, we present a method to measure the expected score of a domain (ESOD) and combine it with two other features fed into a classifier that we previously trained to separate bot-generated domain names from human-generated ones. We use five different machine learning algorithms to train classifiers and evaluate the detection effectiveness of each algorithm. The experimental results show that the decision tree algorithm (J48) is the best classifier and detects botnets more efficiently than the other algorithms. The evidence from the experimental results demonstrates that our proposed approach can be used to detect botnets in the monitored network efficiently. The details of the method are presented in Chapter 4 of the dissertation.
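The n-gram scoring idea can be illustrated with a small sketch. The scoring scheme and the ESOD formula below (mean score over a domain's 3-5-grams) are simplified assumptions for illustration only; the actual features and the trained classifier are described in Chapter 4.

```python
from collections import Counter

def build_ngram_scores(benign_domains, ns=(3, 4, 5)):
    """Score each n-gram by its frequency of occurrence in benign domain names."""
    counts = Counter()
    for d in benign_domains:
        name = d.split(".")[0].lower()
        for n in ns:
            for i in range(len(name) - n + 1):
                counts[name[i:i + n]] += 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def esod(domain, scores, ns=(3, 4, 5)):
    """Expected score of a domain (ESOD), here taken as the mean score over
    its n-grams. Domains full of never-seen grams (typical DGA output)
    score near zero."""
    name = domain.split(".")[0].lower()
    grams = [name[i:i + n] for n in ns for i in range(len(name) - n + 1)]
    if not grams:
        return 0.0
    return sum(scores.get(g, 0.0) for g in grams) / len(grams)
```

A human-readable domain built from common grams scores well above a random-looking DGA domain, so the ESOD value separates the two populations and can be fed to a classifier as one feature.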
The final problem is how to detect malicious fast-flux service networks using feature-based machine learning classification techniques. Several approaches have been developed to detect FFSNs [5-8]. The characteristic of an FFSN is that one or more domain names are resolved to multiple (hundreds or even thousands of) different IP addresses with short time-to-live values and rapidly (fast) changing DNS answers. The classification process therefore needs to rely on data gathered from the completely unpredictable timing of queries sent by various users. The approaches proposed in [5-8] use a small amount of active DNS traffic, so they cannot obtain as many resolved IP addresses of malicious fast-flux networks as possible; this disadvantage may increase the false positive and false negative rates. This limitation can be overcome if a passive DNS replication method is deployed. In this study, we build a PassiveDNS tool that sniffs traffic from an interface or reads a pcap file and outputs the DNS server answers to a log file (DNSlog). This is a technique for reconstructing a partial view of the data available in the Domain Name System in a central database, where it can be indexed and queried. The DNSlog databases are extremely useful for a variety of purposes, as they can answer questions that are difficult or impossible to answer with the standard DNS protocol, such as: where did this domain name point in the past? What domain names are hosted by a given name server? What domain names point into a given IP network? What subdomains exist below a certain domain name? We also define a DNSlog data aggregate to facilitate tracking and managing the query/response information related to each domain. Moreover, Holz et al. [7] focus on just three features derived from active DNS queries (the number of DNS "A" records, the number of DNS "NS" records, and the number of distinct Autonomous Systems (ASes)), and Passerini et al. [8] employ 9 different features, while we use 16 key features to train classifiers. Among the 16 introduced features, 12 are proposed for the first time in this dissertation. The advantage of our approach is that it is able to detect a wide range of fast-flux domains, including malware domains, with a significant detection effect. The experimental results show that our method produces a lower false positive rate (FPR) of 0.13%, compared to the FPR of 6.17% produced by [7] and 4.08% produced by [8]. The details of the method are presented in Chapter 5 of this dissertation.
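A minimal sketch of the per-domain DNSlog aggregation and a few classic fast-flux features (number of resolved IPs, number of distinct ASes, mean TTL) might look like the following. The record format, the thresholds, and the toy decision rule are illustrative assumptions, not the 16-feature classifier of Chapter 5.

```python
from collections import defaultdict

def aggregate(dns_answers):
    """Aggregate passive-DNS answers (domain, ip, ttl, asn) per domain,
    mimicking one DNSlog record per domain from first-seen to last-seen."""
    agg = defaultdict(lambda: {"ips": set(), "asns": set(), "ttls": []})
    for domain, ip, ttl, asn in dns_answers:
        rec = agg[domain]
        rec["ips"].add(ip)
        rec["asns"].add(asn)
        rec["ttls"].append(ttl)
    return agg

def fastflux_features(rec):
    """Three classic features derived from the aggregated record."""
    return {
        "n_ips": len(rec["ips"]),
        "n_asns": len(rec["asns"]),
        "mean_ttl": sum(rec["ttls"]) / len(rec["ttls"]),
    }

def looks_fast_flux(feat, min_ips=10, min_asns=3, max_ttl=300):
    """Toy rule: many resolved IPs spread over several ASes with short TTLs."""
    return (feat["n_ips"] >= min_ips and feat["n_asns"] >= min_asns
            and feat["mean_ttl"] <= max_ttl)
```

In the feature-based approach, such per-domain feature vectors are not thresholded by hand as above but fed to a trained classifier.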
Keywords: HTTP botnet, C&C Server, Domain Generation Algorithm (DGA),
Domain-Flux, Fast-Flux
Table of Contents
Abstract (in Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Chapter 1 Introduction
1.1 Botnet Definition
1.1.1 Bot and Botnet
1.1.2 History of the Botnet
1.1.3 Botnet Architecture
1.1.4 Botnet Lifecycle
1.2 Evolution of Botnet
1.2.1 IRC-Based Botnet
1.2.2 P2P-Based Botnet
1.2.3 HTTP-Based Botnet
1.3 Motivation and Challenges
1.4 The Goal of the Dissertation
1.5 Contributions and Outline of the Dissertation
1.5.1 Contributions
1.5.2 Outline of the Dissertation
Chapter 2 Background and Related Works
2.1 Botnet Detection Techniques
2.1.1 Honeypot-based Detection
2.1.2 Anomaly-based Detection
2.1.3 DNS-based Detection
2.1.4 Mining-based Detection
2.2 Detection Evasion Techniques
2.2.1 DGA-Based Technique
2.2.2 Fast Flux-Based Technique
2.3 Related Works
Chapter 3 Detecting DGA-Bot Infected Machines Based on Analyzing the Similar Periodicity of Domain Queries
3.1 Introduction
3.2 Proposed Methods
3.2.1 System Overview
3.2.2 Filtering DNS Traffic
3.2.3 Similarity Analyzer
3.2.4 Clustering
3.3 Experiment Results
3.3.1 Bot Samples Collection
3.3.2 DNS Traffic Extraction
3.3.3 Detection and Clustering
3.4 Discussions
3.5 Conclusion and Future Work
Chapter 4 Detecting C&C Servers of Botnets with Analysis of Network Traffic Features
4.1 Introduction
4.2 Related Works
4.3 Proposed Approach
4.3.1 System Overview
4.3.2 Training Phase
4.3.3 Detecting Phase
4.3.4 Feature Extraction
4.3.5 C&C Detection
4.4 Experiments and Evaluation
4.4.1 Preparing the Training Data Set
4.4.2 Evaluation of Feature Selection
4.4.3 The Classifier Comparison
4.4.4 Evaluation of the Detection Rate on Real-World DNS Traffic
4.4.5 Comparison with Other Approaches
4.5 Discussion
4.6 Conclusion
Chapter 5 Detecting Malicious Fast-Flux Service Networks Using Feature-Based Machine Learning Classification Techniques
5.1 Introduction
5.2 Related Works
5.3 Proposed Methods
5.3.1 System Overview
5.3.2 Data Aggregate
5.3.3 Data Pre-filtering
5.3.4 Feature Extraction
5.4 Experiment and Evaluation
5.4.1 Data Set
5.4.2 Experimental Results
5.4.3 Comparison with Previous Works
5.5 Conclusion
Chapter 6 Conclusion and Future Works
6.1 Summary of Research and Conclusions
6.2 Limitation and Future Work
Bibliography
Acknowledgements
List of Publications
List of Figures
Figure 1.1: The attacks of typical botnet
Figure 1.2: IRC/HTTP botnet C&C architectures
Figure 1.3: P2P Botnet C&C architectures
Figure 1.4: C&C architectures of hybrid P2P botnet
Figure 1.5: The life cycle phases of botnet
Figure 1.6: Bots timeline
Figure 1.7: Bots trends
Figure 1.8: Outline of the dissertation
Figure 2.1: Normal content retrieval process
Figure 2.2: Single-flux content retrieval process
Figure 2.3: Double-flux content retrieval process
Figure 3.1: Framework of the detection system
Figure 3.2: DNS traffic generated by DGA-bots
Figure 3.3: The diagram for submitting files to VirusTotal
Figure 3.4: The VirusTotal scan results report
Figure 3.5: An example of the similar periodic time intervals of DNS queries
Figure 3.6: An example of clustering domains based on the estimated autocorrelation coefficients of DNS queries
Figure 3.7: Number of clusters with various thresholds
Figure 3.8: Number of distinct domain names in each cluster with the threshold (h=0.0266)
Figure 4.1: Framework of the detection system
Figure 4.2: Cumulative distribution function (CDF) for the features of benign and malicious domain names
Figure 4.3: Comparing prediction performance of classifiers based on ROC curve and AUC values
Figure 4.4: Comparison of overall accuracy, precision and AUC between different classifiers
Figure 4.5: The sequence of some similar periodic DNS queries of the Zeus botnet
Figure 4.6: Clustering domains by similar periodicity of DNS requests
Figure 4.7: Comparison of the distribution of bigram and 3-5-gram characters between benign and malicious domains
Figure 4.8: Comparison of the distribution of extracted feature values between benign and malicious domain names
Figure 5.1: Overview of the proposed detection system
Figure 5.2: Comparing the distribution of feature values: a) F1 feature, b) F2 feature
Figure 5.3: Comparing the distribution of feature values: a) F3 feature, b) F4 feature
Figure 5.4: Comparing the distribution of F5a and F5b features
Figure 5.5: A view of the relationship between domains and IP addresses: (a) benign domain; (b) malicious domain
Figure 5.6: Comparing the distribution of IP stability (F6) feature values
Figure 5.7: Comparing the distribution of features: a) F10 feature, b) F11 feature
Figure 5.8: Comparing the mean and 95% confidence interval between legit, malware and fast-flux domains for each extracted feature
Figure 5.9: Classification accuracy and precision
Figure 5.10: ROC curve comparing the prediction performance of classifiers by signifying the tradeoff between True Negative (TN) rate and True Positive (TP) rate
List of Tables
Table 1.1: Summary of selective well-known botnets in history
Table 3.1: The bot samples collected
Table 3.2: Example of time intervals (seconds) of queries to domain names generated by bot samples
Table 3.3: The estimated autocorrelation coefficients of time interval series for different lags (R=10)
Table 4.1: Excerpt of some N-gram scores (N=3, 4, 5) in legitimate domain names from the Alexa Top 100,000 sites
Table 4.2: Comparison of experimental results between classifiers using 10-fold CV and percentage split
Table 4.3: The DNS traffic data obtained from our experiment
Table 4.4: The detection results for C&C domains
Table 4.5: The detection results for C&C servers' IP addresses
Table 4.6: The performance comparison between our approach and Ying et al.
Table 5.1: Example of raw DNS traffic data sniffed from a network interface
Table 5.2: Example of aggregate information in the queries and responses from first seen to last seen
Table 5.3: List of selected features
Table 5.4: Examples of some legitimate and malicious domain names using the selected features (F11 is measured in hours)
Table 5.5: Labeled DAT dataset based on prior information regarding known flux domains, known malware domains, and legitimate popular domains
Table 5.6: The detection results on a dataset not seen previously
Table 5.7: Comparative study between [7], [8] and our proposal
List of Abbreviations
Bot-master: A bot-master is a person who operates the command and control (C&C) center of a botnet for remote process execution.
Botnet: A botnet is a collection of compromised computers, often referred to as "zombies," infected with malware that allows an attacker to control them.
Chapter 1 Introduction
1.1 Botnet Definition
1.1.1 Bot and botnet
We begin with the definitions of "bot" and "botnet," the key terms of the dissertation. A bot (robot) is a malicious software program installed on a vulnerable host that is capable of performing a series of malicious actions without the user's consent. The bot code is usually written by criminal groups to perform malicious attacks and activities. Sometimes, the term "bot" is also used to refer to the bot-infected computer (also called a "compromised" or "zombie" computer). A family of bots is an aggregate of bots sharing the same source code; in other words, these are different versions of the same bot, even if they are served by different C&C servers.
Figure 1.1: The attacks of typical botnet
A botnet is a collection of bot-infected computers receiving and responding to commands from a server (the C&C server) under the control of an attacker (usually referred to as a "botmaster," "botherder," or "cybercriminal") [9]. The ultimate goal of a botnet is to carry out certain profitable attack activities, such as sending spam, conducting distributed denial of service (DDoS) attacks, stealing personal information, bank fraud, phishing, or other malicious activities (see Figure 1.1). Botnets are the root cause of many Internet attacks and have become one of the most dangerous threats to Internet security in recent years, since they can mount large-scale coordinated attacks and launch illicit activities using multiple infected computers.
1.1.2 History of the Botnet
The concept of the botnet emerged in 1993 with the introduction of the first bot, called 'Eggdrop', originally developed by Robey Pointer [10]. Eggdrop is the world's most popular open-source IRC (Internet Relay Chat) bot, designed for flexibility and ease of use, and freely distributable under the GNU General Public License (GPL). In the beginning, IRC bots were used for legitimate purposes, such as keeping an IRC channel open when no user is logged in or maintaining control of the IRC channel. The first malicious botnets ushered in a new age of Trojan horses: botnets are the melding of many threats into one. They have become a major tool for cybercrime, partly because they can be designed to disrupt targeted computer systems in different ways very effectively, and because a malicious user without strong technical skills can initiate these disruptive effects in cyberspace simply by renting botnet services from a cybercriminal.
This history draws on data from various sources, including recent hearings on crime and terrorism [11]; lists of botnets that appear on large public websites and on the websites of major IT firms (e.g., Microsoft), cybersecurity institutes (e.g., Symantec), and news agencies; and academic journals and conference proceedings. Table 1.1 briefly describes the history of bot samples discovered from 1993 to 2014, including each botnet's name, alias, year of discovery, estimated number of bots, spam capacity, and communication type. Table 1.1 may not cover every kind of bot on the Internet, but it is a summary list of selected well-known bots.
Table 1.1: Summary of selective well-known botnets in history

Year | Botnet     | Alias                 | Estimated no. of bots | Spam capacity (billion/day) | Type | Reference
2002 | Sdbot/Rbot | IRC-SDBot             | -           | -   | IRC  | [13]
2002 | Agobot     | W32.HLLW.Gaobot       | -           | -   | IRC  | [14]
2003 | Sinit      | Win32.Sinit           | -           | -   | P2P  | [15][16]
2007 | -          | -                     | 1,300,000   | -   | IRC  | [21]
2007 | Srizbi     | Cbeplay, Exchanger    | 450,000     | 60  | IRC  | [15]
2007 | Storm      | Nuwar, Peacomm        | 160,000     | 3   | P2P  | [22]
2008 | Mariposa   | -                     | 12,000,000  | -   | -    | [23]
2008 | Torpig     | Sinowal, Anserin      | 180,000     | -   | -    | [24]
2008 | Lethic     | -                     | 260,000     | -   | -    | [25]
2008 | Kraken     | Kracken               | 495,000     | -   | -    | [26]
2008 | Sality     | Sector, Kuku, Kookoo  | 1,000,000   | -   | -    | [27]
2008 | Waledac    | Waled, Waledpak       | 80,000      | -   | -    | [28]
2008 | Conficker  | DownadUp, Kido        | 10,500,000+ | -   | -    | [29]
2008 | Bobax      | Bobic, Oderoor        | 185,000     | -   | -    | [30]
2008 | Asprox     | -                     | 15,000      | -   | -    | [31]
2008 | Gumblar    | -                     | -           | -   | -    | [32]
2008 | Mega-D     | Ozdok                 | 509,000     | -   | -    | [33]
2009 | Zeus       | Zbot, PRG, Wsnpoem    | 3,600,000   | n/a | HTTP | [34]
2009 | BredoLab   | Oficla                | 30,000,000  | 3.6 | HTTP | [35]
2009 | Donbot     | Buzus, Bachsoy        | 125,000     | 0.8 | HTTP | [36]
2009 | Wopla      | Pokier, Slogger       | 20,000      | 0.6 | HTTP | [37]
2010 | Kelihos    | Hlux                  | 300,000+    | 4   | P2P  | [38]
2010 | LowSec     | LowSecurity           | 11,000+     | 0.5 | HTTP | [30]
1.1.3 Botnet Architecture
1.1.3.1 Centralized C&C
The centralized C&C approach is similar to the traditional client/server architecture. Typical examples of this type of botnet architecture are those implemented over the Internet Relay Chat (IRC) protocol [43]. In a centralized C&C infrastructure, all bots establish a strong communication channel with one or multiple connection points. Servers deployed at these connection points are responsible for sending commands to the bots and for providing malware updates. IRC [43] and the Hyper Text Transfer Protocol (HTTP) [44] are considered the main protocols in the centralized architecture (see Figure 1.2).
Figure 1.2: IRC/HTTP botnet C&C architectures
The advantages of a centralized architecture are: (1) direct feedback, which enables the botmaster to easily monitor the status of the botnet; (2) low latency, because each bot receives commands directly from a single server; (3) easy construction and deployment, as the structure is simple and does not require any specialized hardware; and (4) quick reaction times, because the server coordinates directly with its bots without a third party intervening.
In the case of IRC botnets, a botmaster creates IRC channels to which the zombie machines connect and then wait for commands to perform a malicious activity. Even though the IRC protocol is very flexible and suitable for use as a C&C channel, it has serious limitations, because it is normally easy to detect and shut down the operation of an IRC botnet. Detection is facilitated because IRC traffic is not common and is rarely used in corporate networks; in fact, it is usually blocked. Therefore, a network administrator may prevent IRC botnet activity simply by detecting IRC traffic in the network and blocking it with firewalls.
Due to the restrictions on IRC traffic in corporate networks, the Hyper Text Transfer Protocol (HTTP) [44] became popular as a mechanism for implementing C&C communication. HTTP is the protocol most commonly used for delivering data over the Internet, including human-readable content such as websites and images, and binary data transported in uploads and downloads. It is more favorable than the IRC protocol because HTTP traffic is permitted in most networks connected to the Internet and is rarely filtered, which makes it especially attractive for bot operators seeking to disguise the communication between bots and botmaster. According to the Symantec Global Internet Security Threat Report [45], centralized C&C servers based on HTTP make up 69% of all C&C servers and are therefore the most common way to control a botnet.
The main disadvantages of the centralized architecture are: (1) the C&C server itself is a single point of failure for the entire botnet [16]; if the central C&C server is detected and blocked, the entire botnet is disbanded; (2) very low resiliency: the IP lists of all bots contained in the server reveal all bots' locations and make it easy to enumerate the number of bots in the botnet.
To deal with this problem, new techniques such as IP fluxing and domain fluxing have been adopted by botnets over time. Typical examples of botnets using HTTP for communication are Zeus and Conficker, which commonly rely on a technique called domain fluxing to dynamically generate a large number of Pseudo-random Domain Names (PDNs) that botnet operators use to control their bots. This domain-fluxing strategy is an effective technique for evading detection by monitoring systems: if one or more C&C domain names are identified and taken down, the bots relocate the C&C domain name via DNS queries to the next set of automatically generated domains.
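A toy DGA illustrates how domain fluxing works: the bots and the botmaster derive the same daily list of pseudo-random domain names from a shared seed, so the botmaster only needs to register one candidate while defenders would have to block them all. The hashing scheme, label length, and TLD below are arbitrary illustrative choices, not any real botnet's algorithm.

```python
import hashlib
from datetime import date

def generate_domains(seed, day, count=50, tld=".info"):
    """Toy DGA: derive a deterministic daily list of pseudo-random domain
    names from a shared seed and the current date."""
    domains = []
    state = f"{seed}:{day.isoformat()}".encode()
    for i in range(count):
        digest = hashlib.sha256(state + str(i).encode()).hexdigest()
        # Map the first 12 hex characters of the digest to a 12-letter label.
        label = "".join(chr(ord('a') + int(c, 16) % 26) for c in digest[:12])
        domains.append(label + tld)
    return domains
```

Because the list depends only on the seed and the date, every bot computes the same candidates each day and queries them until one resolves to a live C&C server; the unanswered queries are exactly the bursts of NXDomain replies exploited by the detection method of Chapter 3.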
1.1.3.2 Decentralized C&C
A decentralized C&C architecture [46] has no centralized C&C infrastructure; the bots behave both as servers and as clients. A common term for this type of botnet is peer-to-peer (P2P) botnet, after the corresponding network model. The P2P botnet architecture is illustrated in Figure 1.3. Each bot in the network can act as a server or as a client. When acting as a client, a bot performs the attack against the victim computer. If the bot assumes the role of a server, it distributes messages to the client bots whose addresses are on its peer list. To issue a command to a P2P botnet, the botmaster injects the command into several bots that she trusts; these bots execute the command and pass it to all bots listed in their peer lists, and so on.
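The command propagation just described amounts to a flood (a breadth-first traversal) over the peer-list graph. The sketch below uses hypothetical bot names purely for illustration.

```python
from collections import deque

def propagate(peer_lists, seeds):
    """Simulate botmaster command flooding: each bot forwards the command
    to every peer in its list; returns the set of bots that receive it."""
    reached = set(seeds)
    queue = deque(seeds)
    while queue:
        bot = queue.popleft()
        for peer in peer_lists.get(bot, []):
            if peer not in reached:
                reached.add(peer)
                queue.append(peer)
    return reached
```

Note that a command injected into a few trusted seed bots only reaches the bots transitively connected to them, which is why delivery in a P2P botnet is not guaranteed when parts of the peer graph are offline.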
The advantages of a decentralized C&C architecture are: (1) high resiliency: due to the lack of a centralized server and the multiple communication paths between bot clients, P2P botnets are highly resilient to shutdown and hijacking; should a bot client be detected and monitored, it will reveal at most the N other bots in its peer list; (2) no central C&C server to be found and disabled: even if we discover several bots operating under a single botnet, we cannot be sure of destroying the entire botnet; (3) hard to enumerate, as it is often difficult for security researchers to gauge the overall size of a P2P botnet.
The disadvantages of a decentralized C&C architecture are: (1) high latency: command latency is a problem for decentralized botnets, because the delivery of commands is not guaranteed and can be delayed; for example, if some bots are down or unavailable at the time commands are delivered, the botmaster may lose control of a significant part of the botnet, which makes real-time activities difficult; (2) very high complexity: the P2P structure is very complicated and requires much effort to plan, implement, and operate the botnet well.
Figure 1.3: P2P Botnet C&C architectures
1.1.3.3 Hybrid C&C
To overcome the limitations of the above architectures, bot operators may combine different architectures to increase the survivability and sophistication of their attacks. Hybrid architectures utilize the advantages of both centralized and P2P architectures (see Figure 1.4). The hybrid model classifies bots into two categories: servant bots and client bots [16]. A servant bot acts as a client and a server simultaneously and is configured with a routable (static) IP address; in contrast, a client bot does not listen for incoming connections, as it is configured with a non-routable (dynamic) IP address. Servant bots send their IP address information to the peer list and stay in listening mode, watching a port for incoming connections. In addition, servant bots have the responsibility of applying symmetric keys to each communication in order to make botnet detection harder.
The advantages of the hybrid C&C architecture are: (1) high resiliency: this architecture is layered and very complicated, which increases its resilience; (2) hard to enumerate: it is hard for a security investigator to estimate the scale of the botnet.
Figure 1.4: C&C architectures of hybrid P2P botnet
The disadvantages of the hybrid C&C architecture are: (1) very high complexity: it requires advance planning, as it may consist of many layers of sub-botnets, and much effort to construct and maintain; (2) very high latency: this architecture suffers from high latency, since communication must traverse multiple paths to reach its destination.
Only a few botnets employ a hybrid architecture, due to its high complexity. Among them are the infamous Storm and its descendant Waledac [47].
1.1.4 Botnet lifecycle
For an infected host to become an active bot and part of a botnet, the host must go through a cycle of phases, as described in [15, 48-50]. These phases are sometimes identified by different names, but in general the botnet life cycle is as described in Figure 1.5. Botnets differ in size and structure, but in general they go through the same stages in their lifecycle. A botnet lifecycle is defined as the sequence of phases a botnet needs to pass through in order to reach its goal. Silva et al. [51] classify the life cycle of a botnet into five phases (see Figure 1.5).
Figure 1.5: The life cycle phases of botnet
The first phase is the initial infection, wherein a host is infected and becomes a potential bot. During this phase, attackers attempt to infect victim machines in different ways. A computer can be infected either directly over the network or indirectly through user interaction. Users can inadvertently execute malicious code, or attackers can exploit known system vulnerabilities to create a backdoor. Users may also accidentally download and execute scripts or malicious programs while surfing a website, opening an email attachment, or clicking a link in an incoming message [48-50, 52].
The second phase, the secondary injection, requires that the first phase be completed successfully. After the successful initial infection (i.e., the malicious code has been installed on the victim computer), the infected computer executes a script known as shell-code. The shell-code fetches the image of the actual bot binary from a specific location via FTP, HTTP, or P2P [49, 50], and the bot binary installs itself on the target machine. Once the bot program is installed, the victim computer joins the botnet army under the control of a specific bot-master.
The connection or rally phase: rally [15] is the process of establishing a connection with the C&C; some authors call this the connection phase [49]. In fact, this phase is scheduled every time the host is restarted, to assure the botmaster that the bot is taking part in the botnet and is able to receive instructions, get commands, and run commands to perform malicious activities. Therefore, the connection or rally phase is likely to occur several times during the bot life cycle [53].
After establishing the command and control channel, the bot waits for commands to perform malicious activities [48-50, 52]. The bot thus passes into phase 4 and is ready to perform an attack. Malicious activities may be as wide-ranging as information theft, DDoS attacks, spreading malware, extortion, stealing computer resources, monitoring network traffic, searching for vulnerable and unprotected computers, spamming, phishing, identity theft, and manipulating games and surveys [48, 54, 55].
The last phase of the bot life cycle is the maintenance and updating of the malware. In this phase, in order to maintain the life of the botnet, the bot-master may need to update the bots for several reasons. First, the bot-master may need to improve the bot binary to evade detection techniques developed by researchers. Second, the bot-master may need to update the bots and add new functionalities to the bot army. For example, the bot-master may use a strategy called server migration, updating the binary to move the bots to a different C&C server. This strategy is very useful for bot-masters to keep their botnets alive. Bot-masters try to keep their botnets invisible and portable by using dynamic DNS to facilitate frequent updates and changes in server location. If a C&C server at a certain IP address is disrupted, the bot-master can easily set up another C&C server with the same name at a different IP address; dynamic DNS providers immediately propagate the changed IP address of the C&C server to the bots. Accordingly, the bots migrate to the new C&C server location and stay alive. After the bots are updated, they must establish new connections with the C&C infrastructure.
1.2 Evolution of Botnet
In the first primitive botnets, the IP address or domain name of the C&C server was hardcoded into the bot binary. All the bots communicated with the assigned C&C server to receive and execute commands from the botmaster and to pass back the harvested results [9, 56]. However, this approach offers no mobility, and it is easy to take down such a botnet by blocking the particular address of the C&C server. Even though the botmaster can obfuscate the address of the C&C server to prevent reverse engineering analysis, the hardcoded address is unchangeable, so it cannot provide any mobility. Therefore, a single alarm or misuse report can be enough to quarantine the C&C server or suspend the whole botnet.
In order to overcome this disadvantage, botmasters developed new techniques that make their C&C infrastructure even more reliable. Figure 1.6 shows a timeline of when each type of bot appeared: IRC bots first appeared on the Internet in 1993, P2P (peer-to-peer) bots were developed from 2003, and HTTP-based bots have been developed since 2006 (see Figure 1.6).
Figure 1.6: Bots timeline
1.2.1 IRC-Based Botnet
In the first-generation botnets, Internet Relay Chat (IRC) was the C&C medium of choice. Public IRC networks have existed since 1993 and IRC is widely deployed. It has a simple text-based command syntax and provides almost real-time communication between the bots and the C&C server. A botnet-infected machine would connect to an IRC server and join a fixed channel to wait for instructions from the botmaster [52, 57]. These bots were not sophisticated enough to hide their behavior. Botmasters usually protected the IRC channel with a password, but researchers could retrieve this information from the malicious binary and join the channel to learn important information about the botnet. Another weakness of this model is that the message format of the IRC protocol is distinctive, making IRC traffic easily distinguishable from normal traffic. Agobot, Spybot, and Sdbot are some popular IRC-based botnets [58].
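The simplicity of this text-based command syntax can be sketched as follows; the nickname, channel name and command keyword below are hypothetical illustrations of the general pattern, not taken from any real botnet.

```python
# Minimal sketch of the plain-text IRC command syntax an early bot
# would use to register and join a password-protected C&C channel.
# All names (nick, channel, command keyword) are hypothetical.

def irc_join_sequence(nick: str, channel: str, password: str) -> list:
    """Raw IRC lines sent after connecting to the server."""
    return [
        "NICK %s" % nick,
        "USER %s 0 * :%s" % (nick, nick),
        "JOIN %s %s" % (channel, password),  # channel protected by a key
    ]

def parse_command(line: str):
    """Extract the botmaster's command from a channel PRIVMSG line."""
    # Example line: ":master!u@host PRIVMSG #chan :.ddos 1.2.3.4"
    if " PRIVMSG " not in line:
        return None
    return line.split(" :", 1)[1] if " :" in line else None

if __name__ == "__main__":
    for raw in irc_join_sequence("bot4711", "#cmd", "s3cret"):
        print(raw)
    print(parse_command(":master!u@host PRIVMSG #cmd :.update http://evil/x"))
```

Because these lines are plain text with a fixed verb-first grammar, a network monitor can recognize them easily, which is exactly the detectability weakness noted above.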
After the relative success of researchers in tackling IRC botnets, the next step taken by cybercriminals in botnet evolution was Peer-to-Peer (P2P) botnet communication.
1.2.2 P2P-Based Botnet
To make their infrastructure more resilient, botmasters started developing peer-to-peer botnets [59] around 2003. In this infrastructure, at the time of infection the bot is provided with an initial list of peers. This list identifies a group of active peers in the botnet and is regularly updated. The peer list can be hidden on the infected machine under an evasive name. For example, the Kelihos/Hlux botnet stored its peer list in the Windows registry under HKEY_CURRENT_USER\Software\Google together with other configuration details. Other botnets combine seeding with binary hardcoding; Nugache [19] is a good instance of such a botnet. Initial seeding is done either by obtaining the list from a small set of default hosts hardcoded into the bot binary, or by pre-seeding the victim machine's Windows Registry before actually running the malware.
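The seeding-and-refresh cycle described above can be sketched as follows; the seed addresses, the size cap and the merge policy are hypothetical illustrations, not drawn from any specific bot family.

```python
# Sketch of P2P botnet peer-list seeding and refresh. Addresses are
# reserved documentation IPs (RFC 5737), used here as placeholders;
# a real bot would hide the list, e.g. in the Windows registry.

HARDCODED_SEEDS = ["198.51.100.7:8001", "203.0.113.9:8001"]

def bootstrap_peers() -> set:
    """Initial seeding from hosts hardcoded in the bot binary."""
    return set(HARDCODED_SEEDS)

def refresh_peers(known: set, learned: list, cap: int = 50) -> set:
    """Merge peer addresses learned from other nodes, bounded in size
    so the list stays cheap to store and exchange."""
    merged = set(known) | set(learned)
    return set(sorted(merged)[:cap])

if __name__ == "__main__":
    peers = bootstrap_peers()
    peers = refresh_peers(peers, ["192.0.2.15:8001"])
    print(sorted(peers))
```

Since every node carries such a list, any peer can relay commands, which is why the following paragraph notes that locating a C&C server in a P2P botnet is so difficult.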
Originally, P2P networks were developed to assist file sharing among peer nodes, but they have since been misused for botnet C&C communication. Because commands can be dispersed through any node in the P2P network, C&C server detection is very difficult. In addition, P2P traffic classification is hard for gateway security devices, so filtering and detecting P2P traffic is also difficult. Sinit was the first generation of P2P botnets, followed by Phatbot [17], Storm [22] and Nugache [19].
1.2.3 HTTP-Based Botnet
The evolution to HTTP bots began with "exploit kits" (e.g., the Zeus and SpyEye botnets), developed mainly by Russian cybercriminals [60]. Because HTTP is a widely used protocol on the Internet, most network traffic on HTTP-related ports is allowed through firewalls [61]. This makes HTTP a convenient channel for launching network attacks. In traditional HTTP-based botnets, the attacker simply places the commands in a data file on a C&C server. The bots then frequently connect to their C&C servers with a predefined delay. Today, bots not only receive commands but also have the ability to gather personal data from the infected machine. There are several well-known HTTP-based botnets; for example, Zeus [62] is an HTTP-based botnet designed mainly to steal financial data. The bots of this botnet periodically connect to the C&C server with a URL such as http:// /gate.php [62].
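The fixed-delay polling behavior described above can be sketched as follows; the C&C URL, delay and command value are hypothetical, and the HTTP fetch is stubbed out so the sketch stays self-contained and runnable offline.

```python
# Sketch of the traditional HTTP-bot polling loop: fetch a command
# file from the C&C at a predefined delay. The fetch is a stub; the
# URL and the "SLEEP" command are hypothetical placeholders.

C2_URL = "http://c2.example/gate.php"   # placeholder, not a real C&C
POLL_DELAY = 60                         # seconds between beacons

def fetch_command(url: str) -> str:
    """Stub standing in for an HTTP GET of the command file."""
    return "SLEEP"                      # pretend the C&C said "do nothing"

def beacon_times(start: float, n: int, delay: int = POLL_DELAY) -> list:
    """The fixed-interval schedule produced by a predefined delay."""
    return [start + i * delay for i in range(n)]

if __name__ == "__main__":
    print(fetch_command(C2_URL))
    print(beacon_times(0.0, 3))         # [0.0, 60.0, 120.0]
```

The regular spacing returned by `beacon_times` is precisely the kind of periodic signal that later chapters of this dissertation exploit for detection.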
In addition, in the architecture of HTTP-based botnets, instead of connecting to the C&C server directly, botmasters have developed a new technique that gives them more reliability in their C&C infrastructure, called domain-flux. Botnets of this new generation generate a large number of pseudo-random domain names with a Domain Generation Algorithm (DGA) to protect their C&C infrastructures from takedowns [1]. In order to contact the botmaster, each bot may use the DGA to produce a list of candidate C&C domains. The infected machine then attempts to resolve these domain names by sending DNS queries until it gets a successful answer from a malicious domain name registered in advance by the botmaster. This domain-flux strategy is an effective technique for evading detection by monitoring systems, because if one or more C&C domain names are identified and taken down, the bots will relocate the C&C domain via DNS queries over the next set of automatically generated domains [63, 64]. BlackEnergy was the first generation of HTTP-based botnets, followed by Conficker, Zeus, Bobax, etc. (see Figure 1.6 and Table 1.1).
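A DGA of the kind described above can be sketched as follows; the seed string, the hashing scheme and the single .com TLD are hypothetical illustrations of the general idea, not a reconstruction of any particular bot's algorithm.

```python
import hashlib
from datetime import date

# Hypothetical DGA sketch: derive a daily list of candidate C&C
# domains from a shared secret seed and the current date. Both the
# bot and the botmaster can compute the same list offline; the
# botmaster registers one domain in advance, and the bot resolves
# candidates until one answers.

def generate_domains(seed: str, day: date, count: int = 10) -> list:
    domains = []
    for i in range(count):
        data = ("%s-%s-%d" % (seed, day.isoformat(), i)).encode()
        digest = hashlib.sha256(data).hexdigest()
        domains.append(digest[:12] + ".com")   # 12 hex chars + TLD
    return domains

if __name__ == "__main__":
    for d in generate_domains("s3cretseed", date(2015, 11, 1), count=3):
        print(d)
    # A real bot would now issue DNS queries for each candidate
    # (e.g. via socket.gethostbyname) until one resolves.
```

Because the list changes every day, blacklisting yesterday's domains is ineffective, while the bot only needs the botmaster to have registered a single candidate, which is the resilience property the paragraph above describes.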
1.3 Motivation and Challenges
As shown in Figure 1.6, we have seen an evolution in the offensive methods of cybercriminals, from IRC-based bots to HTTP-based bots. In recent years, HTTP-based botnets have become more widespread than before (see Figure 1.7) and have caused enormous damage to many government organizations and industries through stealing personal information, DDoS attacks, spamming, etc.
Currently, cybercriminals prefer HTTP botnets over other types, because HTTP-based botnets have the following advantages. First, an HTTP-based botnet uses the client-server model to communicate with and command and control (C&C) its bots, so it can be built easily compared with a P2P botnet. Second, because HTTP is a common, widely used protocol, using HTTP to build the botnet communication channel lets the C&C flows submerge in the huge volume of Internet web traffic; bots can therefore hide their communication flows among normal HTTP flows. Third, since HTTP is a common service that firewalls rarely block, HTTP botnets enjoy a more flexible environment compared with others.
Figure 1.7: Bots trends
Meanwhile, researchers working on HTTP botnet detection face the following difficulties and challenges. First, HTTP botnets retrieve commands through legitimately constructed HTTP requests, so botnet communication traffic is similar to normal traffic; it is therefore difficult to distinguish botnet behavior from normal user behavior. Second, because the amount of traffic generated by an HTTP botnet is very small within a large network, it is not easy to find the malicious behavior in the huge volume of traffic. Third, cybercriminals are constantly improving the resilience of their bots against takedown, takeover and detection efforts, using ever more sophisticated techniques to help their bots evade detection. For example, some botnets have recently started to use fluxing techniques such as DGA, domain-flux and fast-flux to hide from detection, and there is some research on this topic [1, 65, 66]. Finally, the number of researchers focusing on the detection of HTTP botnets is relatively low compared to those studying IRC-based and P2P botnets. These challenges are the main motivation for the research work in this dissertation.
1.4 The goal of the dissertation
In Section 1.3, we identified the research motivation and challenges for HTTP botnet detection. Figure 1.7 illustrates the trend that cybercriminals prefer HTTP botnets over other types, because of the advantages mentioned in Section 1.3. Moreover, botnet developers are constantly creating new techniques to improve their bots and avoid detection by security researchers. In recent years, new-generation HTTP botnets have tended to use techniques such as DGA (Domain Generation Algorithm), domain-flux, or fast-flux to avoid detection. Some botnets use the domain-flux technique to evade blacklisting; some use the fast-flux technique to hide the true location of their servers. Therefore, the main goal of this dissertation is to build solutions for detecting HTTP botnets whose operators use evasion techniques such as DGA, domain-flux or fast-flux. To achieve this goal, the dissertation solves the following three main problems:
(1) To detect machines infected with domain-flux or DGA-based botnets inside an enterprise network or other monitored network;
(2) To detect C&C servers of botnets that use domain-flux or DGA-based detection-evasion techniques;
(3) To detect botnets based on malicious fast-flux service networks.
These three problems are addressed in Chapter 3, Chapter 4 and Chapter 5 of this dissertation, respectively.
1.5 Contributions and Outline of the Dissertation
1.5.1 Contributions
Following the research goals stated in Section 1.4, this dissertation makes several important contributions to the field of botnet detection, including new approaches and the investigation of previously unexplored areas of botnet-detection research. The main contributions of this dissertation are summarized as follows:
In Chapter 3, a new method based on analyzing the similar periodicity of domain queries is proposed to detect DGA-bot-infected machines in an enterprise