VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH UNIVERSITY OF TECHNOLOGY COMPUTER SCIENCE AND ENGINEERING FACULTY
GRADUATION THESIS
Machine Learning Approaches to Cyber Security
Department: Computer science
Nguyen Duc Kien 1552181
VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
FACULTY: Computer Science & Engineering
DEPARTMENT: Computer Systems & Networks
GRADUATION THESIS ASSIGNMENT
Note: Students must attach this sheet to the first page of the thesis report.
FULL NAME: Huỳnh Kiến Văn, Student ID: 1552423
Nguyễn Đức Kiên, Student ID: 1552181
MAJOR: Computer Science CLASS:
1. Thesis title: Machine Learning Approaches for Cyber Security.
2. Tasks (requirements on content and initial data):
- Research Machine Learning and its applications
- Research topics related to applying Machine Learning to cyber security
- Propose a way to create an IDS (Intrusion Detection System) using Machine Learning
- Design the system described above
- Implement the system using any programming language(s) and technologies, and show that they are suitable for the solution
- Demonstrate the system to make sure it runs properly and correctly
3. Thesis assignment date: 30/08/2021
4. Completion date: 31/12/2021
5. Supervisor: Dr. Nguyễn Đức Thái
The content and requirements of the thesis have been approved by the Department.
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
28 December 2021
THESIS DEFENSE EVALUATION FORM
(For the supervisor)
1. Student names: Huỳnh Kiến Văn, Student ID: 1552423
Nguyễn Đức Kiên, Student ID: 1552181
Major (specialization): Computer Science
2. Topic: Machine Learning Approaches for Cyber Security
3. Supervisor: Nguyễn Đức Thái
4. Overview of the report:
6. Main strengths of the thesis:
• The students completed the desired features of the thesis and demonstrated them.
• The students applied machine learning algorithms to analyze network traffic and cybersecurity data.
7. Main shortcomings of the thesis:
• Many parts of the report are short and lack justification.
• The students evaluated the results they obtained; however, the evaluation was too short and no discussion was presented.
8. Recommendation: Approved for defense [x] Revisions required before defense [ ] Not approved for defense [ ]
9. Three questions the students must answer before the Committee:
10. Overall assessment (in words: excellent, good, average): Score:
Huỳnh Kiến Văn 8.2/10
Nguyễn Đức Kiên 7/10
Signature (with full name)
Nguyễn Đức Thái
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
27 December 2021
THESIS DEFENSE EVALUATION FORM
(For the supervisor/examiner)
1. Student names: Huynh Kien Van - 1552423, Nguyen Duc Kien - 1552181
Major (specialization): Computer Science
2. Topic: MACHINE LEARNING APPROACHES TO CYBER SECURITY
3. Supervisor/examiner: Nguyễn Lê Duy Lai
7. Main shortcomings of the thesis:
However, the presented topic is still limited, including the inability to support a complex prediction model. The title of this thesis seems very broad, and the authors need to give a concise scope of the problems and of the approach to solving them. The network fundamentals in Section 2.1 remain very elementary and may not be necessary in this context. The approach also has limitations; for example, encrypted packets are not processed by most intrusion detection devices.
8. Recommendation: Approved for defense [ ] Revisions required before defense [ ] Not approved for defense [ ]
9. Three questions the students must answer before the Committee:
a. Attackers can use techniques such as fragmentation, avoiding defaults, coordinated low-bandwidth attacks, address spoofing/proxying, and pattern change evasion to evade an IDS. Do you think your IDS with an integrated ML module can be immune to these evasion techniques?
VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING
BACHELOR OF ENGINEERING THESIS
MACHINE LEARNING APPROACHES TO CYBER SECURITY
COMPUTER SCIENCE COMMITTEE
Advisor: Prof. Nguyen Duc Thai
Examiner: Prof. Nguyen Le Duy Lai
Students: Huynh Kien Van - 1552423
Nguyen Duc Kien - 1552181
Ho Chi Minh, 2021
We guarantee that the work in this dissertation was completed in accordance with the University’s regulations and that it has not been submitted to any other academic institution. The work is our own, unless otherwise stated in the text by a particular reference.
First and foremost, we would like to express our special gratitude to our supervisor, Professor Nguyen Duc Thai, for his never-ending grace. His guidance and valuable knowledge helped us throughout the writing of this thesis. Professor Nguyen Duc Thai also supported us with his expertise and encouragement as we worked on the thesis.
We also extend our gratitude to our families and friends, who have always been beside us in hard moments and encouraged us throughout this thesis and our university life.
In this thesis, we propose a machine learning-based approach to classifying live network traffic. To increase accuracy and reduce false-negative cases, we apply deep learning models. We build two RNN models, LSTM and GRU, to classify network traffic as malicious or normal.
Technically, we build an RNN model that runs in parallel with the IDS, combine the two results, and decide which action to take according to a decision table.
The dataset used in this thesis mainly comes from MTA-KDD'19, which was created by the project of the same name. To enrich our data, we also use the ISCX2012 [1] and USTC-TFC2016 [2] datasets, preprocessed following the strategy of the MTA-KDD'19 work.
Overall, the LSTM model performs better than the GRU model. In accuracy, the LSTM model even exceeds the MTA-KDD'19 work, which used a traditional neural network: 99.8% compared to 99.74%. In precision, the LSTM reaches 98.3% and the GRU reaches 99.5%. Because our goal is to eliminate false negatives, we also report recall: 99.75% for the GRU model and 99.8% for the LSTM model.
1 INTRODUCTION
1.1 Overview
1.2 Objective and scope
1.3 Thesis structure
2 BACKGROUND KNOWLEDGE
2.1 Fundamental of network
2.1.1 Networking concept
2.1.2 Reference models
2.2 Intrusion Detection System
2.3 Word Embedding
2.4 Deep Neural Network
2.4.1 Recurrent Neural Network
2.4.2 Long Short Term Memory
3 LITERATURE REVIEW
3.1 Deep learning-based approach in improvement signature of IDSs
3.2 Deep learning-based approaches for classifying network traffic
4 PROPOSED APPROACHES
4.1 Problem statements
4.2 Proposed approach
4.3 Design
5 DATASET
6 IMPLEMENTATION
6.1 Data pre-processing
6.1.1 Explaining features of MTA-KDD’19 dataset
6.1.2 Data processing
6.2 Prediction module
7.1 Evaluation methods
7.1.1 Data preparation
7.1.2 Confusion matrix
7.1.3 Accuracy
7.1.4 Precision
7.2 Model evaluation
Appendices
List of Figures
2.1 The OSI model
2.2 The TCP model
2.3 Convolution Neural Network Architecture
2.4 General Neural Network
2.5 LSTM
4.1 Traffic validator architecture
5.1 Formula
5.2 functions
6.1 Data processing task
6.2 The full network model
List of Tables
4.1 Decision table for IDS and Prediction model
5.1 Summary of benign and malware traffic in USTC-TFC2016 dataset
7.1 Confusion matrix with normalize
7.2 Model evaluation
7.3 Comparing models
List of Abbreviations
IDS Intrusion Detection System
TCP/IP Transmission Control Protocol/Internet Protocol
INTRODUCTION
Contents
1.1 Overview
1.2 Objective and scope
1.3 Thesis structure
Chapter 1
Many parts of the world have changed as a result of our increased reliance on technology. The more technology advances, the more valuable data and information become. Furthermore, as computer networks become more complex, the number of cybersecurity risks increases, with a wide range of sophistication, making it more difficult for professionals to detect and defend against the dangers hidden in billions of network traffic flows.
Cyber attacks can result in data breaches and significant financial losses. The World Economic Forum¹ (WEF) estimates the economic cost of cybercrime at $3 trillion worldwide in 2015, rising to $6 trillion globally in 2021.
Threat detection and prevention mostly depend on an Intrusion Detection System (IDS) or an Intrusion Prevention System (IPS), which analyzes network traffic for signatures that match known cyber attacks. The line between IDS and IPS has become increasingly blurred. Currently, signature-based IDSs are the most common, since they are reliable and used by many organizations. That said, traditional signature-based IDSs are prone to false negatives (FN) and cannot identify novel attacks such as zero-day exploits², because they identify attacks based on known attack signatures. A false negative is the most serious and dangerous state: the IDS identifies an activity as acceptable when the activity is actually an attack³. In other words, a false negative occurs when the IDS fails to catch an attack.
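To make the false-negative notion concrete, the following minimal sketch counts the four outcomes of a binary detector and derives recall and the false-negative rate from them. The function names and the sample verdicts are invented for illustration; they are not results or code from this thesis.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN, where True means 'attack'."""
    counts = Counter()
    for is_attack, flagged in zip(actual, predicted):
        if is_attack and flagged:
            counts["TP"] += 1
        elif is_attack and not flagged:
            counts["FN"] += 1  # an attack accepted as benign: the dangerous case
        elif flagged:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts

def recall(counts):
    """Share of real attacks that were detected: TP / (TP + FN)."""
    attacks = counts["TP"] + counts["FN"]
    return counts["TP"] / attacks if attacks else 0.0

def false_negative_rate(counts):
    """Share of real attacks that were missed: FN / (TP + FN)."""
    attacks = counts["TP"] + counts["FN"]
    return counts["FN"] / attacks if attacks else 0.0

# Illustrative data only: three attacks, one of which the detector misses.
actual = [True, True, True, False, False]
predicted = [True, False, True, False, True]
counts = confusion_counts(actual, predicted)
```

Recall and the false-negative rate always sum to 1 when at least one attack is present, so reducing false negatives is the same goal as raising recall.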
Over the last few years, the rapid development of machine learning has contributed greatly to automatic behavioral anomaly detection. In this thesis, we propose a machine learning approach to this field. By applying it, we aim to build an intelligent system that classifies threats and threat actors, helping to detect potential attacks, advance the field of computer security, and provide better protection.
In this thesis, we want to apply machine learning approaches to analyze millions of live network traffic flows. Thanks to artificial intelligence automation, specialists may obtain a deeper understanding of each situation, which improves the performance and reliability of network security.
With the Traffic validator, we expect to classify incoming traffic into benign and malicious classes. The network traffic has already been filtered through a rule-based IDS such as Snort, and our model is an add-on to the IDS that aims to eliminate the false negatives of the rule-based IDS.
¹ Available at https://reports.weforum.org
² A zero-day attack (also referred to as Day Zero) is an attack that exploits a potentially serious software security weakness that the vendor or developer may be unaware of.
³ Available at https://owasp.org/www-community/controls/Intrusion_Detection
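The add-on idea described above, a model that re-checks traffic the rule-based IDS has already inspected, can be sketched as a small combination rule. This is only an illustration under stated assumptions: the `Verdict` type, the `decide` function, and the three action names are hypothetical, and the policy shown is not the decision table used in the thesis.

```python
from enum import Enum

class Verdict(Enum):
    BENIGN = "benign"
    MALICIOUS = "malicious"

def decide(ids_verdict: Verdict, model_verdict: Verdict) -> str:
    """Hypothetical policy: a flow is flagged if either detector flags it."""
    if ids_verdict is Verdict.MALICIOUS:
        # The rule-based IDS matched a known signature.
        return "alert"
    if model_verdict is Verdict.MALICIOUS:
        # The IDS passed the flow but the model did not:
        # a candidate false negative of the rule-based IDS.
        return "alert-ml-only"
    return "pass"
```

Under this assumed policy the model can only add alerts, never suppress an IDS alert, which matches the stated aim of reducing false negatives rather than false positives.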
malicious network traffic module with the details of processing of raw datasets.
BACKGROUND KNOWLEDGE
Contents
2.1 Fundamental of network
2.1.1 Networking concept
2.1.2 Reference models
2.2 Intrusion Detection System
2.3 Word Embedding
2.4 Deep Neural Network
2.4.1 Recurrent Neural Network
2.4.2 Long Short Term Memory
Chapter 2
To understand anomaly detection in networks, we must have a good understanding of basic network concepts. Therefore, the first part of this chapter discusses networking concepts and types of networks.
A network is a complex interacting system composed of many individual entities. Two or more computer systems that can send data to or receive data from each other through a medium that they share and access are said to be connected. The behavior of the individual entities contributes to the ensemble behavior of the entire network. In a computer network, there are generally three communicating entities:
• Users: Humans who perform various activities on the network, such as browsing Web pages and shopping.
• Hosts: Computers, each of which is identified by a unique address, such as an IP address.
• Processes: Instances of executable programs used in a client-server architecture. Client processes request a network service from one or more servers, whereas server processes provide the requested services to the clients. An example client is a Web browser that requests pages from a Web server, which responds to the requests.
To simplify network engineering, the whole networking concept is divided into multiple layers. Each layer is responsible for a particular task and is independent of all other layers. As a whole, however, almost all networking tasks depend on all of these layers. Layers share data with one another, and they depend on each other only to take input and send output.
Layered architecture
In the layered architecture of a network model, a whole network process is divided into small tasks. Each small task is assigned to a particular layer, which is dedicated to processing that task only. The rules and conventions used at layer n are collectively referred to as the layer-n protocol. A protocol represents an agreement on how communication takes place between two parties: it is a set of rules that governs the meaning and format of the packets or messages exchanged between two or more peer entities, as well as the actions taken when a message is transmitted or received and in certain other situations. Computer networks use protocols extensively.
In a layered communication system, a layer on one host deals with the task done, or to be done, by its peer layer at the same level on the remote host. A task is initiated either by the lowest layer or by the topmost layer. If the task is initiated by the topmost layer, it is passed to the layer below for further processing; that layer processes the task and passes it on to the next lower layer, and so on. If the task is initiated by the lowest layer, the reverse path is taken. There are two well-known layered reference models: the Open Systems Interconnection (OSI) reference model and the Transmission Control Protocol/Internet Protocol (TCP/IP) reference model.
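The downward and upward paths described above can be illustrated with a toy encapsulation sketch. The four layer names and the bracketed text headers are simplifications invented for this example; real protocols use binary headers with many fields.

```python
# Simplified stack, listed from the topmost layer to the lowest one (assumed names).
LAYERS = ["application", "transport", "network", "link"]

def send(payload: bytes) -> bytes:
    """Pass data down the stack: each layer prepends its own header."""
    for layer in LAYERS:
        payload = f"[{layer}]".encode() + payload
    return payload

def receive(frame: bytes) -> bytes:
    """Pass data up the stack: each layer strips the header added by its peer."""
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        if not frame.startswith(header):
            raise ValueError(f"missing {layer} header")
        frame = frame[len(header):]
    return frame

frame = send(b"GET /")
message = receive(frame)
```

Each layer on the receiving host only reads the header written by the layer at the same level on the sending host, which is the peer-to-peer relationship the text describes.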
The ISO¹ OSI Reference Model
The Open Systems Interconnection model (OSI model²) is a conceptual model created by the International Organization for Standardization that characterizes and standardizes the communication functions of a telecommunications or computing system regardless of its underlying internal structure and technology. Its objective is to ensure that different communication systems can communicate with each other using standard communication protocols.
The model divides the flow of data in a communication system into seven abstraction layers, from the physical implementation of transferring bits through a communications channel to the highest-level representation of data in a distributed application. Each intermediate layer provides a class of functionality to the layer above it while receiving service from the layer below. Standard communication protocols implement these classes of functionality in software.
• Application layer: This layer provides users with access to the OSI environment and to distributed information services.
• Presentation layer: This layer provides independence to application processes from differences in data representation (syntax).
• Session layer: This layer provides the control structure for communication between applications. Establishing, managing, and terminating connections (sessions) between cooperating applications are its major responsibilities.
• Transport layer: This layer supports reliable and transparent transfer of data between two end points. It also supports end-to-end error recovery and flow control.
• Network layer: This layer provides upper layers with independence from the data transmission and switching technologies used to connect systems. It is also responsible for the establishment, maintenance, and termination of connections.
• Data link layer: This layer is responsible for reliably transferring information across the physical link. It transfers blocks (frames) with the necessary synchronization, error control, and flow control.
• Physical layer: This layer is responsible for transmitting a stream of unstructured bits over the physical medium. It must deal with the mechanical, electrical, functional, and procedural issues of accessing the physical medium.
¹ International Organization for Standardization
² Available at https://www.cloudflare.com/
Figure 2.1: The OSI reference model
TCP/IP³ Reference Model
The functions of the TCP/IP architecture are divided into five layers. The functions of layers 1 and 2 are supported by bridges, and layers 1 through 3 are implemented by routers.
• The application layer: The application layer is responsible for supporting network applications. As computer and networking technologies change, the types of applications supported by this layer also change. Each application, such as file transmission, has its own specific module. The application layer includes many protocols for diverse purposes, such as the Hypertext Transfer Protocol (HTTP) for the Web, the Simple Mail Transfer Protocol (SMTP) for electronic mail, and the File Transfer Protocol (FTP) for file transfer.
• The transport layer: The transport layer carries application layer messages between the client and the server. It offers two transport protocols, TCP and UDP (User Datagram Protocol). TCP offers a guaranteed, connection-oriented service for delivering application layer messages to the destination. TCP also segments long messages into shorter parts and provides a congestion control mechanism that lets a source limit its transmission rate when the network is congested. The UDP [...]
³ Available at https://docs.oracle.com/cd/E19683-01/806-4075/ipov-10/index.html
[...] in the IP datagram that dictate how end systems and routers operate. This layer also comprises routing protocols for determining the paths along which packets are sent between a source and a destination. In a source host, the TCP or UDP protocol passes a transport layer segment and a destination address to the next layer protocol, IP. The IP layer is then in charge of delivering the segment to the destination host. When the packet arrives at the destination host, IP passes the segment to the transport layer within that host. The internet layer thus routes a packet through a series of packet switches between a source and a destination.
• The link layer: When forwarding a packet from one node (host or packet switch) to the next, the network layer relies on the link layer’s services. In particular, at each node the network layer passes the datagram down to the link layer, which delivers it to the next node along the path. At the next node, the link layer passes the datagram up to the network layer; how this is done depends on the link layer protocol used on the link.
• The physical layer: This layer is responsible for moving the individual bits of a frame from one node to the next. Its protocols depend on the link as well as on the actual transmission medium (e.g., coaxial cable or twisted pair). Ethernet, for example, provides distinct physical layer protocols for twisted-pair copper wire, coaxial cable, and optical fiber, and each of them moves bits across the link in a different way.
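The layering above becomes visible when a raw packet is decoded: the IP datagram header comes first, and the transport layer ports follow it. The sketch below parses the fixed 20-byte IPv4 header and the first four bytes of a TCP or UDP header with Python's standard `struct` module. The sample packet is built by hand for illustration, uses documentation-range addresses, and carries a zero checksum that is not validated; the `parse_ipv4` helper is a hypothetical name.

```python
import socket
import struct

# Fixed 20-byte part of the IPv4 header, in network byte order.
IPV4_HEADER = struct.Struct("!BBHHHBBH4s4s")

def parse_ipv4(packet: bytes) -> dict:
    """Decode the IPv4 header and, for TCP/UDP, the source and destination ports."""
    (ver_ihl, _tos, total_len, _ident, _frag,
     ttl, proto, _checksum, src, dst) = IPV4_HEADER.unpack_from(packet)
    header_len = (ver_ihl & 0x0F) * 4  # IHL is counted in 32-bit words
    info = {
        "version": ver_ihl >> 4,
        "header_len": header_len,
        "total_len": total_len,
        "ttl": ttl,
        "protocol": proto,  # 6 = TCP, 17 = UDP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }
    if proto in (6, 17):
        # Both TCP and UDP start with 16-bit source and destination ports.
        info["src_port"], info["dst_port"] = struct.unpack_from("!HH", packet, header_len)
    return info

# Hand-built sample: IPv4 header followed by the two port fields only.
sample = IPV4_HEADER.pack(
    0x45, 0, 24, 0, 0, 64, 6, 0,
    socket.inet_aton("192.0.2.1"), socket.inet_aton("198.51.100.7"),
) + struct.pack("!HH", 49152, 80)
info = parse_ipv4(sample)
```

Fields such as addresses, protocol number, and ports are the kind of per-packet attributes from which traffic features are usually derived, although the exact features used in this thesis are described in the implementation chapter rather than here.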
Figure 2.2: The TCP/IP architecture compared to the OSI architecture
An IDS is a device or software application that monitors network traffic for suspicious activity and threats across an entire networking environment and issues alerts when such activity is discovered. An IDS typically protects an entire network from attacks such as scanning attacks, asymmetric routing, buffer overflow attacks, protocol-specific attacks, malware, and, as a special case, traffic flooding. An IDS provides defense at the network layer (unlike a WAF⁴, which provides defense at the application layer). There are two main types of intrusion detection systems: the Network Intrusion Detection System (NIDS) and the Host Intrusion Detection System (HIDS).
An IDS is deployed at strategic points throughout the network to cover the places where traffic is most likely to be vulnerable to attack. It passively observes the network traffic passing through the points where it is deployed.
At the most basic level, a NIDS examines network traffic, whereas a HIDS examines actions and data on the host devices.
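The kind of signature check that a signature-based NIDS applies to observed traffic can be sketched as a byte-pattern scan over packet payloads. The two signatures below are made-up examples for illustration and are not taken from Snort or any other real rule set; the `match_signatures` helper is likewise hypothetical.

```python
# Hypothetical signature set: name -> byte pattern to look for in a payload.
SIGNATURES = {
    "path-traversal": b"../..",
    "sql-injection": b"' OR '1'='1",
}

def match_signatures(payload: bytes) -> list:
    """Return the sorted names of all signatures found in the payload."""
    return sorted(name for name, pattern in SIGNATURES.items() if pattern in payload)

alerts = match_signatures(b"GET /../../etc/passwd HTTP/1.1")
clean = match_signatures(b"GET /index.html HTTP/1.1")
```

A payload that matches no stored pattern produces no alert, even if it is malicious, which is exactly how novel attacks become false negatives for a purely signature-based detector.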
An IDS operates through a set of rules or signatures. IDSs often look for known attack signatures or for abnormal deviations from predetermined standards. These anomalous network traffic patterns are subsequently forwarded up the stack to the protocol and application layers of the OSI (Open Systems Interconnection) model for further investigation. These
⁴ Web Application Firewall