Machine learning approaches to cyber security



VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

HO CHI MINH UNIVERSITY OF TECHNOLOGY COMPUTER SCIENCE AND ENGINEERING FACULTY

GRADUATION THESIS

Machine Learning Approaches to Cyber Security

Department: Computer science

Nguyen Duc Kien 1552181


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY    SOCIALIST REPUBLIC OF VIETNAM

Independence - Freedom - Happiness

UNIVERSITY OF TECHNOLOGY

FACULTY: Computer Science & Engineering    GRADUATION THESIS ASSIGNMENT

DEPARTMENT: Systems & Computer Networks    Note: Students must attach this sheet to the first page of the thesis report

FULL NAME: Huỳnh Kiến Văn    Student ID: 1552423

Nguyễn Đức Kiên    Student ID: 1552181

MAJOR: Computer Science    CLASS:

1 Thesis title: Machine Learning Approaches for Cyber Security.

2 Tasks (requirements on content and initial data):

- Research machine learning and its applications

- Research topics related to applying machine learning to cyber security

- Propose a way to build an IDS (Intrusion Detection System) using machine learning

- Design the system described above

- Implement the system using any programming language(s) and technologies, and show that they are suitable for the solution

- Demonstrate the system to make sure it runs properly and correctly

3 Date of assignment: 30/08/2021

4 Date of completion: 31/12/2021

5 Supervisor's full name: Dr. Nguyễn Đức Thái

The content and requirements of the thesis have been approved by the Department


UNIVERSITY OF TECHNOLOGY    SOCIALIST REPUBLIC OF VIETNAM

FACULTY OF COMPUTER SCIENCE & ENGINEERING    Independence - Freedom - Happiness

28 December 2021

THESIS DEFENSE EVALUATION FORM

(For the supervisor)

1 Student full name: Huỳnh Kiến Văn    Student ID: 1552423

Nguyễn Đức Kiên    Student ID: 1552181

Major (specialization): Computer Science

2 Topic: Machine Learning Approaches for Cyber Security

3 Supervisor's full name: Nguyễn Đức Thái

4 General overview of the report:

6 Main strengths of the thesis:

• The students completed the desired features of the thesis and demonstrated them

• The students applied machine learning algorithms to analyze network traffic and cybersecurity data

7 Main shortcomings of the thesis:

• Many parts of the report are short and lack justification

• The students evaluated the obtained results; however, the evaluation was too short and no discussion was presented

8 Recommendation: Approved for defense [x]    Needs additions before defense [ ]    Not approved for defense [ ]

9 Three questions the students must answer before the Committee:

10 Overall assessment (in words: excellent, good, average):    Score:

Huỳnh Kiến Văn 8.2/10

Nguyễn Đức Kiên 7/10

Signature (full name)

Nguyễn Đức Thái


UNIVERSITY OF TECHNOLOGY    SOCIALIST REPUBLIC OF VIETNAM

FACULTY OF COMPUTER SCIENCE & ENGINEERING    Independence - Freedom - Happiness

27 December 2021

THESIS DEFENSE EVALUATION FORM

(For the supervisor/reviewer)

1 Student full name: Huynh Kien Van - 1552423, Nguyen Duc Kien - 1552181

Major (specialization): Computer Science

2 Topic: MACHINE LEARNING APPROACHES TO CYBER SECURITY

3 Supervisor's/reviewer's full name: Nguyễn Lê Duy Lai

7 Main shortcomings of the thesis:

However, the presented topic is still limited, including the inability to support a complex prediction model. The title of this thesis seems very broad, and the authors need to give a concise scope of the problems and the approach to solutions. The network fundamentals in Section 2.1 remain very elementary and may not be necessary in this context. There are some limitations to this approach; for example, encrypted packets are not processed by most intrusion detection devices.

8 Recommendation: Approved for defense / Needs additions before defense / Not approved for defense

9 Three questions the students must answer before the Committee:

a. Attackers can use techniques to evade an IDS, such as fragmentation, avoiding defaults, coordinated low-bandwidth attacks, address spoofing/proxying, and pattern change evasion. Do you think your IDS with the integrated ML module can be immune to these evasion techniques?


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

FACULTY OF COMPUTER SCIENCE & ENGINEERING

BACHELOR OF ENGINEERING THESIS

MACHINE LEARNING APPROACHES TO CYBER SECURITY

COMPUTER SCIENCE COMMITTEE

Advisor: Prof. Nguyen Duc Thai

Examiner: Prof. Nguyen Le Duy Lai

Students: Huynh Kien Van - 1552423

Nguyen Duc Kien - 1552181

Ho Chi Minh, 2021


We guarantee that the work in this dissertation was completed in accordance with the University's regulations and that it has not been submitted to any other academic institution. The work is our own, unless otherwise stated in the text by a specific reference.


First and foremost, we would like to express our special thanks and gratitude to our supervisor, Professor Nguyen Duc Thai, for his unending kindness. His guidance and valuable knowledge helped us throughout the writing of this thesis. Professor Nguyen Duc Thai also supported us with both his expertise and his encouragement as we worked on the thesis.

We also extend our gratitude to our families and friends, who have always been beside us in hard moments and encouraged us throughout this thesis and our university life.


In this thesis, we propose a machine learning-based approach to classifying live network traffic. To increase accuracy and reduce False-Negative cases, we apply deep learning: we build two RNN models, an LSTM and a GRU, to classify network traffic as malicious or normal.

Technically, we build an RNN model that runs in parallel with the IDS, combine the two results, and decide which action to take according to a decision table.

The dataset used in this thesis comes mainly from MTA-KDD'19, which was created by the project of the same name. To enrich our data, we also use the ISCX2012 [1] and USTC-TFC2016 [2] datasets, preprocessed following the strategy of the MTA-KDD'19 work.

Overall, the LSTM model performs better than the GRU model. In accuracy, the LSTM model even exceeds the MTA-KDD'19 work, which used a traditional neural network: 99.8% compared to 99.74%. In precision, the LSTM reaches 98.3% and the GRU reaches 99.5%. Our goal is to eliminate False-Negatives, so we also report recall: 99.75% for the GRU model and 99.8% for the LSTM model.


1 INTRODUCTION
1.1 Overview
1.2 Objective and scope
1.3 Thesis structure

2 BACKGROUND KNOWLEDGE
2.1 Fundamental of network
2.1.1 Networking concept
2.1.2 Reference models
2.2 Intrusion Detection System
2.3 Word Embedding
2.4 Deep Neural Network
2.4.1 Recurrent Neural Network
2.4.2 Long Short Term Memory

3 LITERATURE REVIEW
3.1 Deep learning-based approach in improvement signature of IDSs
3.2 Deep learning-based approaches for classifying network traffic

4 PROPOSED APPROACHES
4.1 Problem statements
4.2 Proposed approach
4.3 Design

5 DATASET

6 IMPLEMENTATION
6.1 Data pre-processing
6.1.1 Explaining features of MTA-KDD'19 dataset
6.1.2 Data processing
6.2 Prediction module

7.1 Evaluation methods
7.1.1 Data preparation
7.1.2 Confusion matrix
7.1.3 Accuracy
7.1.4 Precision
7.2 Model evaluation

Appendices


List of Figures

2.1 The OSI model
2.2 The TCP model
2.3 Convolution Neural Network Architecture
2.4 General Neural Network
2.5 LSTM
4.1 Traffic validator architecture
5.1 Formula
5.2 functions
6.1 Data processing task
6.2 The full network model


List of Tables

4.1 Decision table for IDS and Prediction model
5.1 Summary of benign and malware traffic in USTC-TFC2016 dataset
7.1 Confusion matrix with normalize
7.2 Model evaluation
7.3 Comparing models


List of Abbreviations

IDS Intrusion Detection System

TCP/IP Transmission Control Protocol/Internet Protocol


INTRODUCTION

Contents

1.1 Overview
1.2 Objective and scope
1.3 Thesis structure


Chapter 1

Many parts of the world have changed as a result of our increased reliance on technology. The more technology advances, the more valuable data and information become. Furthermore, as computer networks become more complex, the number of cyber security risks increases, with a wide range of sophistication, making it more difficult for professionals to detect and defend against the dangers carried by billions of network flows.

Cyber attacks can result in data breaches and significant financial losses. According to the World Economic Forum¹ (WEF), the economic cost of cybercrime was estimated at $3 trillion worldwide in 2015 and is expected to total $6 trillion globally in 2021.

Threat detection and prevention mostly depend on an Intrusion Detection System (IDS) or Intrusion Prevention System (IPS), which analyzes network traffic for signatures that match known cyber attacks. The line between Intrusion Detection and Intrusion Prevention Systems (IDS and IPS, respectively) has become increasingly blurred. Currently, signature-based IDSs are the more common kind, since they are reliable and used by many organizations. That said, traditional signature-based IDSs are prone to False-Negatives (FN) and cannot identify novel attacks such as zero-day exploits², because they identify attacks based on known attack signatures. A false negative is the most serious and dangerous state: the IDS identifies an activity as acceptable when the activity is actually an attack³. In other words, a False-Negative occurs when the IDS fails to catch an attack.
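The cost of a false negative can be made concrete with a small calculation. The following is a minimal sketch, not code from this thesis: the names `ConfusionCounts` and `count_outcomes` are hypothetical, and the labels are made-up illustration data. It counts the four possible outcomes of a detector and computes recall, the fraction of real attacks that were caught, which drops as false negatives rise.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ConfusionCounts:
    tp: int = 0  # attacks correctly flagged as attacks
    fp: int = 0  # benign traffic wrongly flagged as an attack
    tn: int = 0  # benign traffic correctly passed
    fn: int = 0  # attacks wrongly passed as benign (the dangerous case)

    def recall(self) -> float:
        """Fraction of real attacks that were detected: TP / (TP + FN)."""
        attacks = self.tp + self.fn
        return self.tp / attacks if attacks else 0.0


def count_outcomes(actual: Sequence[bool], predicted: Sequence[bool]) -> ConfusionCounts:
    """Compare ground truth with detector output; True means 'attack'."""
    counts = ConfusionCounts()
    for is_attack, flagged in zip(actual, predicted):
        if is_attack and flagged:
            counts.tp += 1
        elif is_attack and not flagged:
            counts.fn += 1
        elif not is_attack and flagged:
            counts.fp += 1
        else:
            counts.tn += 1
    return counts


# Illustration data only: three attacks, one of which the detector misses.
actual = [True, True, True, False, False]
predicted = [True, False, True, False, True]
counts = count_outcomes(actual, predicted)
print(counts, round(counts.recall(), 3))
```

A detector that never misses an attack has `fn == 0` and therefore a recall of 1.0, which is why recall is the natural metric when the goal is to eliminate false negatives.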

Over the last few years, the rapid development of machine learning has contributed greatly to automatic behavioral anomaly detection. In this thesis, we propose a machine learning approach to this field. By applying machine learning, we aim to build an intelligent system that classifies threats and threat actors, helps detect potential attacks, and thereby improves computer security and provides better protection.

In this thesis, we want to apply machine learning approaches to analyse millions of live network flows. Thanks to automation driven by artificial intelligence, specialists may obtain a deeper understanding of each situation, which improves network security performance and reliability.

With Traffic validator, we expect to classify incoming traffic into benign and malicious classes. The network traffic has already been filtered through a rule-based IDS such as Snort, and our model is an add-on to the IDS that aims to eliminate the false negatives of the rule-based IDS.

¹ Available at https://reports.weforum.org

² A zero-day attack (also referred to as Day Zero) is an attack that exploits a potentially serious software security weakness that the vendor or developer may be unaware of

³ Available at https://owasp.org/www-community/controls/Intrusion_Detection
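The add-on arrangement described above can be sketched as follows. This is an illustrative assumption rather than the thesis's actual decision table, which is not reproduced in this part of the text: the function `combined_verdict`, the `Verdict` enum, and the 0.5 threshold are all hypothetical. The sketch flags a flow when either the rule-based IDS or the learned model considers it malicious, so the model can catch attacks the signatures missed.

```python
from enum import Enum


class Verdict(Enum):
    BENIGN = "benign"
    MALICIOUS = "malicious"


def combined_verdict(ids_alerted: bool, model_score: float,
                     threshold: float = 0.5) -> Verdict:
    """Combine a rule-based IDS result with a model's malicious-probability score.

    Assumed policy: a flow is malicious if the IDS raised an alert OR the
    model score reaches the threshold. The second branch is what covers
    the IDS's false negatives.
    """
    if ids_alerted:
        return Verdict.MALICIOUS
    if model_score >= threshold:
        return Verdict.MALICIOUS
    return Verdict.BENIGN


# The IDS missed this flow, but the (hypothetical) model score is high.
print(combined_verdict(ids_alerted=False, model_score=0.93))
```

An OR-style policy lowers false negatives at the price of more false positives; a stricter policy would be a different design choice.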


malicious network traffic module with the details of processing of raw datasets.


BACKGROUND KNOWLEDGE

Contents

2.1 Fundamental of network
2.1.1 Networking concept
2.1.2 Reference models
2.2 Intrusion Detection System
2.3 Word Embedding
2.4 Deep Neural Network
2.4.1 Recurrent Neural Network
2.4.2 Long Short Term Memory


Chapter 2

To understand anomaly detection in networks, we must have a good understanding of basic network concepts. Therefore, the first part of this chapter discusses networking concepts and types of networks.

A network is a complex interacting system composed of many individual entities. Two or more computer systems that can send or receive data from each other through a medium that they share and access are said to be connected. The behavior of the individual entities contributes to the ensemble behavior of the entire network. In a computer network, there are generally three communicating entities:

• Users: Humans who perform various activities on the network, such as browsing Web pages and shopping.

• Hosts: Computers, each of which is identified by a unique address, such as an IP address.

• Processes: Instances of executable programs used in a client–server architecture. Client processes request a network service from the server(s), whereas server processes provide the requested services to the clients. An example client is a Web browser that requests pages from a Web server, which responds to the requests.

To ease network engineering, the whole networking concept is divided into multiple layers. Each layer is involved in some particular task and is independent of all other layers. As a whole, however, almost all networking tasks depend on all of these layers. Layers share data between them, and they depend on each other only to take input and send output.

Layered architecture

In the layered architecture of a network model, one whole network process is divided into small tasks. Each small task is then assigned to a particular layer, which is dedicated to processing that task only. Every layer does only specific work. The rules and conventions used at layer n are collectively referred to as the layer-n protocol. A protocol represents an agreement as to how communication takes place between two parties: it is a set of rules that governs the meaning and format of the packets or messages exchanged between two or more peer entities, as well as the actions taken when a message is transmitted or received, and in certain other situations. Protocols are used extensively by computer networks.

In a layered communication system, one layer of a host deals with the task done by, or to be done by, its peer layer at the same level on the remote host. The task is initiated either by the layer at the lowest level or by the layer at the topmost level. If the task is initiated by the topmost layer, it is passed on to the layer below it for further processing. The lower layer does the same thing: it processes the task and passes it on to the next lower layer. If the task is initiated by the lowermost layer, the reverse path is taken. There are two well-known reference models: the Open Systems Interconnection (OSI) reference model and the Transmission Control Protocol/Internet Protocol (TCP/IP) reference model.
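The downward and upward paths described above can be illustrated with a toy encapsulation routine. This is a minimal sketch under simplifying assumptions: the layer names in `LAYERS`, the bracketed text "headers", and the functions `send_down` and `receive_up` are hypothetical stand-ins, and real protocol headers are binary structures with many fields.

```python
# Toy layer stack, ordered from the topmost layer to the lowest one.
LAYERS = ["application", "transport", "network", "link"]


def send_down(payload: bytes) -> bytes:
    """Sender side: each layer prepends its own header, top to bottom."""
    data = payload
    for layer in LAYERS:
        data = f"[{layer}]".encode() + data
    return data


def receive_up(frame: bytes) -> bytes:
    """Receiver side: each layer strips its peer's header, bottom to top."""
    data = frame
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        if not data.startswith(header):
            raise ValueError(f"missing {layer} header")
        data = data[len(header):]
    return data


frame = send_down(b"GET /index.html")
print(frame)              # the lowest layer's header ends up outermost
print(receive_up(frame))  # the original payload is recovered
```

Each layer on the receiving host only looks at the header written by its peer layer on the sending host, which is what lets the layers evolve independently.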


The ISO¹ OSI Reference Model

The Open Systems Interconnection model (OSI model²) is a conceptual model created by the International Organization for Standardization that characterizes and standardizes the communication functions of a telecommunications or computing system regardless of its underlying internal structure and technology. Its objective is to ensure that different communication systems can communicate with each other using standard communication protocols.

The model divides the flow of data in a communication system into seven abstraction layers, from the physical implementation of transmitting bits across a communications channel to the highest-level representation of data in a distributed application. Each intermediate layer provides a class of functionality to the layer above it while being served by the layer below. Classes of functionality are implemented in software through standard communication protocols.

• Application layer: This layer provides users access to the OSI environment and to distributed information services.

• Presentation layer: This layer provides independence to application processes from differences in data representation (syntax).

• Session layer: It provides the control structure for communication between applications. Establishment, management and termination of connections (sessions) between cooperating applications are the major responsibilities of this layer.

• Transport layer: This layer supports reliable and transparent transfer of data between two end points. It also supports end-to-end error recovery and flow control.

• Network layer: This layer provides upper layers independence from the data transmission and switching technologies used to connect systems. It is also responsible for the establishment, maintenance and termination of connections.

• Data link layer: The responsibility of reliably transferring information across the physical link is assigned to this layer. It transfers blocks (frames) with the necessary synchronization, error control and flow control.

• Physical layer: This layer is responsible for transmitting a stream of unstructured bits over the physical medium. It must deal with the mechanical, electrical, functional and procedural issues of accessing the physical medium.

¹ International Organization for Standardization

² Available at https://www.cloudflare.com/


Figure 2.1: The OSI reference model

TCP/IP³ Reference Model

The functions of the TCP/IP architecture are divided into five layers. The functions of layers 1 and 2 are supported by bridges, and layers 1 through 3 are implemented by routers.

• The application layer: The application layer is responsible for supporting network applications. As computer and networking technologies change, the types of applications supported by this layer also change. Each application, such as file transmission, has its own specific module. The application layer includes many protocols for different purposes, such as the Hypertext Transfer Protocol (HTTP) for Web requests, the Simple Mail Transfer Protocol (SMTP) for electronic mail, and the File Transfer Protocol (FTP) for file transfer.

• The transport layer: Application layer messages are transported between the client and the server by the transport layer. Two transport protocols are available for this purpose: TCP and UDP (User Datagram Protocol). To deliver application layer messages to the destination, TCP offers a connection-oriented service with guaranteed delivery. TCP also breaks long messages into shorter segments and provides a congestion-control mechanism that lets a source throttle its transmission rate when the network is congested. The UDP

³ Available at https://docs.oracle.com/cd/E19683-01/806-4075/ipov-10/index.html


in the IP datagram that dictate how end systems and routers operate. This layer also comprises routing protocols for determining the paths along which packets are sent between a source and a destination. In a source host, the TCP or UDP protocol passes a transport layer segment and a destination address to the next layer protocol, IP. The IP layer is then in charge of delivering the segment to the destination host. When the packet arrives at the destination host, IP passes the segment up to the transport layer within that host. The internet layer routes a packet through a series of packet switches between a source and a destination.

• The link layer: When forwarding a packet from one node (host or packet switch) to the next, the network layer relies on the link layer's services. In particular, at each node the network layer passes the datagram down to the link layer, which delivers it to the next node along the path. At the next node, the link layer passes the datagram up to the network layer. The service provided depends on the link layer protocol used on that link.

• The physical layer: This layer is responsible for moving the individual bits of a frame from one node to the next. The protocols in this layer depend on the link as well as on the actual transmission medium (e.g., coaxial cable, twisted pair, etc.). Ethernet, for example, has many physical layer protocols: those for twisted-pair copper wire, coaxial cable, and optical fiber are all distinct. In each case, a bit is moved across the link in a different way.
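To make the IP datagram fields mentioned in the list above more tangible, the sketch below builds and parses a 20-byte IPv4 header using Python's standard `struct` and `socket` modules. The helper names `build_ipv4_header` and `parse_ipv4_header` and the sample addresses are hypothetical; the field layout follows the standard IPv4 header without options, and the checksum is left at zero for brevity, so this is an illustration rather than a packet ready to be sent.

```python
import socket
import struct

# Fixed 20-byte IPv4 header (no options), network byte order:
# version/IHL, TOS, total length, identification, flags/fragment offset,
# TTL, protocol, header checksum, source address, destination address.
IPV4_HEADER = struct.Struct("!BBHHHBBH4s4s")
PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}


def build_ipv4_header(src: str, dst: str, protocol: int, payload_len: int) -> bytes:
    """Pack a minimal header; the checksum is NOT computed (left as 0)."""
    version_ihl = (4 << 4) | 5  # IPv4, header length of 5 32-bit words
    total_length = IPV4_HEADER.size + payload_len
    return IPV4_HEADER.pack(version_ihl, 0, total_length, 0, 0, 64, protocol, 0,
                            socket.inet_aton(src), socket.inet_aton(dst))


def parse_ipv4_header(packet: bytes) -> dict:
    """Extract the fields a router or an IDS would typically inspect."""
    (version_ihl, _tos, total_length, _ident, _flags_frag,
     ttl, protocol, _checksum, src, dst) = IPV4_HEADER.unpack_from(packet)
    return {
        "version": version_ihl >> 4,
        "header_len": (version_ihl & 0x0F) * 4,
        "total_length": total_length,
        "ttl": ttl,
        "protocol": PROTOCOL_NAMES.get(protocol, str(protocol)),
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }


packet = build_ipv4_header("10.0.0.1", "10.0.0.2", 6, payload_len=100) + b"\x00" * 100
print(parse_ipv4_header(packet))
```

The `protocol` field is what tells the receiving host which transport protocol (TCP or UDP) should be handed the segment carried in the datagram.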


Figure 2.2: The TCP/IP architecture compared to the OSI architecture

An IDS is a device or software application that monitors network traffic for suspicious activity and threats across an entire networking environment, and issues alerts when such activity is discovered. An IDS typically protects an entire network from attacks such as scanning attacks, asymmetric routing, buffer overflow attacks, protocol-specific attacks, malware, and, notably, traffic flooding. An IDS provides defense at the network layer (unlike a WAF⁴, which provides defense at the application layer). There are two main types of intrusion detection system: the Network Intrusion Detection System (NIDS) and the Host Intrusion Detection System (HIDS).

An IDS is deployed at strategic points throughout the network, so as to cover the places where traffic is most likely to be vulnerable to attack. It passively observes the network traffic passing through the points where it is deployed.

At the most basic level, a NIDS examines network traffic, whereas a HIDS examines actions and data on the host devices.

An IDS operates through a set of rules or signatures. IDSs often look for known attack signatures or abnormal deviations from predetermined norms. These anomalous network traffic patterns are subsequently forwarded up the stack to the protocol and application layers of the OSI (Open Systems Interconnection) model for further investigation. These

⁴ Web Application Firewall
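The two detection styles mentioned above, matching known signatures and spotting deviations from a norm, can be contrasted in a few lines. This is a simplified sketch, not how Snort or any particular IDS is implemented: the `SIGNATURES` table, the byte patterns in it, the function names, and the z-score threshold of 3.0 are all assumptions chosen for illustration.

```python
import statistics
from typing import List, Sequence

# Hypothetical signature table: rule name -> byte pattern to look for.
SIGNATURES = {
    "sql_injection": b"' OR 1=1",
    "path_traversal": b"../../",
}


def match_signatures(payload: bytes) -> List[str]:
    """Signature-based detection: report every known pattern found in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]


def is_anomalous(value: float, baseline: Sequence[float], threshold: float = 3.0) -> bool:
    """Anomaly-based detection: flag values far from the baseline mean (z-score)."""
    mean = statistics.mean(baseline)
    stdev = statistics.pstdev(baseline)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


# A payload containing a known pattern is caught by its signature.
print(match_signatures(b"GET /item?id=1' OR 1=1--"))
# A request rate far above the usual packets-per-second baseline is flagged.
print(is_anomalous(500, baseline=[100, 110, 90, 105, 95]))
```

The signature matcher can only report patterns already in its table, which is the root of the false-negative problem for novel attacks; the anomaly check needs no prior knowledge of the attack but depends on a representative baseline.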


Title	Machine Learning Approaches to Cyber Security
Authors	Huynh Kien Van, Nguyen Duc Kien
Supervisor	Dr. Nguyễn Đức Thái
University	Vietnam National University - Ho Chi Minh City
Major	Computer Science
Type	graduation thesis
Year of publication	2021
City	Ho Chi Minh City
