Improving Some Artificial Immune Algorithms for Network Intrusion Detection


MINISTRY OF EDUCATION VIETNAMESE ACADEMY

GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY

NGUYEN VAN TRUONG

IMPROVING SOME ARTIFICIAL IMMUNE ALGORITHMS FOR

NETWORK INTRUSION DETECTION

THE THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

IN MATHEMATICS

Hanoi - 2019


Major: Mathematical Foundations for Informatics

Code: 62 46 01 10

Scientific supervisors:

1. Assoc. Prof. Dr. Nguyen Xuan Hoai

2. Assoc. Prof. Dr. Luong Chi Mai


First of all, I would like to thank my principal supervisor, Assoc. Prof. Dr. Nguyen Xuan Hoai, for introducing me to the field of Artificial Immune Systems. He guides me step by step through research activities such as seminar presentations, paper writing, etc. His genius has been a constant source of help. I am intrigued by his constructive criticism throughout my PhD journey. I also wish to thank my co-supervisor, Assoc. Prof. Dr. Luong Chi Mai. She is always very enthusiastic in our discussions of promising research questions. It is a pleasure and a luxury for me to work with her. This thesis would not have been possible without my supervisors' support.

I gratefully acknowledge the support from the Institute of Information Technology, Vietnamese Academy of Science and Technology, and from Thai Nguyen University of Education. I thank the National Foundation for Science and Technology Development (NAFOSTED) and the ASEAN-European Academic University Network (ASEA-UNINET) for their financial support.

I thank M.Sc. Vu Duc Quang, M.Sc. Trinh Van Ha and M.Sc. Pham Dinh Lam, my co-authors of published papers. I thank Assoc. Prof. Dr. Tran Quang Anh and Dr. Nguyen Quang Uy for many helpful insights for my research. I thank my colleagues, especially my cool labmate Mr. Nguyen Tran Dinh Long, in the IT Research & Development Center, HaNoi University.

Finally, I thank my family for their endless love and steady support.


Certificate of Originality

I hereby declare that this submission is my own work under my scientific supervisors, Assoc. Prof. Dr. Nguyen Xuan Hoai and Assoc. Prof. Dr. Luong Chi Mai. I declare that it contains no material previously published or written by another person, except where due reference is made in the text of the thesis. In addition, I certify that all my co-authors allow me to present our work in this thesis.

Hanoi, 2019 PhD student

Nguyen Van Truong


Contents

List of Figures
List of Tables
Notation and Abbreviation

INTRODUCTION
  Motivation
  Objectives
  Problem statements
  Outline of thesis

1 BACKGROUND
  1.1 Detection of Network Anomalies
  1.2 A brief overview of human immune system
  1.3 AIS for IDS
  1.4 Selection algorithms
  1.5 Basic terms and definitions
  1.6 Datasets
  1.7 Summary

2 COMBINATION OF NEGATIVE SELECTION AND POSITIVE SELECTION
  2.1 Introduction
  2.2 Related works
  2.3 New Positive-Negative Selection Algorithm
  2.4 Experiments
  2.5 Summary

3 GENERATION OF COMPACT DETECTOR SET
  3.1 Introduction
  3.2 Related works
  3.3 New negative selection algorithm
  3.4 Experiments
  3.5 Summary

4 FAST SELECTION ALGORITHMS
  4.1 Introduction
  4.2 Related works
  4.3 A fast negative selection algorithm based on r-chunk detector
  4.4 A fast negative selection algorithm based on r-contiguous detector
  4.5 Experiments
  4.6 Summary

5 APPLYING HYBRID ARTIFICIAL IMMUNE SYSTEM FOR NETWORK SECURITY
  5.1 Introduction
  5.2 Related works
  5.3 Hybrid positive selection algorithm with chunk detectors
  5.4 Experiments
  5.5 Summary

CONCLUSIONS
  Contributions of this thesis
  Future works

Published works

BIBLIOGRAPHY

List of Figures

1.1 Classification of anomaly-based intrusion detection methods
1.2 Multi-layered protection and elimination architecture
1.3 Multi-layer AIS model for IDS
1.4 Outline of a typical negative selection algorithm
1.5 Outline of a typical positive selection algorithm
1.6 Example of a prefix tree and a prefix DAG
1.7 Existence of holes
1.8 Negative selections with 3-chunk and 3-contiguous detectors
1.9 A simple ring-based representation (b) of a string (a)
1.10 Frequency trees for all 3-chunk detectors
2.1 Binary tree representation of the detectors set generated from S
2.2 Conversion of a positive tree to a negative one
2.3 Diagram of the Detector Generation Algorithm
2.4 Diagram of the Positive-Negative Selection Algorithm
2.5 One node is reduced in a tree: a compact positive tree has 4 nodes (a) and its conversion (a negative tree) has 3 nodes (b)
2.6 Detection time of NSA and PNSA
2.7 Nodes reduction on trees created by PNSA on Netflow dataset
2.8 Comparison of nodes reduction on Spambase dataset
3.1 Diagram of an algorithm to generate perfect rcbvl detectors set
4.1 Diagram of the algorithm to generate positive r-chunk detectors set
4.2 A prefix DAG G and an automaton M
4.3 Diagram of the algorithm to generate negative r-contiguous detectors set
4.4 An automaton represents 3-contiguous detectors set
4.5 Comparison of ratios of runtime of r-chunk detector-based NSA to runtime of Chunk-NSA
4.6 Comparison of ratios of runtime of r-contiguous detector-based NSA to runtime of Cont-NSA

List of Tables

1.1 Performance comparison of NSAs on linear strings and ring strings
2.1 Comparison of memory and detection time reductions
2.2 Comparison of nodes generation on Netflow dataset
3.1 Data and parameters distribution for experiments and results comparison
4.1 Comparison of our results with the runtimes of previously published algorithms
4.2 Comparison of Chunk-NSA with r-chunk detector-based NSA
4.3 Comparison of proposed Cont-NSA with r-contiguous detector-based NSA
5.1 Features for NIDS
5.2 Distribution of flows and parameters for experiments
5.3 Comparison between PSA2 and other algorithms
5.4 Comparison between ring string-based PSA2 and linear string-based PSA2

Notation and Abbreviation

$\Sigma$: An alphabet, a nonempty and finite set of symbols
$\Sigma^k$: Set of all strings of length $k$ on alphabet $\Sigma$, where $k$ is a positive integer
$\Sigma^*$: Set of all strings on alphabet $\Sigma$, including an empty string
$r$: Matching threshold
Set of all positive r-chunk detectors at position i
Set of all negative r-chunk detectors at position i
CONT(S, r): Set of all r-contiguous detectors
L(X): Set of all nonself strings detected by X
rcbvl: r-contiguous bit with variable length

Chunk Detector-Based Negative Selection Algorithm
Contiguous Detector-Based Negative Selection Algorithm
Detection Rate
Directed Acyclic Graph
False Alarm Rate
Genetic Algorithm
Host Intrusion Detection System
Intrusion Detection System
Negative Selection Algorithm
Negative Selection Mutation
Positive-Negative Selection Algorithm
Positive Selection Algorithm
Two-class Positive Selection Algorithm
Particle Swarm Optimization
Particle Swarm Optimization-Gravitational Search Algorithm
Real-valued NSA
Support Vector Machines
Transmission Control Protocol
Variable length detector-based NSA

a constantly changing environment. As a result, many researchers have attempted to use different types of approaches to build reliable intrusion detection systems.

Computational intelligence techniques, known for their ability to adapt and to exhibit fault tolerance, high computational speed and resilience against noisy information, are promising alternative methods for this problem.

One of the promising computational intelligence methods for intrusion detection that have emerged recently is artificial immune systems (AIS), inspired by the biological immune system. The negative selection algorithm (NSA), a dominant model of AIS, is widely used for intrusion detection systems (IDS) [55, 52]. Despite its successful application, NSA has some weaknesses: (1) high false positive rate (false alarm rate) and false negative rate; (2) high training and testing time; (3) an exponential relationship between the size of the training data and the number of detectors possibly generated for testing; (4) changeable definitions of "normal data" and "abnormal data" in a dynamic network environment [55, 79, 92]. To overcome these limitations, recent works tend to concentrate on complex structures of immune detectors, matching methods and hybrid NSAs [11, 94, 52].

Following the trends mentioned above, in this thesis we investigate the ability of NSA to combine with other classification methods and propose more effective data representations to address some of NSA's weaknesses.

Scientific meaning of the thesis: to provide further background for improving the performance of the AIS-based computer security field in particular and of IDS in general.

Practical meaning of the thesis: to assist computer security practitioners and experts in implementing their IDS with new features of AIS origin.

The major contributions of this research are: proposing a new representation of data for better performance of IDS; proposing a combination of existing algorithms as well as some statistical approaches in a uniform framework; and proposing a complete and non-redundant detector representation to achieve optimal time and memory complexities.

2. The second problem is to propose algorithms that can reduce training time and testing time in comparison with all existing related algorithms.

3. The third problem is to improve detection performance by reducing false alarm rates while keeping the detection rate and accuracy rate as high as possible.

Solutions to these problems can partly address the first three weaknesses listed in the first section. Regarding the last weakness of NSAs, the changeable definitions of "normal data" and "abnormal data" in a dynamic network environment, we consider it a risk for our proposed algorithms and leave it for future work.

Logically, it is impossible to find an optimal algorithm that can both reduce time and memory complexities and obtain the best detection performance. These aspects are always in conflict with each other. Thus, in each chapter, we propose algorithms that solve each problem largely independently.

The intrusion detection problem addressed in this thesis can be informally stated as follows:

Given a finite set S of network flows, each labeled as self (normal) or nonself (abnormal), the objective is to build classification models on S that can correctly label an unlabeled network flow s.

In Chapter 2, a method for combining selection algorithms is presented. The proposed technique helps to reduce the storage required for detectors generated in the training phase. Testing time, an important measurement in IDS, is also reduced as a direct consequence of the smaller memory complexity. A tree structure is used in this chapter (and in Chapter 5) to improve time and memory complexities.

A complete and nonredundant detector set, also called a perfect detector set, is necessary to achieve acceptable self and nonself coverage of classifiers. A selection algorithm that generates a perfect detector set is investigated in Chapter 3. Each detector in the set is a string concatenated from overlapping classical ones. Unlike the approaches in the other chapters, the discrete structure of the string-based detectors in this chapter is suitable for detection in a distributed environment.

Chapter 4 includes two selection algorithms with a fast training phase. The optimal algorithms can generate a detector set in time linear in the size of the training data. The experimental results and theoretical proofs show that the proposed algorithms outperform all existing ones in terms of training time. In terms of detection time, the first algorithm is linear and the second is polynomial.

Chapter 5 mainly introduces a hybrid approach that combines a positive selection algorithm with statistics for a more effective NIDS. Frequencies of self and nonself data (strings) are stored in the leaves of the trees representing detectors. This information plays an important role in improving the performance of the proposed algorithms. The hybrid approach results in a new positive selection algorithm for two-class classification that can be trained with samples from both self and nonself data types.


How to apply the remarkable features of the human immune system (HIS) to achieve a scalable and robust IDS is considered a research gap in the field of computer security. In this chapter, we introduce the background knowledge necessary to discuss the algorithms proposed in the following chapters, which can partly fill this gap.

Firstly, a brief introduction to network anomaly detection is presented. We then give an overview of the HIS. Next, immune selection algorithms, detectors, performance metrics and their relevance are reviewed and discussed. Finally, some popular datasets are examined.

1.1 Detection of Network Anomalies

The idea of intrusion detection is predicated on the belief that an intruder's behavior is noticeably different from that of a legitimate user and that many unauthorized actions are detectable [65]. Intrusion detection systems (IDSs) are deployed as a second line of defense along with other preventive security mechanisms, such as user authentication and access control. Based on its deployment, an IDS can act either as a host-based or as a network-based IDS.

1.1.1 Host-Based IDS

A Host-Based IDS (HIDS) monitors and analyzes the internals of a computing system. A HIDS may detect internal activity such as which program accesses what resources and attempts illegitimate access, for example, an activity that modifies the system password database. Similarly, a HIDS may look at the state of a system and its stored information, whether it is in RAM, in the file system, in log files or elsewhere. Thus, one can think of a HIDS as an agent that monitors whether anything or anyone, internal or external, has circumvented the security policy that the operating system tries to enforce [12].

1.1.2 Network-Based IDS

A Network-Based IDS (NIDS) detects intrusions in network data. Intrusions typically occur as anomalous patterns. Most techniques model the data in a sequential fashion and detect anomalous subsequences. The primary cause of these anomalies is attacks launched by outside attackers who want to gain unauthorized access to the network to steal information or to disrupt the network. In a typical setting, a network is connected to the rest of the world through the Internet. The NIDS reads all incoming packets or flows, trying to find suspicious patterns. For example, if a large number of TCP connection requests to a very large number of different ports are observed within a short time, one could assume that someone is performing a port scan on some of the computers in the network. A NIDS also mostly tries to detect incoming shellcode in the same manner that an ordinary intrusion detection system does. In addition to inspecting the incoming traffic, a NIDS also provides valuable information about intrusions from outgoing or local traffic. Some attacks might even be staged from inside the monitored network or network segment, and are therefore not regarded as incoming traffic at all. The data available for intrusion detection systems can be at different levels of granularity, such as packet-level traces or Cisco NetFlow data. The data is high dimensional, typically with a mix of categorical and continuous numeric attributes. Misuse-based NIDSs search for known intrusive patterns, while anomaly-based intrusion detectors search for unusual patterns. Today, intrusion detection research is mostly concentrated on anomaly-based network intrusion detection because it can detect both known and unknown attacks [12].
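The port scan example above can be made concrete with a small sketch. It is only an illustration of the heuristic described in the text, not a component of any tool or algorithm from this thesis: the function name `find_port_scanners`, the event format, and the `window` and `max_ports` thresholds are all assumptions chosen for the example.

```python
from collections import defaultdict, deque


def find_port_scanners(events, window=10.0, max_ports=100):
    """Flag sources that contact many distinct ports within a short time.

    events: iterable of (timestamp_seconds, source_ip, destination_port),
            assumed to be sorted by timestamp.
    window and max_ports are illustrative thresholds, not recommended values.
    """
    recent = defaultdict(deque)  # source -> (time, port) pairs inside the window
    flagged = set()
    for t, src, port in events:
        history = recent[src]
        history.append((t, port))
        # Forget connection requests that fell out of the time window.
        while history and history[0][0] < t - window:
            history.popleft()
        # Many different ports in a short time suggests a port scan.
        if len({p for _, p in history}) > max_ports:
            flagged.add(src)
    return flagged
```

A real NIDS would combine such a rule with many other features; the sketch only shows how "many different ports within a short time" can be turned into a sliding-window count.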

1.1.3 Methods

On the basis of the availability of prior knowledge, the detection mechanism used, the mode of performance and the ability to detect attacks, existing anomaly detection methods are grouped into six broad categories [41], as shown in Fig. 1.1. This figure is adapted from [12].

Figure 1.1: Classification of anomaly-based intrusion detection methods (supervised learning, unsupervised learning, probabilistic learning, soft computing, knowledge based, and combination learners)

AIS is a fairly new research subfield of computational intelligence. It is considered a system that acts intelligently: what it does is appropriate for its circumstances and its goal; it is flexible to changing environments and changing goals; it learns from experience; and it makes appropriate choices given perceptual limitations and finite computation [68].

1.1.4 Tools

IDS tools are used for purposes such as information gathering, victim identification, packet capture, network traffic analysis and visualization of traffic behavior. Commercial and free examples of these tools include Snort, Suricata, Bro, OSSEC, Samhain, Cisco Secure IDS, CyberCop, and RealSecure. Some immune-related IDS tools include LISYS [10], which is based on TCP packets, and MILA [26], a multilevel immune learning algorithm proposed for novel pattern recognition.

However, despite their initially promising and influential properties, immune-based IDSs never made it beyond the prototype stage [83]. Two main issues that impeded the progress of immune algorithms were identified: the large computational cost of achieving acceptable coverage of the potentially anomalous region [54], and the failure of these algorithms to generalize properly beyond the training set [79].

1.2 A brief overview of human immune system

Inspired mainly by the human immune system, researchers have developed AISs. The human immune system has a multi-layered protection architecture consisting of physical barriers, physiological barriers, an innate immune system, and an adaptive immune system. The adaptive immune system, which can adaptively recognize specific types of pathogens and memorize them for accelerated future responses, is a complex of a variety of molecules, cells, and organs spread all over the body [46]. Pathogens are foreign substances, such as viruses, parasites and bacteria, that attack the body. Figure 1.2, adapted from [77], presents a multi-layered protection and elimination architecture.

Figure 1.2: Multi-layered protection and elimination architecture

T cells and B cells cooperate to distinguish self from nonself. On the one hand, T cells recognize antigens with the help of major histocompatibility complex (MHC) molecules. Antigen presenting cells ingest and fragment antigens into peptides. MHC molecules transport these peptides to the surface of antigen presenting cells. T cells whose receptors bind with these peptide-MHC combinations are said to recognize antigens. On the other hand, B cells recognize antigens by binding their receptors directly to antigens. The bindings are actually chemical bonds between receptors and epitopes. The more complementary the structure and the charge between receptors and epitopes are, the more likely binding will occur. The strength of the bond is termed affinity. To avoid autoimmunity, T cells and B cells must pass a negative selection stage, where lymphocytes matching self cells are killed.

Prior to negative selection, T cells undergo positive selection. This is because, in order to bind to the peptide-MHC combinations, they must recognize self MHC first. Thus, positive selection eliminates T cells with weak bonds to self MHC. T cells and B cells that survive negative selection become mature and enter the blood stream to perform the detection task. Since these mature lymphocytes have never encountered antigens, they are naive. Naive T cells and B cells can possibly auto-react with self cells, because some peripheral self proteins are never presented during the negative selection stage. To prevent self-attack, naive cells need two signals in order to be activated: one occurs when they bind to antigens, and the other comes from other sources as a confirmation. Naive T helper cells receive the second signal from innate system cells. In the event that they are activated, T cells begin to clone. Some of the clones will send out signals to stimulate macrophages or cytotoxic T cells to kill antigens, or send out signals to activate B cells. Others will form memory T cells. The activated B cells migrate to a lymph node. In the lymph node, a B cell will clone itself. Meanwhile, somatic hypermutation is triggered, whose rate is 10 times higher than that of the germ line mutation, and is inversely proportional to the affinity. Mutation changes the receptor structures of offspring; hence offspring have to bind to pathogenic epitopes captured within the lymph nodes. If they do not bind, they will simply die after a short time. If they succeed in binding, they will leave the lymph node and differentiate into plasma or memory B cells.

In summary, the HIS is a distributed, self-organizing and lightweight defense system for the body. These remarkable features fulfill and benefit the design goals of an intrusion detection system, thus resulting in a scalable and robust system [53].

1.3 AIS for IDS

1.3.1 AIS model for IDS

Figure 1.3 illustrates the steps necessary to obtain an AIS solution for a security problem, as first envisioned by de Castro and Timmis [27] and later adopted by Fernandes et al. [35]. Firstly, the security domain of the system to be modeled needs to be identified. Secondly, the immune entities that best fit the needs of the system should be picked from the immunological theories. This should make it easier to choose the representation of the entities. In the affinity measures step, one should take into account a matching rule that outputs whether two elements should bind.

Figure 1.3: Multi-layer AIS model for IDS

1.3.2 AIS features for IDS

According to Kim et al. [55], AIS features can be illustrated and summarized as follows.

Firstly, a distributed IDS supports robustness, configurability, extendibility and scalability. It is robust since the failure of one local intrusion detection process does not cripple the overall IDS. It is also easy to configure, since each intrusion detection process can be simply tailored to the local requirements of a specific host. The addition of new intrusion detection processes running on different operating systems does not require modification of existing processes, and hence it is extensible. It can also scale better, since the high volume of audit data is distributed amongst many local hosts and is analyzed by those hosts.

Secondly, a self-organizing IDS provides adaptability and global analysis. Without external management or maintenance, a self-organizing IDS automatically detects intrusion signatures which are previously unknown and/or distributed, and eliminates and/or repairs compromised components. Such a system is highly adaptive because there is no need for manual updates of its intrusion signatures as network environments change. Global analysis emerges from the interactions among a large number of varied intrusion detection processes.

Next, a lightweight IDS supports efficiency and dynamic features. A lightweight IDS does not impose a large overhead on a system or place a heavy burden on CPU and I/O. It places minimal work on each component of the IDS. The primary functions of hosts and networks are not adversely affected by the monitoring. It also dynamically covers intrusion and non-intrusion pattern spaces at any given time rather than maintaining entire intrusion and non-intrusion patterns.

One more important feature is that a multi-layered IDS increases robustness. The failure of one layer of defense does not necessarily allow an entire system to be compromised. While a distributed IDS allocates intrusion detection processes across several hosts, a multi-layered IDS places different levels of sensors at one monitoring place.

Additionally, a diverse IDS provides robustness. A variety of different intrusion detection processes spread across hosts will slow an attack that has successfully compromised one or more hosts. This is because an understanding of the intrusion process at one site provides limited or no information on intrusion processes at other sites.

Finally, a disposable IDS increases robustness, extendibility and configurability. A disposable IDS does not depend on any single component. Any component can be easily and automatically replaced with other components. These properties are important in an effective IDS, as well as being established properties of the HIS.

1.4 Selection algorithms

The main developments within AIS have focused on three immunological theories: clonal selection, immune networks and negative selection. Negative selection approaches are based on self-nonself discrimination in biological systems. This property makes them attractive to computer and network security researchers. A survey by G. C. Silva and D. Dasgupta [71] showed that in the five-year period 2008-2013, NSA predominated over all the other models of AIS in terms of published papers relating to both network security and anomaly detection. This trend motivates much of the research work in this thesis.

Another model of AIS, the positive selection algorithm (PSA), is also investigated. We will prove in a later section that, under some conditions, PSA is comparable to NSA in terms of anomaly detection performance.

1.4.1 Negative Selection Algorithms

Negative selection is a mechanism employed to protect the body against self-reactive lymphocytes. Such lymphocytes can occur because the building blocks of antibodies are different gene segments that are randomly composed and undergo a further somatic hypermutation process. Therefore, this process can produce lymphocytes which are able to recognise self-antigens [85].

NSAs are among the most popular and extensively studied techniques in artificial immune systems that simulate the negative selection process of the biological immune system. Stephanie Forrest et al. [38] proposed an algorithmic model of this

Figure 1.4: Outline of a typical negative selection algorithm

The concept of matching, or recognition, is used both in the detector generation phase and in the anomaly detection phase. Regardless of representation, a matching rule on a detector d and a data sample s can be informally defined as a distance measure between d and s within a threshold. The matching threshold exposes the concept of partial matching: two points do not have to be exactly the same to be considered matching. A partial matching rule can support an approximation or a generalization in the algorithms. The choice of the matching rule or the threshold in a matching rule must be application specific and representation dependent [51]. For real-valued representation, some popular rules are Euclidean distance and Manhattan distance. In string representation, the rcb (r-contiguous bits) matching rule and the r-chunk matching rule are the most famous ones, and they are formally presented in the following section.

1 At the time of writing this thesis, the paper has been cited more than 2300 times.

Since its introduction, NSA has had many applications, such as computer virus detection [37, 5], monitoring UNIX processes [36], anomaly detection [22, 26], intrusion detection [19, 54, 46, 59, 18, 93], scheduling [64], fault detection and diagnosis [45, 72], negative databases [33, 98], and negative authentication [25, 20]. Moreover, NSAs have been quite successfully applied in immunology, where they are used as models to provide insight into fundamental principles of immunity and infection [15], and to illustrate immunological processes such as HIV infection [56, 57].
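The partial matching rules named in this subsection can be sketched in a few lines of Python. The sketch uses the commonly cited informal definitions (two equal-length strings match under the r-contiguous rule if they agree in at least r consecutive positions; an r-chunk detector is a string of length r tied to a position); the thesis gives its own formal definitions in a later section, so the function names and signatures below are assumptions made for illustration only.

```python
import math


def r_contiguous_match(d, s, r):
    """True if equal-length strings d and s agree in at least r contiguous positions."""
    if len(d) != len(s):
        raise ValueError("detector and sample must have the same length")
    run = 0
    for a, b in zip(d, s):
        run = run + 1 if a == b else 0  # length of the current run of agreements
        if run >= r:
            return True
    return False


def r_chunk_match(chunk, i, s):
    """True if the chunk (a string of length r) occurs in s at 1-based position i."""
    return s[i - 1:i - 1 + len(chunk)] == chunk


def euclidean_match(d, x, threshold):
    """Real-valued rule: d matches x if their Euclidean distance is within the threshold."""
    return math.dist(d, x) <= threshold
```

In all three rules the threshold (r or the distance bound) controls how partial a match may be: a smaller r or a larger distance bound makes each detector cover more of the data space.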

The most significant characteristics that give an NSA its uniqueness and strength are:

No prior knowledge of nonself is required [29].

It is inherently distributable; no communication between detectors is needed [30].

It can hide the self concept [33].

Compared with other change detection methods, NSAs do not depend on the knowledge of defined normal. Consequently, the checking activity of each site can be based on a unique signature of each site, while the same algorithm is used over multiple sites.

The quality of the check can be traded off against the cost of performing a check [38].

Symmetric protection is provided, so malicious manipulation of the detector set can be detected by the normal behavior of the system [38].

If the process of generating detectors is costly, it can be distributed to multiple sites because of its inherent parallel characteristics.

Detection is tunable to balance between coverage (matching probability) and the number of detectors [29].

1.4.2 Positive Selection Algorithms

In contrast to NSAs, PSAs have been less studied in the literature. PSAs are mainly developed and applied in intrusion detection [23, 73, 44, 66], malware detection [39], spam detection [81], and classification [40, 67]. Stibor et al. [80] argue that positive selection might have better detection performance than negative selection. However, for problems and applications where the number of detectors generated by NSAs is much smaller than the number of self samples, negative selection is obviously a better choice [51].

Similar to NSA, a PSA contains two phases: detector generation and detection. In the detector generation phase (Fig. 1.5.a), the detector candidates are generated by some random processes and matched against the given self sample set S. The candidates that do not match any element in S are eliminated and the rest are kept and stored in the detector set D. In the detection phase (Fig. 1.5.b), the collection of detectors is used to distinguish self from nonself. If an incoming data instance matches any detector, it is claimed as self. In other words, detector modeling involves generating a set of strings (patterns) that do not match any string in a training dataset too strongly (negative selection) or that weakly match at least one string from the same dataset (positive selection). Having obtained the detectors, one usually examines a testing dataset (i.e., "antigens"), for which one searches for one or all matching detectors for classification.

Figure 1.5: Outline of a typical positive selection algorithm.
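The two phases described above, for both negative and positive selection, can be sketched as a simple generate-and-test loop. This is a minimal illustration of the typical scheme outlined in the text, not one of the algorithms proposed in this thesis. It assumes binary strings and an r-contiguous matching rule; the names `generate_detectors`, `classify` and `max_tries`, and the fixed random seed, are choices made for the example.

```python
import random


def matches(d, s, r):
    """Assumed matching rule: d and s agree in at least r contiguous positions."""
    run = 0
    for a, b in zip(d, s):
        run = run + 1 if a == b else 0
        if run >= r:
            return True
    return False


def generate_detectors(self_set, length, r, n_detectors, positive=False,
                       alphabet="01", rng=None, max_tries=10000):
    """Detector generation phase by random generate-and-test.

    Negative selection keeps candidates that match no self sample;
    positive selection keeps candidates that match at least one self sample.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    detectors = set()
    tries = 0
    while len(detectors) < n_detectors and tries < max_tries:
        tries += 1
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        hits_self = any(matches(candidate, s, r) for s in self_set)
        if hits_self == positive:
            detectors.add(candidate)
    return detectors


def classify(sample, detectors, r, positive=False):
    """Detection phase: a match means nonself for NSA and self for PSA."""
    hit = any(matches(d, sample, r) for d in detectors)
    if positive:
        return "self" if hit else "nonself"
    return "nonself" if hit else "self"
```

The random search is the costly part of this scheme: most candidates may be rejected, which is one reason for the high training time listed among the weaknesses of NSA in the Introduction.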

1.5 Basic terms and definitions

In selection algorithms, an essential component is the matching rule, which determines the similarity between detectors and self samples (in the detector generation phase) and incoming data instances (in the detection phase). Obviously, the matching rule depends on the detector representation. In this thesis, both self and nonself cells are represented as strings of fixed length. This is a simple and popular representation for detectors and data in AIS, and other representations (such as real-valued ones) can be reduced to binary, a special case of string [42, 51].

1.5.1 Strings, substrings and languages

An alphabet $\Sigma$ is a nonempty and finite set of symbols. A string $s \in \Sigma^*$ is a sequence of symbols from $\Sigma$, and its length is denoted by $|s|$. A string is called the empty string if its length equals 0. Given an index $i \in \{1, \ldots, |s|\}$, $s[i]$ is the symbol at position $i$ in $s$. Given two indices $i$ and $j$, whenever $j \ge i$, $s[i \ldots j]$ is the substring of $s$ with length $j - i + 1$ that starts at position $i$, and whenever $j < i$, $s[i \ldots j]$ is the empty string. If $i = 1$, then $s[i \ldots j]$ is a prefix of $s$ and, if $j = |s|$, then $s[i \ldots j]$ is a suffix of $s$. For a proper prefix or suffix $s'$ of $s$, we have in addition $|s'| < |s|$. Given a string $s \in \Sigma^\ell$, another string $d \in \Sigma^r$ with $1 \le r \le \ell$, and an index $i \in \{1, \ldots, \ell - r + 1\}$, we say that $d$ occurs in $s$ at position $i$ if $s[i \ldots i + r - 1] = d$. Moreover, the concatenation of two strings $s$ and $s'$ is $s + s'$.

A set of strings $S \subseteq \Sigma^*$ is called a language. For two indices $i$ and $j$, we define

$S[i \ldots j] = \{ s[i \ldots j] \mid s \in S \}$
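The indexing conventions above are 1-based and inclusive, whereas most programming languages use 0-based, half-open ranges. The following Python sketch shows one possible translation. The helper names `substring`, `occurs_at` and `project` are hypothetical and are not taken from the thesis.

```python
def substring(s, i, j):
    """Return s[i...j] with 1-based inclusive indices; empty string if j < i."""
    if j < i:
        return ""
    return s[i - 1:j]


def occurs_at(d, s, i):
    """True if d occurs in s at 1-based position i."""
    r = len(d)
    if i < 1 or i > len(s) - r + 1:
        return False  # index outside {1, ..., l - r + 1}
    return substring(s, i, i + r - 1) == d


def project(S, i, j):
    """S[i...j]: the set of substrings s[i...j] over all s in the language S."""
    return {substring(s, i, j) for s in S}
```

For instance, `occurs_at("101", "010101", 2)` holds, while `occurs_at("101", "010101", 1)` does not, since the substring at position 1 is `010`.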


1.5.2 Prefix trees, prefix DAGs and automata

A prefix tree $T$ is a rooted directed tree with edge labels from $\Sigma$ where for all $c \in \Sigma$, every node has at most one outgoing edge labeled with $c$. For a string $s$, we write $s \in T$ if there is a path from the root of $T$ to a leaf such that $s$ is the concatenation of the labels on this path. The language $L(T)$ described by $T$ is defined as the set of all strings that have a nonempty prefix $s \in T$. For example, for $T$ as in Fig. 1.6.a we have $0 \in T$ and $10 \in T$, but $1 \notin T$. Furthermore, $0 \in L(T)$ and $01 \in L(T)$ since $0 \in T$, and $11 \notin L(T)$ since no prefix of 11 lies in $T$. Trees for the self dataset and the nonself dataset are called positive trees and negative trees, respectively.
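The two membership notions, $s \in T$ and $s \in L(T)$, can be illustrated with a small Python sketch. It assumes a tree built only from the two strings 0 and 10 named in the text (the actual tree of Fig. 1.6.a may contain more paths), and it represents nodes as nested dictionaries. Both choices, and the function names, are illustrative assumptions.

```python
def build_prefix_tree(strings):
    """Build a prefix tree as nested dicts; each dict key is an edge label."""
    root = {}
    for s in strings:
        node = root
        for c in s:
            node = node.setdefault(c, {})
    return root


def in_tree(tree, s):
    """s in T: s labels a path from the root to a leaf."""
    node = tree
    for c in s:
        if c not in node:
            return False
        node = node[c]
    # A leaf is a node without outgoing edges (an empty dict).
    return len(s) > 0 and not node


def in_language(tree, s):
    """s in L(T): some nonempty prefix of s labels a root-to-leaf path."""
    node = tree
    for c in s:
        if c not in node:
            return False
        node = node[c]
        if not node:  # reached a leaf, so the prefix read so far lies in T
            return True
    return False


T = build_prefix_tree(["0", "10"])
```

With this tree, `in_tree(T, "1")` is false because the node reached by 1 is not a leaf, while `in_language(T, "01")` is true because the prefix 0 lies in the tree.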

A prefix DAG $D$ is a directed acyclic graph with edge labels from $\Sigma$, where again for all $c \in \Sigma$, every node has at most one outgoing edge labeled with $c$. Similar to prefix trees, the terms root and leaf are used to refer to a node without incoming and outgoing edges, respectively. We write $s \in D$ if there is a path from a root node to a leaf node in $D$ that is labeled by $s$. Given a node $n \in D$, the language $L(D, n)$ contains all strings that have a nonempty prefix that labels a path from $n$ to some leaf. For instance, if $D$ is the DAG in Fig. 1.6.b and $n$ is its lower left node, then $L(D, n)$ consists of all strings starting with 11. Moreover, we define $L(D) = \bigcup_{n \text{ is a root of } D} L(D, n)$.

A finite automaton is a tuple $M = (Q, q_0, Q_a, \Sigma, \Delta)$, where $Q$ is a set of states with a distinguished initial state $q_0 \in Q$, $Q_a \subseteq Q$ is the set of accepting states, $\Sigma$ is the alphabet of $M$, and $\Delta \subseteq Q \times \Sigma \times Q$ is the transition relation. Furthermore, we assume that the transition relation is unambiguous: for every $q \in Q$ and every $c \in \Sigma$ there is at most one $q' \in Q$ with $(q, c, q') \in \Delta$. It is common to represent the transition relation as a graph with node set $Q$ (with the initial state and the accepting states highlighted properly) and labeled edges (a $c$-labeled edge from $q$ to $q'$ whenever $(q, c, q') \in \Delta$). An automaton $M$ is said to accept a string $s$ if its graph contains a path from $q_0$ to some $q \in Q_a$ whose concatenated edge labels equal $s$ (note that this path may contain loops). The language $L(M)$ contains all strings accepted by $M$.
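Because the transition relation is unambiguous, acceptance can be checked by following at most one edge per input symbol. The sketch below stores the transitions in a dictionary keyed by (state, symbol). The example automaton, with hypothetical states `q0`, `q1` and `qa`, is hand-built for illustration and accepts every binary string that has 0 or 10 as a prefix. It is not the construction presented in Chapter 4.

```python
def accepts(transitions, initial, accepting, s):
    """Follow the unique path labeled by s; accept if it ends in an accepting state."""
    state = initial
    for c in s:
        key = (state, c)
        if key not in transitions:
            return False  # no c-labeled edge leaves the current state
        state = transitions[key]
    return state in accepting


# Hypothetical automaton: the loops on "qa" let any suffix follow the prefix.
TRANSITIONS = {
    ("q0", "0"): "qa",
    ("q0", "1"): "q1",
    ("q1", "0"): "qa",
    ("qa", "0"): "qa",
    ("qa", "1"): "qa",
}
INITIAL = "q0"
ACCEPTING = {"qa"}
```

The two self-loops on `qa` are an instance of the remark that an accepting path may contain loops.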

A prefix DAG can be turned into a finite automaton to decide the membership of strings in languages. The detailed steps of this process are presented in Chapter 4.


The matching rule is dependent on detector representation. For string-based AIS, the r-chunk and r-contiguous detectors are among the most common matching rules. An r-chunk matching rule can be seen as a generalisation of the r-contiguous matching rule, which helps AIS to achieve better results on data where adjacent regions of the input data sequence are not necessarily semantically correlated, such as in network data packets [9].

An important difference between rcb and r-chunk matching rules is holes, or the undetectable strings, that they may induce. This concept is presented in Section 1.5.5.

Given an alphabet $\Sigma$, a nonempty and finite set of symbols, positive and negative r-chunk detectors, r-contiguous detectors and rcbvl-detectors can be defined as follows:

Definition 1.1 (Positive r-chunk detectors) Given a self set $S \subseteq \Sigma^\ell$, a tuple $(d, i)$ of a string $d \in \Sigma^r$, where $r \le \ell$, and an index $i \in \{1, \ldots, \ell - r + 1\}$ is a positive r-chunk detector if there exists an $s \in S$ such that $d$ occurs in $s$ at position $i$.

Definition 1.2 (Negative r-chunk detectors) Given a self set $S \subseteq \Sigma^\ell$, a tuple $(d, i)$ of a string $d \in \Sigma^r$, $r \le \ell$, and an index $i \in \{1, \ldots, \ell - r + 1\}$ is a negative r-chunk detector if $d$ does not occur in any $s \in S$ at position $i$.

Although some of the approaches proposed in the following chapters can be implemented on any finite alphabet, all examples use binary strings, $\Sigma = \{0, 1\}$, for ease of understanding.

Example 1.1 Let $\ell = 6$, $r = 3$. Given a set $S$ of five self strings: $s_1 = 010101$, $s_2 = 111010$, $s_3 = 101101$, $s_4 = 100011$, $s_5 = 010111$. The set of some positive r-chunk detectors is {(010,1), (111,1), (101,2), (110,2), (010,3), (101,3), (101,4), (010,4), (111,4)}. The set of some negative r-chunk detectors is {(000,1), (001,1), (011,1), (001,2), (010,2), (100,2), (000,3), (100,3), (000,4), (001,4), (100,4)}.
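The detectors listed in Example 1.1 can be checked mechanically. The sketch below enumerates all chunks of length r over a binary alphabet at every position and sorts them into positive and negative detectors according to Definitions 1.1 and 1.2. The function name `chunk_detectors` is hypothetical, and the exhaustive enumeration is chosen for clarity rather than efficiency.

```python
from itertools import product


def chunk_detectors(S, r, alphabet="01"):
    """Return (positive, negative) sets of r-chunk detectors (d, i), with i 1-based."""
    S = list(S)
    ell = len(S[0])
    positive, negative = set(), set()
    for i in range(1, ell - r + 2):
        # Chunks that occur in some self string at position i.
        seen = {s[i - 1:i - 1 + r] for s in S}
        for chars in product(alphabet, repeat=r):
            d = "".join(chars)
            if d in seen:
                positive.add((d, i))
            else:
                negative.add((d, i))
    return positive, negative


S_EX11 = ["010101", "111010", "101101", "100011", "010111"]
POS_EX11, NEG_EX11 = chunk_detectors(S_EX11, 3)
```

Note that (010, 2) is a negative detector even though 010 occurs in $s_1$ at positions 1 and 3: detectors are position-specific.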

Definition 1.3 Given a self set $S \subseteq \Sigma^\ell$, a string $d \in \Sigma^\ell$ is an r-contiguous detector if, for all $i \in \{1, \ldots, \ell - r + 1\}$, $d[i \ldots i + r - 1]$ does not occur in any $s \in S$ at position $i$.

Example 1.2 Let $\ell = 5$, $r = 3$. Given a set of 7 self strings $S$ = {01111, 00111, 10000, 10001, 10010, 10110, 11111}. The set of all 3-contiguous detectors is {01011, 11011}. This example is adapted from [32].
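Example 1.2 is small enough to verify by brute force. The sketch below tests every string of length $\ell$ against Definition 1.3, keeping those whose every length-r window is absent from the self set at the same position. The function name `contiguous_detectors` is hypothetical. The enumeration is exponential in $\ell$ and is meant only as a check, not as a practical generation algorithm.

```python
from itertools import product


def contiguous_detectors(S, r, alphabet="01"):
    """Return all r-contiguous detectors for the self set S by exhaustive search."""
    S = list(S)
    ell = len(S[0])
    n_windows = ell - r + 1
    # self_chunks[i] holds the chunks occurring in self strings at 0-based offset i.
    self_chunks = [{s[i:i + r] for s in S} for i in range(n_windows)]
    detectors = set()
    for chars in product(alphabet, repeat=ell):
        d = "".join(chars)
        if all(d[i:i + r] not in self_chunks[i] for i in range(n_windows)):
            detectors.add(d)
    return detectors


S_EX12 = ["01111", "00111", "10000", "10001", "10010", "10110", "11111"]
CONT_EX12 = contiguous_detectors(S_EX12, 3)
```

For the self set of Example 1.2 this yields exactly the two detectors 01011 and 11011 stated in the text.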

We also use the following notations:

$D_{pi} = \{(d, i) \mid (d, i) \text{ is a positive r-chunk detector}\}$ is the set of all positive r-chunk detectors at position $i$, $i = 1, \ldots, \ell - r + 1$.

$D_{ni} = \{(d, i) \mid (d, i) \text{ is a negative r-chunk detector}\}$ is the set of all negative r-chunk detectors at position $i$, $i = 1, \ldots, \ell - r + 1$.

$\mathrm{CHUNK}_p(S, r) = \bigcup_{i=1}^{\ell - r + 1} D_{pi}$ is the set of all positive r-chunk detectors.

$\mathrm{CHUNK}(S, r) = \bigcup_{i=1}^{\ell - r + 1} D_{ni}$ is the set of all negative r-chunk detectors.

$\mathrm{CONT}(S, r)$ is the set of all r-contiguous detectors that do not match any string in $S$.

For a given detector set $X$, $L(X)$ is the set of all nonself strings detected by $X$. We also say that $\Sigma^\ell \setminus L(X)$ is the set of all self strings detected by $X$.

Example 1.3 Let $\ell = 5$ and the matching threshold $r = 3$. Suppose that we have the set $S$ of six self strings $s_1 = 00000$, $s_2 = 00010$, $s_3 = 10110$, $s_4 = 10111$, $s_5 = 11000$, $s_6 = 11010$. Then $D_{p1}$ = {(000,1), (101,1), (110,1)} ($D_{p1}$ is the set of all leftmost substrings of length $r$ of all $s \in S$), $D_{n1}$ = {(001,1), (010,1), (011,1), (100,1), (111,1)}, $D_{p2}$ = {(000,2), (001,2), (011,2), (100,2), (101,2)}, $D_{n2}$ = {(010,2), (110,2), (111,2)}, $D_{p3}$ = {(000,3), (010,3), (110,3), (111,3)}, $D_{n3}$ = {(001,3), (011,3), (100,3), (101,3)} (note that $D_{pi} \cup D_{ni} = \Sigma^3$, $i = 1, 2, 3$).


In other words, a triple $(d, i, j)$ is an rcbvl

detectors $(d_1, i), \ldots, (d_{j-i+1}, j)$ that $d_k$, $d_{k+1}$

of r-chunk detectors set in Example 1.1 (45 bits)

The matching threshold $r$ plays an important role in selection algorithms. The value of $r$ can be used to balance between underfitting and overfitting. Our proposed methods in Chapter 5 investigate this value in combination with simple statistics for better detection performance.

1.5.4 Detection in r-chunk detector-based positive selection

It can be seen from Example 1.3 that $L(\mathrm{CHUNK}_p(S, r))$ = {00100, 00101, 01100, 01101, 11100, 11101} $\ne L(\mathrm{CHUNK}(S, r))$, so the detection coverage of $D_n$ is not the same as that of $D_p$. This is undesirable for the combination of PSA and


With this new detection semantic, the following theorem on the equivalence of detection coverage of r-chunk type PSA and NSA can be stated.

Theorem 1.1 (Detection Coverage) The detection coverage of positive and negative selection algorithms coincide:

$L(\mathrm{CHUNK}_p(S, r)) = L(\mathrm{CHUNK}(S, r))$

Proof. From the description of NSAs (see Fig. 1.4), if a new data instance matches a negative r-chunk detector, then it is claimed as nonself, otherwise it is claimed as self. Obviously, it is dual to the detection of new data instances in positive selection.
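The coverage mismatch noted for Example 1.3 and the equality in Theorem 1.1 can both be reproduced on that example. The definition of the new detection semantic falls outside this excerpt, so the sketch below rests on two assumptions. First, the original semantic flags a string as nonself only when it matches no positive detector at any position, which reproduces the six strings listed in the text. Second, the new semantic claims a string as self only when every one of its $\ell - r + 1$ windows matches a positive detector. All names in the code are illustrative.

```python
from itertools import product

S_EX13 = ["00000", "00010", "10110", "10111", "11000", "11010"]
ELL, R = 5, 3


def windows(s):
    """All (chunk, 1-based position) pairs of length R in s."""
    return [(s[i:i + R], i + 1) for i in range(ELL - R + 1)]


# Positive detectors: every chunk that occurs in a self string at its position.
POSITIVE = {w for s in S_EX13 for w in windows(s)}
UNIVERSE = ["".join(c) for c in product("01", repeat=ELL)]

# Assumed original PSA semantic: nonself if no positive detector matches.
psa_old = {u for u in UNIVERSE if not any(w in POSITIVE for w in windows(u))}

# NSA: nonself if some window is a negative detector (i.e. not positive).
nsa = {u for u in UNIVERSE if any(w not in POSITIVE for w in windows(u))}

# Assumed new PSA semantic: self only if every window is a positive detector.
psa_new = {u for u in UNIVERSE if not all(w in POSITIVE for w in windows(u))}
```

Under these assumptions `psa_old` contains only the six strings from the text, whereas `psa_new` and `nsa` are the same set, which is the duality the proof appeals to.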

Fig. 1.7 illustrates the existence of holes in a self and a nonself space comprised by self and detector strings. The string universe $\Sigma^\ell$ is a squared region. Each dark circle represents a detector and the grid shape in the middle is self. The universe is classified by the detector set as self (the grid region together with the holes, shown as the white region) and nonself (the dark region covered by all circles).

Figure 1.7: Existence of holes

In fact, holes are not a "problem", but as pointed out by Stibor et al. [78], they are a necessary property of the selection algorithms. Without holes, the algorithms would do nothing but naively memorize the training data for classification.

Fig. 1.8 shows a set of seven self strings as in Example 1.2, $S \subseteq \{0, 1\}^5$ (left), along with $\mathrm{CHUNK}(S, 3)$ (middle) and $\mathrm{CONT}(S, 3)$ (right). For both detector types, the induced bipartitionings of the shape space $\{0, 1\}^5$ are illustrated with strings that are classified as nonself having a gray background and strings that are classified as self having a white background. Bold strings are members of the self-set. Holes are the strings that are classified as self but do not occur in the self-set $S$ (non-bold, non-shaded strings). This figure is adapted from [32].
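The holes for the self set of Example 1.2 can be enumerated directly. The sketch below assumes the usual matching semantics: a string is classified as nonself by the r-chunk detectors if any of its windows is a negative detector, and by the r-contiguous detectors if it shares $r$ contiguous symbols at the same position with some detector. Variable and function names are illustrative.

```python
from itertools import product

S_SELF = {"01111", "00111", "10000", "10001", "10010", "10110", "11111"}
ELL, R = 5, 3
N_WIN = ELL - R + 1
UNIVERSE = {"".join(c) for c in product("01", repeat=ELL)}

# Chunks occurring in self strings at each 0-based offset.
SELF_CHUNKS = [{s[i:i + R] for s in S_SELF} for i in range(N_WIN)]


def chunk_nonself(u):
    """Nonself under CHUNK(S, 3): some window is a negative r-chunk detector."""
    return any(u[i:i + R] not in SELF_CHUNKS[i] for i in range(N_WIN))


# CONT(S, 3): strings whose every window is absent from the self set.
CONT = {u for u in UNIVERSE
        if all(u[i:i + R] not in SELF_CHUNKS[i] for i in range(N_WIN))}


def cont_nonself(u):
    """Nonself under CONT(S, 3): u shares R contiguous symbols with a detector."""
    return any(u[i:i + R] == d[i:i + R] for d in CONT for i in range(N_WIN))


# Holes: classified as self, yet not members of the self set.
holes_chunk = {u for u in UNIVERSE if not chunk_nonself(u)} - S_SELF
holes_cont = {u for u in UNIVERSE if not cont_nonself(u)} - S_SELF
```

On this example the r-chunk detectors leave four holes, all of which are also holes of the r-contiguous detectors, which leave fifteen.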

1.5.6 Performance metrics

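We used three metrics to evaluate each machine learning technique. Detection rate (DR), accuracy rate (ACC) and false alarm rate (FAR) are defined as:

$$\mathrm{DR} = \frac{TP}{TP + FN}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{FAR} = \frac{FP}{FP + TN}$$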


Figure 1.8: Negative selections with 3-chunk and 3-contiguous detectors

Here, TP (True Positive) is the number of true positives (correctly classified as nonself), TN (True Negative) is the number of true negatives (correctly classified as self), FP (False Positive) is the number of false positives (classified as nonself, actually self), and FN (False Negative) is the number of false negatives (classified as self, actually nonself).
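The three metrics follow directly from the four counts. The sketch below assumes the standard definitions DR = TP / (TP + FN), ACC = (TP + TN) / (TP + TN + FP + FN) and FAR = FP / (FP + TN). The function name and the counts in the usage line are hypothetical and are not results from the thesis.

```python
def detection_metrics(tp, tn, fp, fn):
    """Return (DR, ACC, FAR) from confusion-matrix counts, nonself being positive."""
    dr = tp / (tp + fn)                    # share of nonself instances detected
    acc = (tp + tn) / (tp + tn + fp + fn)  # share of all instances classified correctly
    far = fp / (fp + tn)                   # share of self instances flagged as nonself
    return dr, acc, far


# Hypothetical counts, for illustration only.
DR, ACC, FAR = detection_metrics(tp=90, tn=80, fp=20, fn=10)
```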

We used the 10-fold cross-validation technique and the holdout technique to evaluate our approaches in experiments. Regarding the former, the dataset was randomly partitioned into 10 subsets. Of the 10 subsets, a single subset was retained for testing, and the others were used as training data. The process was then repeated 10 times, with each of the 10 subsets used exactly once as the testing data. The 10 results from the folds were then averaged to produce a single performance estimate. Regarding the latter, the dataset is split into two groups: a training set used to train the classifier and a test set (or holdout set) used to estimate the performance of the classifier.
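The 10-fold procedure described above can be sketched with the standard library alone. The function name `k_fold_splits`, the fixed seed and the round-robin assignment of shuffled indices to folds are illustrative assumptions. The thesis does not state how the random partition was produced.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition, reproducible via seed
    folds = [indices[i::k] for i in range(k)]  # k disjoint subsets of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


SPLITS = list(k_fold_splits(25))
```

Each index appears in exactly one test fold, so averaging a metric over the ten `(train, test)` pairs gives the single performance figure mentioned in the text.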

1.5.7 Ring representation of data

As is well known, most AIS-based applications use two types of data representation: string and real-valued vector. For both popular types, representations are linear structures of symbols or numbers. They may omit information at the edges (the begin
