An Ensemble Feature Selection Algorithm for Machine Learning based Intrusion Detection System
Phuoc-Cuong Nguyen∗, Quoc-Trung Nguyen†, Kim-Hung Le‡ University of Information Technology, Vietnam National University Ho Chi Minh City
Ho Chi Minh, Vietnam
Email: ∗18520545@gm.uit.edu.vn, †18521553@gm.uit.edu.vn, ‡hunglk@uit.edu.vn
Abstract—In recent years, we have witnessed the significant growth of the Internet along with emerging security threats. A machine learning-based Intrusion Detection System (IDS) is widely employed to detect cyber attacks by continuously monitoring network traffic. However, the diversity of network features considerably affects the accuracy and training time of the IDS model. In this paper, a lightweight and effective feature selection algorithm for IDS is proposed. This algorithm combines the advantages of both the Random Forest and AdaBoost algorithms. The evaluation results on popular datasets (NSL-KDD, UNSW-NB15, and CICIDS-2017) show that our proposal outperforms existing feature selection algorithms regarding detection accuracy and the number of selected features.
Index Terms—Intrusion Detection System, feature selection algorithm, machine learning, Random Forest, AdaBoost
I. INTRODUCTION
The Internet has grown exponentially and transformed every aspect of our daily lives. According to IWS (Internet World Stats), from June 2017 to March 2021, Internet users increased from 3,885 million to 5,168 million. This is a significant number because it shows that 65.6% of the world's population uses the Internet [1]. For most people, these numbers are nothing special, but for cybercriminals they are a gold mine to exploit. A specific example is the COVID-19 pandemic: when it broke out, the number of Internet users skyrocketed as many jobs moved online, and attackers took advantage of this to exploit and attack Internet users [2]. According to McAfee's statistics on the global damage caused by cyberattacks, in 2018 the damage caused by cybercrime was $500 billion [3]. By 2020, this number had nearly doubled, to about $945 billion. Therefore, the demand for IDS systems is increasing day by day.
An IDS is a device or software running on Internet gateways to detect malicious activity or policy violations. Cyber-attacks usually target our digital properties: when an attack occurs, the attackers try to obtain confidential data, modify it, or make the system stop providing service. Therefore, an IDS plays an essential role in detecting and preventing large numbers of security threats and maintaining confidentiality, integrity, and availability [4]. However, most existing IDSs cannot handle complex and continuously varying attacks. Classical IDSs (e.g., firewalls, access control mechanisms) still have limitations in comprehensively protecting the network and system from complex attacks such as DDoS [5]. To solve this problem, applying machine learning is a promising solution because it can increase detection strength, reduce the false alarm rate, and better adapt to ensemble attacks.
An IDS has to deal with huge volumes of data, including false positives and incompatible or redundant features. These features not only slow down the detection process but also consume a significant amount of resources. Thereby, a new development direction using feature selection could increase accuracy and training and testing speed [6]. It also helps to solve some common problems in IDS by detecting confounding features, resulting in lower operating costs and storage space.
The filter method is a popular approach to selecting features for IDS. It selects the best sub-feature set using a technique called variable ranking: all features are scored by a suitable ranking criterion, and any feature whose score falls below a threshold is removed [7]. However, the performance of the filter method highly depends on the threshold, which is data-specific; choosing the correct threshold is therefore very challenging. To solve this issue, in this paper we introduce a novel feature selection algorithm that combines two ensemble algorithms: Random Forest and AdaBoost. At first, we select a sub-feature set S0, then apply the ensemble algorithms and use our criterion to evaluate it. We repeat this process until most cases are covered. Then we use a clustering algorithm to select the sub-feature sets that tend to provide the best performance.
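The filter approach described above can be sketched as follows. This is a minimal illustration, not the method of this paper: the scoring criterion (mutual information), the synthetic dataset, and the threshold value are all assumptions chosen for the example.

```python
# Filter-method sketch: rank features by a scoring criterion and keep
# only those at or above a threshold. Threshold choice is data-specific,
# which is exactly the difficulty the paper points out.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

scores = mutual_info_classif(X, y, random_state=0)  # one score per feature
threshold = 0.05  # illustrative; a poor threshold keeps noise or drops signal
selected = np.where(scores >= threshold)[0]
print("kept features:", selected)
```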
The rest of this paper is organized as follows. Related works are presented in Section II. We briefly introduce our proposal in Section III. Section IV describes the evaluated datasets and experiment results. In Section V, we conclude our work.
II. RELATED WORKS
Paulo M. Mafra et al. proposed an IDS based on Genetic and SVM algorithms that can improve SVM parameters [8]. In other words, the model uses a Genetic algorithm as the optimizer to maximize the performance of the SVM. The average detection accuracy rate recorded for this model is 80.14%. Similarly, the authors in [9] proposed a similar IDS model that uses a Genetic algorithm both to optimize the parameters of the SVM and as a feature selector; the fitness function developed for the Genetic algorithm in that paper evaluates chromosomes by maximum accuracy and minimum number of features. The authors of [10] proposed a wrapper-based feature selection method as a multi-objective optimization algorithm and used an unsupervised clustering method based on Growing Hierarchical Self-Organizing Maps (GHSOMs). SOM is known as one of the most widely used unsupervised artificial neural network models. The study selects 25 features and shows an accuracy rate of 99.12 ± 0.61% and an FP rate of 2.24 ± 0.41%. These papers show an improvement over IDS models with filter-based feature selection and over IDS without feature selection.
The authors in [11] developed an IDS that uses an ensemble classifier for feature selection. This framework combines the bat algorithm (BA) and correlation-based feature selection (CFS). In addition, the ensemble classifier is built using Random Forest and Forest by Penalizing Attributes. The tests use the CIC-IDS2017 dataset and various performance metrics, including false-positive rate, true-positive rate, detection rate, accuracy, false alarm rate, and the Matthews correlation coefficient. The results indicate that the combined CFS-BA approach yields a high accuracy of 96.76%, a detection rate of 94.04%, and a low false alarm rate of 2.38%.
Rikhtegar et al. developed a feature selection model based on an associative learning mechanism [12], combining IDS feature selection with a clustering mechanism. They employed the support vector machine (SVM) and the K-Medoids clustering algorithm in turn. In addition, the approach uses a Naive Bayes classifier and the KDD CUP99 dataset for evaluation. The proposed model is evaluated using three essential performance metrics: accuracy, detection rate, and false alarm rate, which are computed from TNs (true negatives), FPs (false positives), and FNs (false negatives). The test results are compared with three other feature selection methods and show that the proposed hybrid learning method produces better accuracy (91.5%), detection rate (90.1%), and false alarm rate (6.36%).
Shadi Aljawarneh et al. developed a hybrid model used to estimate the intrusion scope threshold degree based on the optimal features of the network transaction data made available for training [7]. The experiments were performed on the NSL-KDD dataset with significant results: the accuracy of the proposed model was 99.81% for the binary class and 98.56% for the multiclass case. However, this method has some issues with high false-positive and false-negative rates, which are addressed using a hybrid approach: a Vote algorithm with Information Gain and a hybrid classifier consisting of J48, Meta Pagging, RandomTree, REPTree, AdaBoostM1, DecisionStump, and Naive Bayes. The final result is excellent, with better accuracy, a low false-negative rate, and a low false-positive rate.
The authors of [13] proposed an IDS based on feature selection and a clustering algorithm using filter and wrapper methods, named the feature grouping based on linear correlation coefficient (FGLCC) algorithm and the cuttlefish algorithm (CFA), respectively. A decision tree is used as the classifier in the proposed method. For performance verification, the method was applied to the large KDD Cup 99 dataset. The results are impressive, with a high accuracy of 95.03%, a detection rate of 95.23%, and a low false-positive rate of 1.65%.
III. ENSEMBLE FEATURE SELECTION
The primary idea of our proposal is to combine the advantages of both Random Forest and AdaBoost to maximize feature selection quality. For a large dataset with many noisy samples, AdaBoost can be selected; otherwise, Random Forest is more suitable for a smaller, less noisy dataset. Compared with a deep-learning approach, our proposal achieves similar accuracy but is significantly faster. The algorithm is described in more detail below:
Proposed model's algorithm
Input: Dataset D
Output: Selected feature set Sbest
initialize: rounds R, Gbest[R], C[accuracy, running time]
for n = 1 to R:
    randomize F (F0, F1, ..., Fn-1)
    RF = RandomForest(D, F)
    AB = AdaBoost(D, F)
    Gn = compare(RF, AB)
    Gbest.append(Gn)
C = cluster(Gbest)
Sbest = MostFrequent(C[high accuracy, low running time])
return Sbest
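The loop above can be sketched in Python as follows. This is a simplified interpretation, not the authors' implementation: the number of rounds, subset size, scoring by cross-validation, and the final selection rule (best accuracy, ties broken by lower runtime, standing in for the clustering step) are all assumptions.

```python
# Sketch of the proposed loop: draw a random feature subset each round,
# score it with both Random Forest and AdaBoost, record the results, and
# finally pick the subset that performed best.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=6, random_state=0)
rng = np.random.default_rng(0)

candidates = []
for _ in range(5):  # rounds R (illustrative)
    subset = rng.choice(X.shape[1], size=8, replace=False)  # randomize F
    for clf in (RandomForestClassifier(n_estimators=50, random_state=0),
                AdaBoostClassifier(n_estimators=50, random_state=0)):
        start = time.perf_counter()
        acc = cross_val_score(clf, X[:, subset], y, cv=3).mean()
        candidates.append((acc, time.perf_counter() - start, subset))

# Stand-in for the clustering step: highest accuracy, then lowest runtime.
best_acc, best_time, s_best = max(candidates, key=lambda c: (c[0], -c[1]))
print("Sbest:", sorted(s_best))
```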
Random Forest is a tree-based algorithm that combines many decision trees into an ensemble. The primary purpose is to achieve much better accuracy than a single decision tree alone. Each tree is built from different sub-attributes and different sub-datasets; in this way, the trees protect each other from their individual errors. In detail, the algorithm can be described as follows:
• Step 1: Randomly select n features from the original feature set.
• Step 2: Build a decision tree based on the n selected features, then go back to Step 1.
• Step 3: Repeat until a large number of decision trees have been built.
• Step 4: Randomly select m decision trees from the tree system. For each test sample, each decision tree predicts which class the sample belongs to; the final result is the class that receives the most votes.
Random Forest is suitable when the dataset has many missing values. The downside of this algorithm is performance, since it deals with a large number of decision trees. The workflow is illustrated below:
[Figure: Random Forest workflow — training samples 1..n are drawn from the training set, each trains a decision tree (1..n), and the trees' predictions on the test set are combined by voting into the final prediction.]
AdaBoost is an ensemble algorithm designed to improve accuracy by adjusting the weights associated with each weak learner (such as a small decision tree) on the same dataset. The prediction uses a weighted majority vote over all weak learners in the system. The overall algorithm can be described as follows:
• Step 1: Select and build the weak learners (small decision trees); initialize the weight of each training sample to 1/N for N samples.
• Step 2: Select a sub training set and start training.
• Step 3: Decrease the weights of the samples that were predicted correctly in the last round.
• Step 4: Increase the weights of the samples that were predicted incorrectly.
• Step 5: Go back to Step 2 with these new weights.
By repeating steps 3 and 4, AdaBoost forces the weak learners to concentrate on the samples that were predicted incorrectly, improving overall accuracy.
AdaBoost takes less time to run. The trade-off is that its accuracy is not as good as Random Forest's, but we can use it in some cases to maximize the performance of our algorithm.
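The reweighting procedure above is what scikit-learn's AdaBoost implements. A minimal sketch on a synthetic dataset follows; the default weak learner is a depth-1 decision tree (a stump), matching the "small decision tree" described above, and the hyper-parameters are illustrative.

```python
# AdaBoost sketch: a sequence of weak learners is fit, each round
# reweighting the training samples so later learners focus on the
# examples earlier ones got wrong; prediction is a weighted vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=600, n_features=15, random_state=2)

boost = AdaBoostClassifier(
    n_estimators=100,  # number of weak learners (depth-1 trees by default)
    random_state=2,
).fit(X, y)

print(f"training accuracy: {boost.score(X, y):.3f}")
```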
More details about AdaBoost are given in the pseudo-code below:

The boosting algorithm AdaBoost
Given: (x1, y1), ..., (xm, ym) where xi ∈ X, yi ∈ {−1, +1}.
Initialize: D1(i) = 1/m for i = 1, ..., m.
For t = 1, ..., T:
    - Train a weak learner using distribution Dt.
    - Get a weak hypothesis ht: X → {−1, +1}.
    - Aim: select ht with low weighted error: εt = Pr_{i~Dt}[ht(xi) ≠ yi].
    - Choose αt = (1/2) ln((1 − εt) / εt).
    - Update, for i = 1, ..., m:
        D_{t+1}(i) = Dt(i) · exp(−αt · yi · ht(xi)) / Zt,
      where Zt is a normalization factor (chosen so that D_{t+1} is a distribution).
Output the final hypothesis: H(x) = sign(Σ_{t=1}^{T} αt ht(x)).

The process can be explained as follows. First, we are given a training set containing m samples, where each input xi is an element of the set X and each output yi belongs to a set of only two values: −1 (the negative class) and +1 (the positive class). Next, the weight of every sample is initialized to 1 divided by the number of training samples. Then, for each classifier t = 1 to T, we fit it to the training data (each prediction is −1 or +1) and choose the classifier with the lowest number of misclassifications.

IV. RESULTS AND DISCUSSION

A. Dataset

1) DARPA KDDCUP99 dataset: The DARPA dataset was originally developed in 1998, aiming to improve research and surveys in intrusion detection. KDDCUP99 is an upgraded version of the DARPA 1999 dataset, used to develop intrusion detection systems that distinguish between bad and good connections. The dataset is mainly designed for detecting intrusions in a network through simulation in a military environment. KDDCUP99 was designed and developed using the DARPA98 IDS and is used to simulate four different types of attacks. These attacks can be classified into four main categories:
• Denial of Service attacks (DoS): The attacker tries to limit network usage by disrupting service availability to intended users.
• User to Root attacks (U2R): This type of attack occurs when an attacker gains access through a normal user account and tries to gain root access through system vulnerabilities.
• Remote to Local attacks (R2L): The attacker does not have an account on the local system but tries to gain access by sending network packets that exploit vulnerabilities, gaining access as a local user.
• Probing attacks: This happens when an attacker scans a network to gather information about the system in order to use it to evade system security controls.
TABLE I
KDDCUP99 TRAINING DATASET DISTRIBUTION

Class    # of instances   Percentage (%)
Normal   97,277           19.69
Total    494,019          100
2) NSL-KDD dataset: The NSL-KDD dataset is an improved version of KDDCUP99, suggested to solve some of the problems of KDDCUP99. NSL-KDD has a reasonable number of records in its training and testing sets and has the same features as the original KDDCUP99. It is an effective standard by which researchers compare their proposed IDSs. Some of the improvements of this dataset are:
• There are no redundant records in the training set, so the classifier will not produce any misleading results.
• There are no duplicate records in the testing set, which results in a better reduction ratio.
• The number of selected records from each group of difficulty levels is inversely proportional to the percentage of records in the original KDD dataset.
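The no-duplicate property above can be checked with a simple deduplication pass. In this sketch a tiny synthetic frame stands in for the real NSL-KDD file; the column names are invented for illustration.

```python
# Sketch of the duplicate-record check behind NSL-KDD's cleanup of
# KDDCUP99; the columns and values here are placeholders, not real data.
import pandas as pd

df = pd.DataFrame({
    "duration": [0, 0, 5, 0],
    "src_bytes": [181, 181, 239, 181],
    "label": ["normal", "normal", "neptune", "normal"],
})
deduped = df.drop_duplicates()
print(f"{len(df) - len(deduped)} duplicate rows removed")
```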
3) UNSW-NB15 dataset: The raw network packets of the UNSW-NB15 dataset were created with the IXIA PerfectStorm tool in the Cyber Range Lab of UNSW Canberra to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviors. The attack activities in the network traffic include Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. There are 49 features, developed using the Argus and Bro-IDS tools together with twelve algorithms that cover characteristics of network packets. In contrast, existing benchmark datasets such as KDD98, KDDCUP99, and NSL-KDD contain a limited number of attacks and information about outdated packets. The tcpdump tool was used to capture network traffic in the form of packets. To capture 100 GB, the simulation period was 16 hours on Jan 22, 2015 and 15 hours on Feb 17, 2015. Each pcap file was divided into 1000 MB chunks using the tcpdump tool. To create reliable features from the pcap files, the Argus and Bro-IDS tools were used, and twelve algorithms were developed in C# to analyze the flows of the connection packets in depth.
B. Results
In this section, we show the results of using our proposed model on three datasets: NSL-KDD, UNSW-NB15, and CICIDS-2017.
TABLE II
THE COMPARISON RESULTS OF THE TEST MODEL RUNNING WITH ALL

# of features   Loss   Accuracy   Time/step   Time/epoch

TABLE III
THE COMPARISON RESULTS OF THE TEST MODEL RUNNING WITH ALL

# of features   Loss   Accuracy   Time/step   Time/epoch
As shown in Tables II, III, and IV, our proposal improves the model performance (accuracy, loss, and training time) on all datasets. For example, with the CICIDS2017 dataset, the accuracy is stable, but the training time of the algorithm is significantly reduced from 1600 seconds to 535 seconds. In contrast, with the UNSW-NB15 dataset, although the training time did not decrease, the algorithm's accuracy improved from 0.6924 to 0.7774. In the case of the NSL-KDD dataset, the result shows an improvement in both accuracy and training time.

TABLE IV
THE COMPARISON RESULTS OF THE TEST MODEL RUNNING WITH ALL

# of features   Loss   Accuracy   Time/step   Time/epoch
To demonstrate the effectiveness of our proposal over state-of-the-art methods, we compare it with Sigmoid PIO and Cosine PIO [14], the newest feature selection methods. As shown in Tables V and VI, our method is more effective than PIO on the KDDCUP99 dataset: only six features are selected, yet the accuracy is still guaranteed. Similarly, on the NSL-KDD dataset, the number of selected features is more than for the Cosine PIO method but fewer than for the Sigmoid PIO, while the accuracy is much higher than for both methods. These comparison results again demonstrate the effectiveness of our proposal.
V. CONCLUSION
In this paper, we introduced an ensemble feature selection algorithm for IDS. This algorithm aims at boosting the detection accuracy of IDS while reducing the number of network features required to build the IDS model. Evaluation on the NSL-KDD, UNSW-NB15, and CICIDS-2017 datasets shows that our proposal could reduce the number of features from 83 to 20, 43 to 14, and 40 to 14, respectively. As a result, the model training time is significantly reduced. Furthermore, the proposed algorithm is more effective than related works, as demonstrated by achieving better accuracy with fewer selected features.
REFERENCES
[1] R. Morrar, H. Arman, and S. Mousa, "The fourth industrial revolution (industry 4.0): A social innovation perspective," Technology Innovation Management Review, vol. 7, no. 11, pp. 12–20, 2017.
[2] R. P. Singh, M. Javaid, A. Haleem, and R. Suman, "Internet of things (IoT) applications to fight against COVID-19 pandemic," Diabetes & Metabolic Syndrome: Clinical Research & Reviews, vol. 14, no. 4, pp. 521–524, 2020.
[3] H. S. Brar and G. Kumar, "Cybercrimes: A proposed taxonomy and challenges," Journal of Computer Networks and Communications, vol. 2018, 2018.
[4] K. Khan, A. Mehmood, S. Khan, M. A. Khan, Z. Iqbal, and W. K. Mashwani, "A survey on intrusion detection and prevention in wireless ad-hoc networks," Journal of Systems Architecture, vol. 105, p. 101701, 2020.
[5] A. Aldweesh, A. Derhab, and A. Z. Emam, "Deep learning approaches for anomaly-based intrusion detection systems: A survey, taxonomy, and open issues," Knowledge-Based Systems, vol. 189, p. 105124, 2020.
[6] S. Maza and M. Touahria, "Feature selection algorithms in intrusion detection system: A survey," KSII Transactions on Internet and Information Systems (TIIS), vol. 12, no. 10, pp. 5079–5099, 2018.
TABLE V
THE COMPARISON RESULTS WITH THE DECISION TREES MODEL ON THE KDDCUP99 DATASET

Technique     # of features   Accuracy   Selected features
Sigmoid PIO   10              0.869      [3, 4, 6, 11, 13, 18, 23, 36, 37, 39]
Cosine PIO    7               0.883      [3, 4, 6, 13, 23, 29, 34]
TABLE VI
THE COMPARISON RESULTS WITH THE DECISION TREES MODEL ON THE NSL-KDD DATASET

Technique     # of features   Accuracy   Selected features
Sigmoid PIO   18              0.869      [1, 3, 4, 5, 6, 8, 10, 11, 12, 13, 14, 15, 17, 18, 27, 32, 36, 39, 41]
Cosine PIO    5               0.883      [2, 6, 10, 22, 27]
Our method    14              0.980      [2, 3, 6, 8, 10, 11, 13, 23, 27, 30, 32, 35, 36, 39]
[7] S. Aljawarneh, M. Aldwairi, and M. B. Yassein, "Anomaly-based intrusion detection system through feature selection analysis and building hybrid efficient model," Journal of Computational Science, vol. 25, pp. 152–160, 2018.
[8] P. M. Mafra, V. Moll, J. da Silva Fraga, and A. O. Santin, "Octopus-IIDS: An anomaly based intelligent intrusion detection system," in The IEEE Symposium on Computers and Communications. IEEE, 2010, pp. 405–410.
[9] L. Zhuo, J. Zheng, X. Li, F. Wang, B. Ai, and J. Qian, "A genetic algorithm based wrapper feature selection method for classification of hyperspectral images using support vector machine," in Geoinformatics 2008 and Joint Conference on GIS and Built Environment: Classification of Remote Sensing Images, vol. 7147. International Society for Optics and Photonics, 2008, p. 71471J.
[10] E. De la Hoz, E. De La Hoz, A. Ortiz, J. Ortega, and A. Martínez-Álvarez, "Feature selection by multi-objective optimisation: Application to network anomaly detection by hierarchical self-organising maps," Knowledge-Based Systems, vol. 71, pp. 322–338, 2014.
[11] Y. Zhou, G. Cheng, S. Jiang, and M. Dai, "Building an efficient intrusion detection system based on feature selection and ensemble classifier," Computer Networks, vol. 174, p. 107247, 2020.
[12] L. Khalvati, M. Keshtgary, and N. Rikhtegar, "Intrusion detection based on a novel hybrid learning approach," Journal of AI and Data Mining, vol. 6, no. 1, pp. 157–162, 2018.
[13] S. Mohammadi, H. Mirvaziri, M. Ghazizadeh-Ahsaee, and H. Karimipour, "Cyber intrusion detection by combined feature selection algorithm," Journal of Information Security and Applications, vol. 44, pp. 80–88, 2019.
[14] H. Alazzam, A. Sharieh, and K. E. Sabri, "A feature selection algorithm for intrusion detection system based on pigeon inspired optimizer," Expert Systems with Applications, vol. 148, p. 113249, 2020.